WO2022142853A1 - Sound source localization method and device - Google Patents

Sound source localization method and device

Info

Publication number
WO2022142853A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound source
angle
microphone array
radar
data
Prior art date
Application number
PCT/CN2021/132081
Other languages
English (en)
French (fr)
Inventor
应冬文
况丹妮
贺亚农
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Publication of WO2022142853A1
Priority to US 18/215,486, published as US20230333205A1

Classifications

    • G PHYSICS
      • G01 MEASURING; TESTING
        • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
          • G01S 3/00 Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic, or electromagnetic waves, or particle emission, not having a directional significance, are being received
            • G01S 3/80 Direction-finders using ultrasonic, sonic or infrasonic waves
              • G01S 3/86 Direction-finders using ultrasonic, sonic or infrasonic waves with means for eliminating undesired waves, e.g. disturbing noises
          • G01S 5/00 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
            • G01S 5/02 Position-fixing using radio waves
              • G01S 5/0257 Hybrid positioning
                • G01S 5/0263 Hybrid positioning by combining or switching between positions derived from two or more separate positioning systems
                  • G01S 5/0264 Hybrid positioning by combining or switching between positions derived from two or more separate positioning systems, at least one of the systems being a non-radio wave positioning system
            • G01S 5/18 Position-fixing using ultrasonic, sonic, or infrasonic waves
              • G01S 5/22 Position of source determined by co-ordinating a plurality of position lines defined by path-difference measurements
          • G01S 11/00 Systems for determining distance or velocity not using reflection or reradiation
            • G01S 11/14 Systems for determining distance or velocity not using reflection or reradiation, using ultrasonic, sonic, or infrasonic waves
          • G01S 13/00 Systems using the reflection or reradiation of radio waves, e.g. radar systems; Analogous systems using reflection or reradiation of waves whose nature or wavelength is irrelevant or unspecified
            • G01S 13/02 Systems using reflection of radio waves, e.g. primary radar systems; Analogous systems
              • G01S 13/06 Systems determining position data of a target
                • G01S 13/42 Simultaneous measurement of distance and other co-ordinates
            • G01S 13/86 Combinations of radar systems with non-radar systems, e.g. sonar, direction finder
              • G01S 13/862 Combination of radar systems with sonar systems
      • G10 MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L 15/00 Speech recognition
            • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
          • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
            • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
              • G10L 21/0208 Noise filtering
                • G10L 21/0216 Noise filtering characterised by the method used for estimating noise
                  • G10L 2021/02161 Number of inputs available containing the signal or the noise to be suppressed
                    • G10L 2021/02166 Microphone arrays; Beamforming
                • G10L 2021/02082 Noise filtering, the noise being echo, reverberation of the speech
              • G10L 21/0272 Voice signal separating
                • G10L 21/028 Voice signal separating using properties of sound source
    • H ELECTRICITY
      • H04 ELECTRIC COMMUNICATION TECHNIQUE
        • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
          • H04R 1/00 Details of transducers, loudspeakers or microphones
            • H04R 1/20 Arrangements for obtaining desired frequency or directional characteristics
              • H04R 1/32 Arrangements for obtaining desired directional characteristic only
                • H04R 1/40 Arrangements for obtaining desired directional characteristic only by combining a number of identical transducers
                  • H04R 1/406 Arrangements for obtaining desired directional characteristic only by combining a number of identical transducers: microphones
          • H04R 3/00 Circuits for transducers, loudspeakers or microphones
            • H04R 3/005 Circuits for transducers, loudspeakers or microphones for combining the signals of two or more microphones
          • H04R 2499/00 Aspects covered by H04R or H04S not otherwise provided for in their subgroups
            • H04R 2499/10 General applications
              • H04R 2499/15 Transducers incorporated in visual displaying devices, e.g. televisions, computer displays, laptops

Definitions

  • the present application relates to the field of artificial intelligence, and in particular, to a sound source localization method and device.
  • Voice interaction is widely used in smart conferences and home products.
  • the primary problem is to pick up voice signals in noisy environments to prevent environmental noise and indoor reverberation from interfering with target voice signals.
  • Microphone array-based beamforming can accurately pick up voice signals and is widely used in various voice interaction products. It can effectively suppress ambient noise and suppress indoor reverberation without significantly damaging the voice.
  • Beamforming depends on accurate estimation of the orientation of the voice source; adaptive beamforming in particular is very sensitive to the orientation of the sound source, and a position deviation of a few degrees can easily lead to a significant decrease in pickup performance. Therefore, how to achieve accurate sound source localization has become an urgent problem to be solved.
  • the present application provides a method and device for locating a sound source, which are used for accurately locating a sound source in combination with a microphone array and a radar.
  • the present application provides a method for locating a sound source, including: acquiring first position information through radar echo data, where the first position information includes a first angle of an object relative to the radar; collecting through a microphone array The received voice signal obtains the incident angle, the incident angle is the angle at which the voice signal is incident on the microphone array; the first angle and the incident angle are fused to obtain the second position information, and the second position information uses to represent the location of the sound source that produced the speech signal.
  • The position of the object detected by the radar and the incident angle detected by the microphone array can be combined to obtain the position of the sound source relative to the microphone array, so that the beam used to separate the voice of the sound source can be controlled and turned on based on this position, and the voice data of the sound source can be accurately extracted from the data collected by the microphone array.
  • the position of the sound source can be accurately determined, and the voice data of the sound source can be extracted more accurately.
  • The fusion of the first angle and the incident angle may include: respectively determining a first weight corresponding to the first angle and a second weight corresponding to the incident angle, where the first weight is positively correlated with the moving speed of the object relative to the radar and the second weight is negatively correlated with that moving speed; and performing weighted fusion of the first angle and the incident angle according to the first weight and the second weight to obtain a fusion angle, the second position information including the fusion angle.
  • the moving speed of the object can be considered to determine the weight, so that various object motion situations can be used to improve the accuracy of the fusion angle.
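  • As an illustration only (not the claimed implementation), a minimal sketch of such a speed-dependent weighted fusion is shown below; the logistic mapping from speed to weight and the constant k are assumptions.

```python
import math

def fuse_angles(radar_angle_deg, incident_angle_deg, speed_mps, k=1.0):
    """Weighted fusion of the radar angle and the microphone-array incident angle.

    The radar weight grows with the object's speed relative to the radar,
    and the microphone weight shrinks accordingly, matching the positive /
    negative correlations described above. The logistic mapping and the
    constant k are illustrative assumptions, not taken from the application.
    """
    w_radar = 1.0 / (1.0 + math.exp(-k * speed_mps))  # first weight, rises with speed
    w_mic = 1.0 - w_radar                             # second weight, falls with speed
    fused = w_radar * radar_angle_deg + w_mic * incident_angle_deg
    return fused, w_radar, w_mic

# Example: radar reports 30.0 deg, microphone array reports 33.0 deg, speed 0.5 m/s
fusion_angle, w1, w2 = fuse_angles(30.0, 33.0, 0.5)
```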
  • the method further includes: extracting voice data of the sound source from the voice signal collected by the microphone array based on the second position information.
  • the voice data of the sound source can be accurately extracted from the data collected by the microphone array based on the exact position of the sound source.
  • The extracting of the voice data of the sound source from the voice signal collected by the microphone array based on the second position information includes: using the data collected by the microphone array as the input of a preset beam separation network, and outputting the voice data of the sound source.
  • the voice data of the sound source can be separated from the data collected by the microphone array through the beam separation network, that is, the voice data in the direction corresponding to the sound source can be extracted by beamforming , so as to obtain more accurate speech data within the sound source.
  • The beam separation network includes a speech separation model, and the speech separation model is used to separate the speech data and the background data of a sound source in the input data. The method further includes: determining the moving speed of the sound source according to the echo data; and updating the speech separation model according to the moving speed to obtain an updated speech separation model.
  • the voice separation model can be adaptively updated in combination with the motion speed of the sound source, so that the voice separation model matches the motion situation of the sound source, and can adapt to the scene where the sound source moves rapidly, so that the The voice data of the sound source is separated from the collected data.
  • the updating the speech separation model according to the moving speed includes: determining a parameter set of the speech separation model according to the moving speed to obtain the updated speech separation model, wherein, the parameter set is related to the rate of change of the parameters of the speech separation model, and the moving speed and the rate of change are positively correlated.
  • Slow parameter change can improve the stability of the model and reduce model jitter, while fast change is conducive to quickly adapting to changes in the environment. Therefore, the rate of change of the model parameters can be selected according to the movement speed of the target, thereby determining the parameter set of the speech separation model and obtaining the updated speech separation model.
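  • Purely as an illustrative sketch (not the application's algorithm), the speed-dependent rate of change could be realized as a step size for an exponential moving average of the model parameters; the linear speed-to-step mapping and its bounds are assumptions.

```python
import numpy as np

def update_separation_params(old_params, new_estimate, speed_mps,
                             alpha_min=0.05, alpha_max=0.5):
    """Blend the speech-separation model parameters toward a new estimate.

    The step size (rate of change) grows with the sound source's moving
    speed: a slow source keeps the model stable with little jitter, while a
    fast source makes the model adapt quickly. The linear speed-to-step
    mapping and the bounds are illustrative assumptions.
    """
    alpha = float(np.clip(alpha_min + 0.1 * speed_mps, alpha_min, alpha_max))
    return (1.0 - alpha) * np.asarray(old_params) + alpha * np.asarray(new_estimate)
```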
  • the beam separation network further includes a de-reverberation model, and the de-reverberation model is used to filter out the reverberation signal in the input data; the method further includes: according to the object and the distance between the radar, update the de-reverberation model, and obtain the updated de-reverberation model.
  • The reverberation in the data collected by the microphone array can be removed by the de-reverberation model, so that the speech data of the sound source separated by the speech separation model is more accurate.
  • The updating of the de-reverberation model according to the distance between the object and the radar includes: updating the delay parameter and the prediction order in the de-reverberation model according to the distance between the object and the radar to obtain the updated de-reverberation model, where the delay parameter represents the length of time by which the reverberation signal lags behind the speech data of the sound source, the prediction order represents the duration of the reverberation, and both the delay parameter and the prediction order are positively correlated with the distance.
  • the distance between the sound source and the microphone array significantly affects the reverberation of the signal received by the microphone.
  • When the sound source is far from the microphone array, the voice signal from the sound source propagates farther and attenuates more, while the indoor reverberation remains unchanged, so the reverberation interferes with the voice signal more and the reverberation duration is longer.
  • When the sound source is close, the speech signal has a short propagation distance, the attenuation is small, and the influence of the reverberation is weakened. Therefore, the parameters of the de-reverberation model can be adjusted based on the distance between the sound source and the microphone array, and when the sound source is close enough, de-reverberation can even be stopped to improve the quality of the resulting speech data.
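  • The following is a hedged sketch of how such distance-dependent parameters might be chosen for a linear-prediction style de-reverberation filter; the thresholds, frame shift and numeric values are assumptions, not values from the application.

```python
def dereverb_config(distance_m, frame_shift_ms=16.0):
    """Choose de-reverberation parameters from the source-to-array distance.

    Both the delay (how far the late reverberation lags the direct speech)
    and the prediction order (how long a reverberation tail is modelled)
    grow with distance; for a very close source de-reverberation may simply
    be switched off. The thresholds and numeric values are assumptions.
    """
    if distance_m < 0.5:                 # close talker: reverberation is negligible
        return None                      # signal that de-reverberation can be stopped
    delay_frames = 2 if distance_m < 2.0 else 3
    prediction_order = int(min(10 + 5 * distance_m, 40))  # longer tail when farther
    return {"delay_frames": delay_frames,
            "prediction_order": prediction_order,
            "frame_shift_ms": frame_shift_ms}
```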
  • the method further includes: if the voice data of the sound source does not meet a preset condition, removing a beam used for processing the voice signal collected by the microphone array.
  • That is, if the voice data of the sound source does not meet the preset condition, the beam used for processing the voice signal collected by the microphone array is removed, so as to avoid collecting meaningless data.
  • the method further includes: extracting features from the speech data to obtain acoustic features of the sound source; identifying a first probability that the sound source is a living body according to the acoustic features; According to the echo data of the radar, determine the second probability that the sound source is a living body; fuse the first probability and the second probability to obtain a fusion result, and the fusion result is used to represent the sound source Whether the source is alive.
  • the embodiment of the present application it is also possible to detect whether the sound source is a living body, so that the user can clearly know whether the type of the object currently emitting sound is a living body, and the user experience is improved.
  • The obtaining of the incident angle from the voice signal collected by the microphone array includes: if multiple second angles are obtained from the voice signal collected by the microphone array, and the first angle and the multiple second angles are in the same coordinate system, selecting, from the multiple second angles, the angle with the smallest difference from the first angle, or an angle whose difference from the first angle is within a first preset range, as the incident angle.
  • multiple angles can be collected through the microphone array.
  • the angle collected by the radar can be combined, and the angle closest to the sound source can be selected as the incident angle, so as to improve the accuracy of obtaining the incident angle.
  • the method further includes: if a plurality of third angles are obtained based on the data collected again by the microphone array, then: Based on the moving speed of the object, an angle is selected from the plurality of third angles as the new incident angle.
  • a new incident angle can be selected from the multiple angles based on the moving speed of the object, so as to adapt to the situation that the position of the sound source is constantly changing.
  • Selecting a third angle from the multiple angles as the new incident angle based on the moving speed of the object includes: if the moving speed of the object is greater than a preset speed, selecting, from the plurality of third angles, an angle whose difference from the first angle is within the second preset range as the new angle of incidence; if the moving speed of the object is not greater than the preset speed, selecting, from the plurality of third angles, an angle whose difference from the first angle is within the third preset range as the new angle of incidence, where the third preset range covers and is larger than the second preset range.
  • In this way, when the moving speed of the object is large, a new angle can be selected from a farther position as the incident angle, and when the speed is slow, a new angle can also be selected from a closer position as the incident angle, adapting to the situation in which the position of the object is constantly changing, so the generalization ability is strong.
  • the method further includes: if the first position information does not include the first angle, taking the incident angle as the angle of the sound source relative to the microphone array .
  • the angle of the object relative to the radar may not be detected by the radar.
  • In this case, the incident angle obtained by the microphone array can be directly used as the angle of the sound source relative to the microphone array, so that even if the object is not moving, the sound source can still be accurately detected, improving the position detection accuracy for the sound source.
  • Before acquiring the incident angle of the speech signal collected by the microphone array, the method further includes: if it is determined from the echo data that the object is in a moving state and the object does not emit sound, adjusting the sound source detection threshold of the microphone array for the object, the microphone array being used to collect signals whose sound pressure is higher than the sound source detection threshold.
  • For example, the sound source detection threshold can be lowered, which is equivalent to increasing the attention paid to whether this object produces sound, so that it can be quickly detected whether the sound source starts producing sound.
  • The first position information further includes a first relative distance between the object and the radar. The method further includes: further obtaining, from the voice signal collected by the microphone array, a second relative distance between the object and the microphone array; and fusing the first relative distance and the second relative distance to obtain a fusion distance, where the fusion distance represents the distance of the sound source relative to the microphone array, and the second position information further includes the fusion distance.
  • In this way, the distance obtained from the microphone array and the distance collected by the radar can be fused to obtain the distance of the sound source relative to the microphone array or the radar, so as to carry out subsequent operations, such as updating the beam separation network, and improve the accuracy of the voice data separated for the sound source.
  • the present application provides a sound source localization device, comprising:
  • a radar positioning module configured to obtain first position information through radar echo data, where the first position information includes a first angle of an object relative to the radar;
  • a microphone array positioning module configured to obtain an incident angle from the voice signal collected by the microphone array, where the incident angle is the angle at which the voice signal is incident on the microphone array;
  • a sound source localization module configured to fuse the first angle and the incident angle to obtain second position information, where the second position information is used to identify the position of the sound source that generates the speech signal.
  • The sound source localization module is specifically configured to respectively determine a first weight corresponding to the first angle and a second weight corresponding to the incident angle, where the first weight is positively correlated with the moving speed of the object relative to the radar and the second weight is negatively correlated with that moving speed; and to perform weighted fusion of the first angle and the incident angle according to the first weight and the second weight to obtain a fusion angle, the second position information including the fusion angle.
  • the device further includes:
  • a voice separation module configured to extract voice data of the sound source from the voice signals collected by the microphone array based on the second position information.
  • the speech separation module is specifically configured to use the data collected by the microphone array as the input of a preset beam separation network, and output the speech data of the sound source.
  • the beam separation network includes a speech separation model, and the speech separation model is used to separate speech data and background data of a sound source in the input data, and the apparatus further includes:
  • an update module configured to determine the moving speed of the sound source according to the echo data; update the voice separation model according to the moving speed to obtain the updated voice separation model.
  • the updating module is specifically configured to determine a parameter set of the speech separation model according to the moving speed, and obtain the updated speech separation model, wherein the parameter set and the The rate of change of the parameters of the speech separation model is correlated, and the moving speed and the rate of change are positively correlated.
  • the beam separation network further includes a de-reverberation model, and the de-reverberation model is used to filter out the reverberation signal in the input data;
  • the updating module is further configured to update the de-reverberation model according to the distance between the object and the radar to obtain the updated de-reverberation model.
  • The updating module is specifically configured to update the delay parameter and the prediction order in the de-reverberation model according to the distance between the object and the radar, to obtain the updated de-reverberation model, where the delay parameter represents the length of time by which the reverberation signal lags behind the speech data of the sound source, the prediction order represents the duration of the reverberation, and both the delay parameter and the prediction order are positively correlated with the distance.
  • The voice separation module is further configured to, if the voice data of the sound source does not meet a preset condition, remove the beam used for processing the data collected by the microphone array for the sound source.
  • the apparatus further includes a living body detection unit, configured to: extract features from the speech data to obtain the acoustic features of the sound source; identify the sound source according to the acoustic features as The first probability of a living body; the second probability that the sound source is a living body is determined according to the echo data of the radar; the first probability and the second probability are fused to obtain a fusion result, the fusion result Used to indicate whether the sound source is a living body.
  • The microphone array positioning module is specifically configured to obtain a plurality of second angles from the voice signals collected by the microphone array, where the first angle and the plurality of second angles are in the same coordinate system, and to select, from the plurality of second angles, the angle with the smallest difference from the first angle, or an angle whose difference is within the first preset range, as the incident angle.
  • The microphone array positioning module is specifically configured to, after acquiring the incident angle of the speech signal collected by the microphone array, if a plurality of third angles are obtained based on the data collected again by the microphone array, select an angle from the plurality of third angles as the new incident angle based on the moving speed of the object.
  • The microphone array positioning module is specifically configured to: if the moving speed of the object is greater than a preset speed, select, from the plurality of third angles, an angle whose difference from the first angle is within the second preset range as the new incident angle; and if the moving speed of the object is not greater than the preset speed, select, from the plurality of third angles, an angle whose difference from the first angle is within a third preset range as the new incident angle, where the third preset range covers and is larger than the second preset range.
  • The sound source localization module is further configured to use the incident angle as the angle of the sound source relative to the microphone array if the first position information does not include the first angle.
  • The sound source localization module is further configured to, before the incident angle of the speech signal collected through the microphone array is acquired, adjust the sound source detection threshold of the microphone array for the object if it is determined through the echo data that the object is in a moving state and the object does not emit sound, the microphone array being used to collect signals whose sound pressure is higher than the sound source detection threshold.
  • The first position information further includes a first relative distance between the object and the radar, and the sound source localization module is further configured to: obtain, from the voice signal collected through the microphone array, a second relative distance between the object and the microphone array; and fuse the first relative distance and the second relative distance to obtain a fusion distance, where the fusion distance represents the distance of the sound source relative to the microphone array, and the second position information further includes the fusion distance.
  • an embodiment of the present application provides a sound source localization device, and the sound source localization device has the function of implementing the sound source localization method of the first aspect.
  • This function can be implemented by hardware or by executing corresponding software by hardware.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • An embodiment of the present application provides a sound source localization device, including a processor and a memory, where the processor and the memory are interconnected through a line, and the processor invokes program code in the memory to perform the processing-related functions in the sound source localization method shown in any one of the above first aspect. Optionally, the sound source localization device may be a chip.
  • an embodiment of the present application provides a sound source localization device.
  • the sound source localization device may also be called a digital processing chip or a chip.
  • the chip includes a processing unit and a communication interface.
  • the processing unit obtains program instructions through the communication interface.
  • the instructions are executed by a processing unit, and the processing unit is configured to perform processing-related functions as in the first aspect or any of the optional embodiments of the first aspect.
  • an embodiment of the present application provides a computer-readable storage medium, including instructions, which, when executed on a computer, cause the computer to execute the method in the first aspect or any optional implementation manner of the first aspect.
  • an embodiment of the present application provides a computer program product including instructions, which, when run on a computer, enables the computer to execute the method in the first aspect or any optional implementation manner of the first aspect.
  • The present application provides a terminal. The terminal includes a radar and a processor, the radar and the processor are connected, the processor can be used to execute the method in the first aspect or any optional implementation manner of the first aspect, and the radar is used to collect echo data.
  • The present application provides a sound pickup device. The sound pickup device includes a radar, a microphone array and a processor; the radar may be the radar mentioned in the first aspect, the microphone array may be the microphone array mentioned in the first aspect, and the processor may be configured to execute the method in the first aspect or any optional implementation manner of the first aspect.
  • The sound pickup device may include devices such as octopus conference equipment, Internet of Things (IoT) devices, or intelligent robots.
  • FIG. 1A is a schematic structural diagram of a sound source localization device provided by the present application.
  • FIG. 1B is a schematic structural diagram of a radar provided by the present application.
  • FIG. 1C is a schematic diagram of an application scenario provided by the present application.
  • FIG. 1D is a schematic diagram of another application scenario provided by the present application.
  • FIG. 1E is a schematic diagram of another application scenario provided by the present application.
  • FIG. 2 is a schematic flowchart of a sound source localization method provided by the present application.
  • FIG. 3 is a schematic diagram of an angle provided by the present application.
  • FIG. 4 is a schematic structural diagram of another sound source localization device provided by the present application.
  • Fig. 5 is another perspective schematic diagram provided by this application.
  • Fig. 6 is another perspective schematic diagram provided by this application.
  • FIG. 7A is a schematic diagram of another application scenario provided by the present application.
  • FIG. 7B is a schematic diagram of another application scenario provided by the present application.
  • FIG. 8 is a schematic flowchart of another sound source localization method provided by the present application.
  • FIG. 9 is a schematic diagram of another application scenario provided by the present application.
  • FIG. 10A is a schematic diagram of another application scenario provided by the present application.
  • FIG. 10B is a schematic diagram of another application scenario provided by the present application.
  • FIG. 11 is a schematic flowchart of another sound source localization method provided by the present application.
  • FIG. 12 is a schematic diagram of another application scenario provided by the present application.
  • FIG. 13 is a schematic flowchart of another sound source localization method provided by the present application.
  • FIG. 14 is a schematic structural diagram of another sound source localization device provided by the present application.
  • FIG. 15 is a schematic structural diagram of another sound source localization device provided by the present application.
  • FIG. 16 is a schematic structural diagram of a chip provided by the present application.
  • the sound source localization method provided in this application can be performed by a sound pickup device, and is applied to various scenarios where sound pickup is required, for example, a video call, a voice call, a multi-person conference, recording or video recording and other scenarios.
  • The sound source localization device may include various terminals that can pick up sound, and the terminals may include smart mobile phones, TVs, tablet computers, wristbands, head-mounted display (HMD) devices, augmented reality (AR) devices, mixed reality (MR) devices, cellular phones, smart phones, personal digital assistants (PDAs), in-vehicle electronic devices, laptop computers, personal computers (PCs), monitoring equipment, robots, in-vehicle terminals, wearable devices, autonomous vehicles, or the like.
  • the structure of the sound source localization device may be as shown in FIG. 1A , and the sound source localization device 10 may include a radar 101 , a microphone array 102 and a processor 103 .
  • The radar 101 may include a laser radar, a millimeter-wave radar using electromagnetic waves above 24 GHz, or the like, and its antenna may be a multiple-transmit multiple-receive antenna or, of course, a single antenna.
  • a millimeter-wave radar is used as an example for illustration, and the millimeter-wave radar mentioned below in the present application may also be replaced by a laser radar.
  • the radar may be a millimeter wave radar with an operating frequency of 60 GHz, such as a frequency modulated continuous wave (FMCW) or a single frequency continuous wave radar.
  • the microphone array 102 may include an array of multiple microphones for collecting speech signals.
  • the structure composed of the plurality of microphones may include a centralized array structure, and may also include a distributed array structure. For example, when the sound pressure of the voice emitted by the user exceeds the sound source detection threshold, the voice signal is collected through the microphone array. Each microphone can form a voice signal, and the multi-channel voice signals are fused to form the data collected in the current environment.
  • Centralized array structure: for example, as shown in FIG. 1B, a plurality of microphones are arranged at a certain spacing into a structure of a certain geometric shape; for example, with a spacing of 10 cm between microphones, a circular array is formed.
  • Distributed array structure: for example, as shown in FIG. 1B, microphones can be placed at a number of different locations on a conference table.
  • the processor 103 may be configured to process the radar echo data or the data collected by the microphone array, so as to extract the voice data corresponding to the sound source. It can be understood that the steps of the sound source localization method provided by this application can be executed by the processor 103 .
  • The sound source localization device may include octopus conference equipment, Internet of Things (IoT) devices, intelligent robots and other equipment.
  • The structure of the radar 101 may be as shown in FIG. 1C. The radar may specifically include modules such as a transmitter 1014, a receiver 1015, a power amplifier 1013, a power division coupler 1012, a mixer 1016, a waveform generator 1011, an analog-to-digital (AD) converter 1017 and a signal processor 1018.
  • the waveform generator 1011 generates the required frequency modulation signal.
  • the frequency modulated signal is divided into two signals by the power division coupler 1012, one signal is amplified by the power amplifier 1013, and then the transmitter 1014 generates a transmission signal, which is radiated through the transmission antenna.
  • the other signal part acts as a local oscillator, and generates an intermediate frequency signal in the mixer 1016 with the echo signal received by the receiver 1015 through the receiving antenna. Then it is converted into a digital signal by the AD converter 1017, and the main goal of the signal processor 1018 is to extract the frequency information from the intermediate frequency signal, and obtain basic target information such as distance and speed through further processing for subsequent sound source localization.
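  • For orientation only, the standard FMCW relation between the beat frequency of the intermediate frequency signal and the target range is R = c · f_b · T / (2 · B) for a linear chirp of bandwidth B and duration T. The sketch below illustrates this relation; the windowing and the choice of the strongest spectral peak are simplifying assumptions and are not taken from the application.

```python
import numpy as np

def range_from_if_signal(if_samples, fs, bandwidth_hz, chirp_time_s, c=3.0e8):
    """Estimate target range from one FMCW chirp's intermediate-frequency signal.

    An FFT of the IF signal gives the beat frequency f_b; for a linear chirp
    the range is R = c * f_b * T / (2 * B). Picking the single strongest
    spectral peak is an illustrative simplification.
    """
    windowed = if_samples * np.hanning(len(if_samples))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(if_samples), d=1.0 / fs)
    f_beat = freqs[np.argmax(spectrum)]          # strongest reflection
    return c * f_beat * chirp_time_s / (2.0 * bandwidth_hz)
```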
  • Generally, the transmitter of the radar can continuously transmit a modulated signal, and the modulated signal is reflected back to the radar receiver after encountering a target.
  • The echo signal carries the distance, the angle (azimuth or pitch, etc.), the Doppler information, the micro-Doppler information and the like of the target; these are captured to form the data of the current target for subsequent processing.
  • The millimeter-wave radar may adopt FMCW radar, which has many advantages: the hardware processing is relatively simple, it is easy to implement, the structure is relatively simple, and it is small, light and low-cost, which is suitable for data acquisition and digital signal processing. Theoretically there is no ranging blind zone for FMCW radar, and the average power of the transmitted signal is equal to the peak power, so only low-power devices are required, thereby reducing the probability of interception and interference.
  • For example, a sound source localization device 10 can be set on a conference table; a user can use the sound source localization device 10 to conduct a video conference, and the sound source localization device 10 can be used to track the speaking user so as to extract the speech of the speaking user, so that the receiving side can accurately distinguish who is speaking.
  • the user can control the display screen or control other smart devices through the sound source localization device 10 set on the smart screen, and the sound source localization device Accurate tracking of the sound source can be achieved, thereby accurately extracting the user's voice data.
  • Generally, beamforming based on microphone arrays can accurately pick up speech signals and is widely used in various speech interaction scenarios; it can effectively suppress ambient noise, suppress indoor reverberation, and does not significantly damage the speech. Beamforming relies on accurate estimation of the position of the sound source; adaptive beamforming in particular is very sensitive to the orientation of the sound source, and a position deviation of a few degrees can lead to a significant drop in pickup performance.
  • the microphone array can solve the localization of a single sound source, but it cannot effectively locate multiple sound sources that overlap in time. However, in the daily acoustic environment, the overlapping and movement of sound sources frequently occur, and the microphone array cannot effectively pick up the sound.
  • the “wake word” method is adopted to simplify the scenario to a single target source.
  • In some scenarios such as intelligent conferences, when multiple participants initiate a conversation with the conference system, it is difficult to simplify the scenario to a single-source one, and the system cannot pick up the voices of multiple people talking in parallel.
  • Radar, on the other hand, can accurately locate and track multiple moving or micro-moving target sources.
  • This application introduces the positioning capability of radar into sound pickup, forming a strong complementary relationship with microphone array positioning technology, improving the accuracy and robustness of positioning and tracking in multi-sound-source scenarios, and improving the pickup capability of the microphone array.
  • In a solution in which the microphone array does not participate in the detection and positioning of the sound source, when the radar detects a human target, a beam is turned on; one object corresponds to one beam, and the microphone array picks up the voice signal.
  • In this solution, the radar is relied on completely to detect the position of the sound source, but the radar cannot determine the position of a stationary human body and will therefore miss stationary sound sources.
  • In addition, the computational load of the sound source localization device becomes excessive.
  • In fact, only a sound source that is producing sound is a target of interest for the device. Moreover, live-voice detection, that is, distinguishing between voice played by a loudspeaker and voice directly generated by the vocal organs, is of great significance for voice interaction.
  • Current live-voice detection technology relies on single-modal voice; it can only meet the detection requirements at short distances (such as within 1 meter), and it is difficult to identify long-distance voice sources in noisy environments, so voice produced by a loudspeaker is easily taken as the voice of a living body, causing misjudgment.
  • Therefore, the present application provides a sound source localization method, which combines a radar and a microphone array to accurately locate the sound source, then performs precise beam control for the sound source, and extracts the voice data corresponding to the sound source.
  • Referring to FIG. 2, a schematic flowchart of a sound source localization method provided by the present application is described as follows.
  • the first position information may include information such as the distance, angle or speed of the object relative to the radar.
  • the radar can transmit a modulated wave to the radiation range, and the modulated wave is reflected by the object and then received by the radar to form an echo signal, thereby obtaining echo data.
  • the echo data includes information generated when the detected one or more objects move within the detection range of the radar, such as information about the trajectory generated when the user moves within the radiation range.
  • The echo data may include, when the sound source is within the radiation range of the radar, the speed of the sound source relative to the radar, the distance relative to the radar, the angle, the amplitude of the movement of the sound source, the period of the movement of the sound source, and other information carried in the echo of the radar.
  • the angle may include a pitch angle or an azimuth angle.
  • the radar positioning information can include the distance or angle of the object relative to the radar, and the distance information is contained in the frequency of each echo pulse.
  • the distance information of the object in the current pulse time can be obtained by performing fast Fourier transform on a single pulse. By integrating the distance information of each pulse, the overall distance change information of the object can be obtained.
  • the angle can include the azimuth angle and the elevation angle, and the angle is obtained by measuring the phase difference of each received echo based on the radar's multi-receiving antenna. There may be a certain angle between the echo signal and the receiving antenna due to the position of the reflecting object. This angle can be calculated by calculation, so that the specific position of the reflecting object can be known, and then the position change of the object can be known. There are many ways to calculate the angle, such as establishing a coordinate system with the radar as the center, and calculating the position of the object in the coordinate system based on the echo data, so as to obtain the pitch angle or azimuth angle.
  • the speed of the sound source moving within the radiation range, the distance relative to the radar, the movement amplitude or the relative distance to the radar can be obtained based on the echo signal received by the radar for a period of time. angle, etc.
  • For example, a three-dimensional coordinate system may be established, where (x, y) corresponds to the H-plane and (y, z) corresponds to the E-plane.
  • For the azimuth, the position of the radar is taken as the origin and the x-axis is taken as the polar axis.
  • The coordinates of the object can then be expressed as (r1, θ), where θ represents the azimuth angle.
  • For the pitch, the position of the radar can be taken as the origin and the z-axis as the polar axis; the coordinates of the object can be expressed as (r2, φ), where φ represents the pitch angle.
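  • As a worked illustration only, the azimuth and pitch of a detected object could be computed from its Cartesian coordinates as in the sketch below; the application does not specify whether pitch is measured from the horizontal plane or from the z-axis, so the convention used here is an assumption.

```python
import math

def angles_from_xyz(x, y, z):
    """Convert a target position (radar at the origin) to azimuth and pitch.

    Azimuth is measured in the (x, y) plane from the x-axis; pitch is taken
    here as the elevation above that plane. This convention is an assumption
    for illustration.
    """
    azimuth = math.degrees(math.atan2(y, x))                 # angle in the H-plane
    pitch = math.degrees(math.atan2(z, math.hypot(x, y)))    # elevation toward the E-plane
    return azimuth, pitch
```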
  • If the radar detects that the object is moving but the object does not emit sound, that is, the incident angle of a sound source is not detected, the sound source detection threshold of the microphone array for the object is adjusted; for example, the sound source detection threshold is reduced to improve the sensitivity of the microphone array when collecting voice signals.
  • the microphone array is used to collect signals with sound pressure higher than the sound source detection threshold.
  • the microphone array can usually collect speech signals whose sound pressure exceeds a certain threshold.
  • Generally, a speech signal is collected only when its sound pressure exceeds a certain threshold, collectively referred to below as the sound source detection threshold, and speech signals that do not exceed the threshold are usually discarded.
  • the sound pickup sensitivity for the moving object can be improved by controlling the sound source detection threshold of the microphone array.
  • the location area with moving objects detected by the radar can be used as the candidate location.
  • For the candidate location, the sound source detection threshold is set lower, so that the microphone array can accurately pick up sound from the moving object.
  • For example, for directions in which no candidate sound source is indicated, the sound source detection threshold is set to τ1; for the candidate direction indicated by the radar, the sound source detection threshold is set to τ2, where τ2 < τ1, so as to improve the sensitivity of the microphone array when picking up sound in this direction and reduce missed detection of the sound source.
  • In this way, the radar can indicate the candidate position of the sound source for the microphone array, so that the sound source detection threshold there can be reduced, improving the detection sensitivity of the microphone array for the candidate area and preventing missed detection of the sound source; the sound source detection threshold outside the candidate area can also be raised, reducing the detection sensitivity there and preventing false detection of sound sources.
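  • A minimal sketch of such direction-dependent thresholding is given below; the threshold values, the angular width of a candidate region, and the function interface are all assumptions made for illustration.

```python
def detection_threshold(direction_deg, candidate_dirs_deg,
                        tau1=1.0, tau2=0.5, width_deg=15.0):
    """Return the sound source detection threshold for a given look direction.

    Directions near a radar-indicated candidate (a moving object) get the
    lower threshold tau2 (< tau1), making the microphone array more
    sensitive there; all other directions keep the higher threshold tau1,
    which suppresses false detections. The numeric values and the angular
    width of a candidate region are illustrative assumptions.
    """
    for cand in candidate_dirs_deg:
        # wrap-around angular difference in degrees
        if abs((direction_deg - cand + 180.0) % 360.0 - 180.0) <= width_deg:
            return tau2
    return tau1
```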
  • a plurality of microphones are included in the microphone array for converting sound wave signals into digital signals.
  • the signal obtained by the microphone array can be used for sound source detection and sound source localization, and the incident angle of the sound source relative to the microphone array can be obtained.
  • In the following, the incident angle detected by the microphone array and the angle detected by the radar are generally angles in the same coordinate system.
  • the microphone array 401 may be an array formed by arranging a plurality of microphones, and the center point of the radar 402 may coincide with the center point of the microphone array.
  • In this case, the angle detected by the microphone array and the angle detected by the radar can be aligned so that they are in the same coordinate system.
  • If the microphone array is a distributed array, one of the microphones in the microphone array can be used as a reference microphone; after the incident angle at each microphone is obtained, the incident angles at the microphones are aligned and fused, that is, the incident angle at each microphone is converted to an angle at the reference microphone.
  • The incident angle at the reference microphone and the angle detected by the radar are then aligned so that they are in the same coordinate system.
  • the incident direction may be represented by an azimuth angle or an elevation angle.
  • The angles shown in the figure represent the azimuth angle and the elevation angle of the incident direction, respectively.
  • the continuous speech signal received by each microphone is cut into multiple frames according to the preset duration, and there may be overlap between adjacent frames.
  • the speech signal received by the microphone can be cut into 32ms frames, and the overlapping length between the preceding and following frames is 50%, so as to maintain the continuity of the frame signals.
  • Fourier transform is performed on each frame signal, the complex coefficients of the Fourier transform are output, and it is then determined in which directions there are sound sources.
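  • The framing and transform described above might look like the following sketch; the 32 ms frame length and 50% overlap follow the example given, while the 16 kHz sample rate and the Hann window are assumptions.

```python
import numpy as np

def stft_frames(channel, fs=16000, frame_ms=32, overlap=0.5):
    """Cut one microphone channel into overlapping frames and transform them.

    Mirrors the description above: 32 ms frames with 50% overlap between
    neighbouring frames, each frame windowed and Fourier-transformed into
    complex coefficients.
    """
    frame_len = int(fs * frame_ms / 1000)          # 512 samples at 16 kHz
    hop = int(frame_len * (1.0 - overlap))         # 256 samples for 50% overlap
    if len(channel) < frame_len:
        return np.empty((0, frame_len // 2 + 1), dtype=complex)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(channel) - frame_len) // hop
    frames = np.stack([channel[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)             # complex coefficients per frame
```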
  • hypothesis testing can be used to determine which directions there are sound sources.
  • the hypothesis test can be implemented by grid search, and all possible incident directions are evenly divided into multiple discrete directions.
  • the azimuth angle [0°, 360°] interval is divided into 360 directions at 1 degree intervals.
  • the elevation interval [0, 90] can be divided into 30 intervals according to 3 degrees, and 30*360 is used to represent all directions in the space.
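  • The direction grid from this example can be enumerated as follows; this is only a sketch of the hypothesis set, and the later scoring of each grid point is omitted here.

```python
import numpy as np

# Hypothesised incident directions: 360 azimuths at 1-degree spacing and
# 30 elevation bins of 3 degrees each, i.e. 30 * 360 grid points as in the
# example above.
azimuths_deg = np.arange(0.0, 360.0, 1.0)      # [0, 360) in 1-degree steps
elevations_deg = np.arange(0.0, 90.0, 3.0)     # [0, 90) in 3-degree steps
direction_grid = np.array([(az, el) for el in elevations_deg for az in azimuths_deg])
assert direction_grid.shape == (30 * 360, 2)
```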
  • the difference between the propagation distances of the sound waves reaching each microphone is judged according to the assumed direction.
  • The spacing between the microphones is known; any microphone in the microphone array is selected as the reference microphone, and the remaining microphones are then taken one by one as the microphones under investigation.
  • If the hypothesised grid point is close to or coincides with the real incident direction, then after the time difference is eliminated, the similarity of the signals between the microphones has the highest confidence.
  • The coherence coefficient can be introduced to measure the similarity between the signals; after the coherence coefficients between all the investigated microphones and the reference microphone are obtained, they are aggregated to obtain the overall signal similarity of the array.
  • the overall similarity measure between the investigation microphones and the reference microphone can be expressed, for example, as a weighted sum of the per-microphone coherence coefficients, where w m represents the weight of the m-th microphone signal, s m represents the signal received by the m-th microphone, and M represents the total number of microphones.
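  • a minimal sketch of the grid search with a coherence-based score is given below; the far-field assumption, the supplied microphone geometry, the GCC-PHAT-style phase normalization and the uniform weights are simplifications assumed for illustration, not details taken from this application.

```python
import numpy as np

C_SOUND = 343.0  # speed of sound, m/s

def steering_delays(mic_pos, az_deg, el_deg):
    """Far-field delays (s) of each microphone relative to the reference microphone 0."""
    az, el = np.deg2rad(az_deg), np.deg2rad(el_deg)
    direction = np.array([np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)])
    return (mic_pos - mic_pos[0]) @ direction / C_SOUND

def coherence_score(spectra, freqs, delays, weights=None):
    """Aggregate coherence between each investigation microphone and the reference
    after the hypothesized time differences are compensated."""
    M = spectra.shape[0]
    weights = np.ones(M) / M if weights is None else weights
    score = 0.0
    for m in range(1, M):
        cross = spectra[m] * np.conj(spectra[0])
        cross /= np.abs(cross) + 1e-12                         # phase-only normalization
        comp = np.exp(2j * np.pi * freqs * (delays[m] - delays[0]))
        score += weights[m] * np.real(np.sum(cross * comp))
    return score

def grid_search(spectra, freqs, mic_pos, az_step=1.0, el_step=3.0):
    """Scan azimuth [0, 360) in 1-degree steps and elevation [0, 90) in 3-degree steps."""
    best, best_dir = -np.inf, None
    for az in np.arange(0.0, 360.0, az_step):
        for el in np.arange(0.0, 90.0, el_step):
            s = coherence_score(spectra, freqs, steering_delays(mic_pos, az, el))
            if s > best:
                best, best_dir = s, (az, el)
    return best_dir, best
```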
  • the extreme points (peaks) with the largest coherence coefficients can be selected as the candidate incident directions of the sound source.
  • the candidate direction can be used as the incident direction of the sound source.
  • in other words, the candidate directions of the sound source are selected by means of the coherence coefficient that measures the similarity between the signals, and the direction that better matches the sound source is selected based on the extreme values, so that the position of the sound source can be determined more accurately.
  • multiple angles may be detected by the microphone array, and the multiple angles correspond to one sound source.
  • the multiple angles need to be screened to filter out invalid angles.
  • the angle before screening is referred to as the candidate angle
  • the angle after screening is referred to as the incident angle.
  • the specific method of screening the incident angle may include: if a moving object is detected through radar echo data, the obtained position information includes the first angle of the object relative to the radar. In this case, a plurality of incident angles may be compared with the first angle respectively, and an angle with the smallest difference from the first angle or the difference within the first preset range is selected as the incident angle.
  • if the radar detects multiple angles, these angles can be compared with the incidence angle, and the angle with the smallest difference from the incidence angle is taken as the first angle; the first angle and the incident angle are then weighted and fused to obtain the angle of the sound source.
  • conversely, among multiple candidate incidence angles, the angle closest to the angle of the object detected by the radar can be used as the incident angle.
  • if multiple angles (i.e., multiple third angles) are detected by the microphone array within a period of time, one of the multiple third angles may be selected as the new incident angle based on the moving speed of the object.
  • specifically, if the moving speed of the object is greater than the first preset value, a third angle whose difference from the first angle is within the second preset range is selected from the plurality of third angles as the new incident angle; if the moving speed of the object is not greater than the first preset value, a third angle whose difference from the first angle is within the third preset range is selected as the new incident angle, where the third preset range covers and is larger than the second preset range.
  • the second preset range may be a range in which the difference is greater than a first threshold;
  • the third preset range may be a range in which the difference is greater than a second threshold;
  • the second threshold is less than the first threshold, so the third preset range covers and is larger than the second preset range (a sketch of this screening is given below).
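  • a minimal sketch of this speed-dependent screening is given below; the numeric thresholds and the speed boundary are assumptions for illustration only.

```python
def screen_new_incident_angle(third_angles, first_angle, speed,
                              first_preset_value=0.5,   # m/s, assumed speed boundary
                              first_threshold=20.0,     # deg, assumed (larger) threshold
                              second_threshold=10.0):   # deg, assumed (smaller) threshold
    """Keep candidate (third) angles far enough from the radar angle as new incident angles.

    Fast motion: the difference must exceed the first threshold (second preset range).
    Slow motion: the difference must exceed the smaller second threshold (third preset
    range), which therefore covers and is larger than the second preset range.
    """
    thd = first_threshold if speed > first_preset_value else second_threshold
    kept = []
    for angle in third_angles:
        diff = abs((angle - first_angle + 180.0) % 360.0 - 180.0)  # circular difference
        if diff > thd:
            kept.append(angle)
    return kept
```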
  • the user's voice position may change slowly during this process, and the voice signals generated by the user at multiple positions may be detected through the microphone array, corresponding to multiple different incident angles.
  • the distance between the object emitting the speech signal and the microphone array can also be obtained. For example, if the spacing between the microphones in the microphone array is known, the distance between the object and each microphone can be calculated from the instants at which the speech signal reaches the respective microphones, combined with the inter-microphone spacing; the distance between the object and the reference microphone can then be taken as the distance between the object and the microphone array, as sketched below.
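  • one possible realization is sketched below: given the microphone positions and the measured arrival instants, a candidate source position consistent with the time differences of arrival is found by a small grid search and its distance to the reference microphone is returned; the two-dimensional geometry, search region and resolution are assumptions for illustration.

```python
import numpy as np

C_SOUND = 343.0  # speed of sound, m/s

def locate_from_arrival_times(mic_pos, arrival_times, xs, ys):
    """Grid-search a 2-D source position consistent with the measured time differences.

    mic_pos: (M, 2) microphone coordinates in metres, microphone 0 is the reference.
    arrival_times: (M,) arrival instants of the speech signal at each microphone.
    xs, ys: 1-D arrays of candidate source coordinates (the search grid).
    Returns the best (x, y) and its distance to the reference microphone.
    """
    tdoa_meas = arrival_times - arrival_times[0]             # measured TDOAs w.r.t. mic 0
    best_err, best_xy = np.inf, None
    for x in xs:
        for y in ys:
            d = np.linalg.norm(mic_pos - np.array([x, y]), axis=1)
            tdoa_hyp = (d - d[0]) / C_SOUND                   # TDOAs implied by this position
            err = np.sum((tdoa_hyp - tdoa_meas) ** 2)
            if err < best_err:
                best_err, best_xy = err, (x, y)
    dist = np.linalg.norm(np.array(best_xy) - mic_pos[0])
    return best_xy, dist
```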
  • Step 201 may be executed first, or step 202 may be executed first, or steps 201 and 202 may be executed simultaneously, which may be adjusted according to actual application scenarios.
  • step 203 Determine whether the first position information includes the first angle, and if so, execute step 205, and if not, execute step 204.
  • after the first position information is obtained, it can be determined whether the first position information includes the angle of the sound source relative to the radar. If the first position information includes the angle, the second position information of the sound source may be obtained by fusion based on the first position information and the incident angle, that is, step 205 is performed. If the first position information does not include the angle of the sound source relative to the radar, the incident angle detected by the microphone array can be directly used as the angle of the sound source relative to the microphone array or the radar, that is, step 204 is performed.
  • the position of the object that has relative motion with the radar can be detected by the radar.
  • if the angle of the object cannot be detected from the radar echo data, it means that there is no relative motion between the sound source and the radar; in this case the sounding object cannot be determined from the radar echo data, and the location of the sound source is determined with reference only to the incident angle.
  • this step is aimed at the case where the radar is a millimeter-wave radar.
  • the millimeter-wave radar determines the spatial position of the human body according to the Doppler effect generated by the human body's movement; it expresses not only the direction of the human body relative to the radar but also the distance of the human body with millimeter-level accuracy, yet it cannot detect stationary targets. Therefore, when millimeter-wave radar is used for positioning and there is no moving object within the radiation range, position information such as the angle and speed of a stationary object may not be obtainable from the echo data.
  • step 203 when the radar mentioned in this application is replaced by a laser radar, the object within the radiation range can be directly positioned without judging whether the position information obtained by positioning includes an angle, that is, step 203 does not need to be performed. Therefore, whether to execute step 203 may be judged in combination with an actual application scenario, and the execution of step 203 is only used as an example for illustrative description here, and is not intended to be a limitation.
  • the sound source localization device realizes the sound source localization through two modal localizations, that is, the sound source localization is realized by radar localization and microphone array localization.
  • the radar locates the moving object, while the microphone array locates the incident direction of the speech signal. Because the positioning principles of the two modalities differ, their advantages and disadvantages differ significantly and are strongly complementary, which enables accurate estimation of the sound source orientation.
  • if the first angle is not included, the incident angle can be directly used as the angle of the sound source relative to the microphone array or the radar, thereby obtaining the position of the sound source.
  • for example, the incident angle of the sound source can be collected through the microphone array, but the sounding object cannot be determined from the radar echo data; that is, the radar echo data can only be used to detect the distance between objects within the radiation range and the radar, and cannot determine from which position the sound is emitted.
  • in this case, the incident angle obtained by the microphone array can be directly used as the angle of the sound source relative to the microphone array or the radar, so as to determine the location of the sound source.
  • the distance of the object relative to the microphone array is also detected through the data collected by the microphone array, the distance can be directly used as the distance of the sound source relative to the microphone array for subsequent beam separation.
  • the first angle of the sound source relative to the radar is detected through the radar echo data, the first angle and the incident angle can be weighted and fused to obtain the fusion angle, thereby obtaining the position information of the sound source, that is, the second position information.
  • the second position information includes the fusion angle.
  • the first weight corresponding to the first angle and the second weight corresponding to the incident angle may be determined respectively, where the first weight is positively correlated with the moving speed of the object relative to the radar and the second weight is negatively correlated with the moving speed of the object relative to the radar; the first angle and the incident angle are then weighted and fused according to the first weight and the second weight to obtain the fusion angle, and the second position information includes the fusion angle.
  • when the moving speed of the object is high, the weight of the first angle can be increased and the weight of the incident angle can be reduced;
  • when the moving speed of the object is low, the weight of the first angle can be reduced and the weight of the incident angle can be increased.
  • for example, the first angle can be expressed as θ r with corresponding weight c 1 , and the incident angle can be expressed as θ m with corresponding weight c 2 , so that the fusion angle may be computed, for example, as c 1 ·θ r + c 2 ·θ m with c 1 + c 2 = 1 (a sketch of this fusion is given below).
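  • the following is a minimal sketch of such a fusion; the exponential mapping from speed to c 1 and the gain k are assumptions for illustration only.

```python
import math

def fuse_angles(theta_r, theta_m, speed, k=1.0):
    """Weighted fusion of the radar angle theta_r and the incident angle theta_m (degrees).

    c1 grows with the object's speed (the radar is trusted more when the object moves),
    c2 = 1 - c1 (the microphone array is trusted more when the object is still).
    k is an assumed gain controlling how quickly c1 approaches 1.
    """
    c1 = 1.0 - math.exp(-k * max(speed, 0.0))   # in [0, 1), increases with speed
    c2 = 1.0 - c1
    # interpolate on the circle to avoid wrap-around problems near 0/360 degrees
    diff = (theta_m - theta_r + 180.0) % 360.0 - 180.0
    return (theta_r + c2 * diff) % 360.0

# usage: fuse_angles(theta_r=40.0, theta_m=48.0, speed=0.2) -> closer to 48 (slow: trust mic)
```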
  • the second position information may also include the distance of the sound source relative to the radar or the microphone array, the movement speed of the sound source, the acceleration of the sound source, and the like. If the second position information further includes information such as distance, movement speed of the sound source, acceleration of the sound source, etc., the distance, movement speed of the sound source, acceleration of the sound source and other information may be obtained through radar echo data.
  • the distance of the object relative to the microphone array can also be detected through the data collected by the microphone array, and the distance collected by the microphone and the distance collected by the radar can be fused to obtain the distance of the sound source relative to the microphone array, so as to facilitate the Subsequent beam splitting.
  • the first relative distance and the second relative distance are fused to obtain a fusion distance, where the fusion distance represents the distance of the sound source relative to the microphone array and the second position information also includes the fusion distance; the first relative distance is the distance of the object relative to the radar, and the second relative distance is the distance of the object relative to the microphone array.
  • the user is moving while making sound.
  • the position information of the moving user can be detected through the radar echo data.
  • the position information includes the angle of the object relative to the radar, such as the azimuth angle or the pitch angle; the angle of the object relative to the radar and the incident angle can then be weighted and fused to obtain the angle of the sound source relative to the radar or the microphone array, and thus the position of the sound source relative to the microphone array.
  • the positioning of the microphone array can only obtain the incident angle of the sound emitted by the sound source, and its accuracy is lower than that of radar positioning, so the radar is better suited for tracking the sound source; however, the radar is not sensitive to stationary targets and will ignore a static sound source, in which case the microphone array must be relied on for sound source detection and localization to determine the incident angle of the sounding target. By fusing the positioning information of the microphone array and the radar, the position of the sound source can be obtained more accurately, and continuous tracking and pickup of the target can be realized. In particular, when the object moves slowly or is still, a higher weight can be given to the incident angle, so that the fusion angle is more accurate.
  • three cases can be distinguished: the sound source moves before sounding; the sound source moves after sounding; and the sound source moves while sounding.
  • for example, when the user moves from position S 1 to position S r , the radar first captures the position information of the moving object, and the initial position of the sound source is set to the position of the object detected by the radar, denoted as the radar source S r (θ r ).
  • the position detected by the radar can be used as the initial position of the sound source, and the object can be continuously tracked so that its position change is followed and it can subsequently be detected quickly and accurately whether the object is vocalizing.
  • the sound source detection threshold for the direction in which the object is located can also be lowered, thereby increasing the attention paid to the speech uttered by the object.
  • if the microphone array first obtains the incident angle of the sound emitted by the sound source, the initial position of the sound source is set to the position detected by the microphone array, denoted as the sound source CS m (θ m ).
  • an object can make a sound first, but does not move within the radiation range of the radar, then through the radar echo data, it is impossible to determine which object is making a sound.
  • in this case, the position detected by the microphone array is used as the initial position of the sound source, so that a more accurate position of the object can be obtained through the radar echo data when the object subsequently moves.
  • in other words, either the radar source S r (θ r ) or the sound source CS m (θ m ) can be used as the initial position of the sound source, which can be determined according to the actual application scenario.
  • when the first angle and the incident angle are weighted and fused, the angle obtained from the radar echo data can be given a higher weight if the object is in motion, and the incident angle detected by the microphone array can be given a higher weight if the object is stationary. It can be understood that if the object is in a moving state, its position can be detected more accurately by the radar, so increasing the weight of the first angle makes the final angle of the sound source more accurate; when the object is stationary, the sounding object may not be accurately identified through the radar echo data, so the weight of the incident angle can be increased to make the final angle of the sound source more accurate.
  • the specific position of the sound source can be known, so that the voice data of the sound source can be extracted from the voice signal collected by the microphone array according to the second position.
  • the beam in the direction can be turned on, thereby extracting the voice data of the sound source.
  • the voice data of the sound source can be output through a beam splitting network.
  • the data collected by the microphone array can be used as the input of the beam separation network, and the voice data of the sound source and the background data can be output, and the background data is other data in the input data except the voice data of the sound source.
  • the position of the object detected by the radar and the incident angle detected by the microphone array can be combined to obtain the position of the sound source relative to the microphone array, so that the beam used to separate the voice of the sound source can be turned on based on that position, and the voice data of the sound source can be accurately extracted from the data collected by the microphone array.
  • the position of the sound source can be accurately determined, and the voice data of the sound source can be extracted more accurately.
  • the sound source localization method provided by the present application is described in detail above, and the sound source localization method provided by the present application is described in more detail below with reference to more specific application scenarios.
  • FIG. 8 a schematic flowchart of another sound source localization method provided by the present application is as follows.
  • step 803. Determine whether the first position information includes the first angle, if not, execute step 804, and if so, execute step 805.
  • steps 801-805 reference may be made to the relevant descriptions in the foregoing steps 201-205. Similarities are not repeated here, and this embodiment only introduces steps that are different or more detailed application scenarios.
  • the source location information includes the azimuth angle as an example, and some scenarios are introduced.
  • the azimuth angle mentioned below can also be replaced by the elevation angle in different scenarios.
  • if an object moves but does not make a sound, its position can only be tracked by the radar. If there are multiple sound sources in the scene, such an object can be ignored in order to reduce the load on the device. For example, if it is determined from the radar echo data that multiple objects are moving and making sounds within the radiation range and the number of sounding objects exceeds the preset number, objects that continue to move but do not sound can be ignored, thereby reducing the load and the power consumption of the device.
  • the sounding direction detected by the radar is close to the sounding direction detected by the microphone array, and the angle detected by the microphone can be used as the incident angle of the sound source.
  • for example, if a candidate angle is detected within the range of S r (θ r ) ± θ thd0 , that angle can be regarded as the incident angle matching the sounding object.
  • the angle closest to the azimuth angle can be selected as the incident angle of the sound source.
  • for example, if the direction of the object relative to the sound source localization device is direction a and two candidate angles b and c are detected, where the angle difference between a and b is θ 1 , the angle difference between c and a is θ 2 , and θ 2 > θ 1 , then the angle corresponding to candidate sound source b can be used as the incident angle, and candidate sound source c can be discarded, or the angle corresponding to sound source c can be used as the incident angle of a new sound source.
  • the object may sound first, so the incident angle of the speech signal is first detected by the microphone array.
  • the following situations may occur next.
  • if the position of the sounding object cannot be detected by the radar, the candidate sound source CS m (θ m ) detected by the microphone array can be directly used as the actual sound source.
  • if the direction, angle or distance of the object's movement can be detected through the radar echo data, the radar source S r (θ r ) is obtained, and the radar source S r (θ r ) and the candidate sound source CS m (θ m ) can then be correlated to obtain the actual sound source.
  • the first angle obtained from the echo data may include an azimuth angle or an elevation angle;
  • θ m is the incident angle of the speech signal relative to the microphone array, which may likewise be an azimuth angle or an elevation angle.
  • the incident angle collected by the microphone array and the data collected by the radar can thus be combined to accurately locate the sound source in various scenarios, with strong generalization ability, thereby improving the accuracy of the subsequently extracted speech data.
  • the sound source may also be in motion while sounding; due to the position change of the sound source, the microphone array may detect multiple signal incident angles, from which the angle matching the sound source is selected as the incident angle, or a new incident angle of the sound source is screened out.
  • the method of selecting the incident angle may include: if the differences between the multiple candidate angles and the azimuth angle are within the range S r (θ r ) ± θ thd0 , the candidate angle closest to the azimuth angle can be selected as the incident angle of the sound source.
  • the method of screening out the incident angle of a new sound source may include screening the plurality of candidate angles based on the movement speed of the object. For example, because the object makes sound while moving, multiple candidate positions may be obtained through the microphone array, such as (CS m1 (θ m1 ), CS m2 (θ m2 ), ..., CS mk (θ mk )), which are not within the range S r (θ r ) ± θ thd0 ; the candidate angles are then screened according to the azimuth angle of the radar source S r (θ r ), and new incident angles are screened out.
  • the ways of screening the candidate angles can include the following.
  • suppose the object is in motion from time t1 to time tn with speed v1.
  • if v1 is greater than the preset speed, candidate angles outside the range of the radar source S r (θ r ) ± θ thd1 are screened out as new incident angles, and candidate angles within the range S r (θ r ) ± θ thd1 are discarded.
  • if v1 is not greater than the preset speed, candidate angles outside the range of the radar source S r (θ r ) ± θ thd2 are screened out as new incident angles, where θ thd2 > θ thd1 , and candidate angles within the range S r (θ r ) ± θ thd2 are discarded.
  • the sound source localization device includes a radar 1101 and a microphone array 1102 (ie, a microphone array).
  • the position S r (θ r ) 1103 of the object, also called the radar source, is located through the echo data received by the radar 1101, where θ r is the azimuth or pitch angle of the object relative to the radar.
  • the candidate sound source CS m (θ m ) 1104 is located through the microphone array 1102, also referred to as a sound source.
  • Step 1105 is then executed to determine whether the difference between θ r and θ m is less than θ thd0 , that is, whether θ r and θ m are close to each other.
  • step 1107 can be executed to determine whether the moving speed of the object is greater than the preset speed.
  • the trend of the position of the object changing with time can be obtained according to the radar echo data, and the movement speed of the object can be estimated.
  • for example, if the trajectory of the object over the time period T is ([x 1 ,y 1 ], [x 2 ,y 2 ], ..., [x t ,y t ]), the movement speed v can be estimated from the cumulative displacement over T (a sketch is given below), and it is then judged whether v is greater than v thd .
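  • a minimal sketch of such a speed estimate is given below; it assumes the track points are equally spaced in time with interval dt.

```python
import numpy as np

def estimate_speed(track, dt):
    """Estimate the average moving speed from radar track points over a time window.

    track: (t, 2) array of [x, y] positions sampled every dt seconds.
    Returns the mean displacement magnitude per second.
    """
    steps = np.diff(np.asarray(track, dtype=float), axis=0)   # successive displacements
    return float(np.mean(np.linalg.norm(steps, axis=1)) / dt)

# the estimated v is then compared with the threshold v_thd to choose the screening branch
```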
  • if the moving speed is greater than the preset speed, step 1108 is executed, that is, it is determined whether the difference between θ r and θ m is less than θ thd1 , where θ thd1 > θ thd0 . If the difference between θ r and θ m is less than θ thd1 , CS m (θ m ) is shielded (i.e., step 1110). If the difference between θ r and θ m is not less than θ thd1 , CS m (θ m ) and S r (θ r ) are combined to obtain a new sound source (i.e., step 1111).
  • if the moving speed is not greater than the preset speed, it is determined whether the difference between θ r and θ m is less than θ thd2 , where θ thd2 > θ thd1 . If the difference between θ r and θ m is less than θ thd2 , CS m (θ m ) is shielded (i.e., step 1110). If the difference is not less than θ thd2 , CS m (θ m ) and S r (θ r ) are combined to obtain a new sound source (i.e., step 1111).
  • an incident angle matching the sound source or a new incident angle can be determined according to the moving speed of the sound source, so as to adapt to different moving states of the sound source, and the generalization ability powerful.
  • the beam separation network can be updated, so that the data collected by the microphone array can be used as the input of the updated beam separation network and the voice data of the sound source can be separated.
  • the beam separation network may include a speech separation model and a de-reverberation model.
  • the speech separation model is used to extract the speech data of the sound source
  • the de-reverberation model is used to de-reverberate the input data, thereby filtering part of the background data.
  • the beam separation network can also be updated so that the beam separation model adapts to different scenarios and the voice data matching the sound source can be separated. The specific steps of updating the beam separation network are exemplified below.
  • the speech separation model is usually used to separate the speech data and environmental noise of the sound source.
  • the moving speed may be the moving speed of the sound source relative to the radar or the microphone array, and may be obtained through radar echo data.
  • the moving speed may be set to 0 by default.
  • the separation of speech and environmental noise depends on the speech separation model, and the way in which the speech is separated depends on the incident direction of the speech or the position of the sound source; in particular, when the sound source is moving, the parameters of the model need to adapt to the changing direction or position, so that speech data matching the position of the sound source is output.
  • the parameter set of the speech separation model can be updated according to the movement speed of the sound source, the movement speed is positively correlated with the parameter change rate of the speech separation model, and the parameter change rate of the speech separation model is related to the parameter set, thereby obtaining an update Post-speech separation model.
  • a slow parameter change improves the stability of the model and reduces model jitter, while a fast change helps the model adapt quickly to changes in the environment; the rate of change of the model parameters can therefore be selected according to the movement speed of the target, thereby affecting the parameter set of the speech separation model and yielding the updated speech separation model.
  • the first-order regression can be used to describe the correlation of parameters in time.
  • for example, the model parameters can be updated as θ t = K t ·θ t-1 + (1 − K t )·F(x t ),
  • where K t is the forgetting factor;
  • K t affects the model update speed, and is close to 1 but less than 1;
  • K t is usually determined by the movement speed of the sound source: when the current speed is large, the forgetting factor is small and the model update is accelerated; conversely, when the current speed is small, the forgetting factor is large and the model update is slow.
  • the forgetting factor and the speed can be divided into a plurality of corresponding gears in advance, and after determining the range of the gears where the speed is located, the value of the forgetting factor can be determined, thereby updating the speech separation model from the dimension of speed.
  • the lower the speed the closer the forgetting factor is to 1, and the model updates slowly, increasing the stability of the model.
  • the faster the speed, the smaller the forgetting factor, and the faster the update speed of the model which can adapt to the scene where the sound source moves rapidly, so as to separate the voice data of the sound source from the data collected by the microphone array.
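  • a minimal sketch of this speed-dependent, first-order parameter update is given below; the speed gears and the forgetting-factor values are assumed for illustration only.

```python
import numpy as np

def forgetting_factor(speed):
    """Map movement speed to a forgetting factor K_t (close to 1, smaller for fast motion)."""
    if speed < 0.1:      # nearly still          (assumed gear boundaries)
        return 0.999
    if speed < 0.5:      # slow motion
        return 0.99
    return 0.95          # fast motion -> faster model update

def update_parameters(theta_prev, observation_stat, speed):
    """First-order regression update: theta_t = K_t * theta_{t-1} + (1 - K_t) * F(x_t)."""
    k = forgetting_factor(speed)
    return k * np.asarray(theta_prev) + (1.0 - k) * np.asarray(observation_stat)
```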
  • the speech separation model in this embodiment of the present application can be used to separate the speech from the sound source and the ambient noise in the speech data collected by the microphone array, and the speech separation model may include a beam separation method through generalized sidelobe cancellation or a multi-channel Wiener filtering method A model for speech separation.
  • the weight coefficient vector can be expressed as a function of the covariance matrix R f,t of the received signals and the steering vector r f,t of the sound source;
  • the weight coefficient vector can be understood as a speech separation model.
  • the covariance matrix of the signal received by the microphones can be obtained by the following continuous recursive method: R f,t = K t ·R f,t-1 + (1 − K t )·y f,t y f,t H , where y f,t is the microphone signal vector at frequency f and time t;
  • K t is the forgetting factor, which determines how fast the parameters are updated with time;
  • r f,t is the steering vector of the sound source;
  • the parameter set Θ t includes {R f,t }, f = 1, 2, ..., F, where F is the number of frequency components.
  • the lower the speed the closer the forgetting factor K t is to 1, and the model is updated slowly, increasing the stability of the model.
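  • the following sketch illustrates the recursive covariance update together with one common (MVDR-style) choice of weight vector; the MVDR form and the diagonal loading are illustrative assumptions rather than the specific model of this application.

```python
import numpy as np

def update_covariance(R_prev, y_ft, k_t):
    """R_{f,t} = K_t * R_{f,t-1} + (1 - K_t) * y_{f,t} y_{f,t}^H (one frequency bin)."""
    y = y_ft.reshape(-1, 1)
    return k_t * R_prev + (1.0 - k_t) * (y @ y.conj().T)

def separation_weights(R_ft, r_ft, diag_load=1e-6):
    """Illustrative MVDR-style weight vector built from R_{f,t} and the steering vector r_{f,t}."""
    M = R_ft.shape[0]
    R_inv = np.linalg.inv(R_ft + diag_load * np.eye(M))   # diagonal loading for stability
    num = R_inv @ r_ft
    return num / (r_ft.conj() @ num)

# per frame: R = update_covariance(R, y, k); s_hat = separation_weights(R, r).conj() @ y
```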
  • the speech separation model can be adaptively updated in combination with the motion speed of the sound source, so that the speech separation model matches the motion of the sound source, and the output accuracy of the speech separation model is improved.
  • the de-reverberation model can be used to remove the reverberation in the speech signal, and combined with the speech separation model, the speech data of the sound source can be accurately output from the data collected by the microphone array.
  • the distance between the sound source and the microphone array significantly affects the reverberation of the signal received by the microphone.
  • when the distance is large, the voice signal from the sound source propagates farther and attenuates more, while the indoor reverberation remains unchanged, so the reverberation interferes with the voice signal more and the reverberation duration is longer;
  • when the distance is small, the propagation distance of the speech signal is short, the attenuation is small, and the influence of the reverberation is weakened. Therefore, the parameters of the de-reverberation model can be adjusted based on the distance between the sound source and the microphone array.
  • the de-reverberation can be stopped to improve the quality of the resulting speech data.
  • the delay parameter and the predicted order of the de-reverberation model can be updated according to the distance between the sound source and the microphone array or the radar, so as to obtain the updated de-reverberation model.
  • the delay parameter represents the length of time that the reverberation signal lags behind the speech data of the sound source
  • the prediction order represents the duration of the reverberation. Both the delay parameter and the prediction order are positively correlated with the distance; therefore, once the distance is determined, the values of the delay parameter and the prediction order can be determined based on this distance, resulting in a new de-reverberation model.
  • the de-reverberation model may specifically include a model based on a speech de-reverberation algorithm using blind system identification and equalization, a model based on a source-model de-reverberation algorithm, a model based on a room reverberation model and spectral enhancement, and the like.
  • the de-reverberation model in this embodiment may adopt a multi-channel linear prediction model, such as:
  • for example, the de-reverberated signal can be written as x t,f,m = y t,f,m − Σ m' Σ k=Δ..Δ+K−1 g* k,f,m,m' · y t−k,f,m' , where y t,f,m is the observable signal of the m-th microphone on the f-th frequency component at time t, g is the linear prediction coefficient across multiple channels for the m-th channel, Δ represents the time by which the late reverberation lags behind the direct signal, K represents the order of the linear prediction model and also the duration of the late reverberation, and the linear prediction coefficients g can be obtained by autoregressive modeling.
  • the choice of the order K of the model is very important. Too large K value will lead to excessive de-reverberation, and too small K value will lead to insufficient de-reverberation.
  • the prediction order K is determined according to the position of the sound source, and the delay parameter and the prediction order have a positive correlation with the distance, so after the distance is obtained, the delay parameter and the prediction order can be determined, so as to obtain the unmixing matching the sound source. sound model.
  • the value of K is determined by the distance from the object to the microphone.
  • when the distance is long, the reverberation is relatively strong compared with the direct signal, so a large K value needs to be selected for sufficient de-reverberation; when the distance is short, a small K value can be used for mild de-reverberation.
  • for example, K can be chosen as a function of d, where d represents the distance between the sound source and the microphone array, and the values of the coefficients of this mapping can be adjusted according to actual application scenarios, which are not limited here (a sketch is given below).
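  • the distance-dependent choice of the delay parameter and the prediction order K can be sketched as follows; the breakpoints and values are assumptions for illustration only.

```python
def dereverb_params(distance_m):
    """Choose the delay (frames) and prediction order K of the de-reverberation model
    from the source-to-array distance; both grow with distance (assumed example values)."""
    if distance_m < 1.0:
        return 2, 8      # short distance: mild de-reverberation
    if distance_m < 3.0:
        return 3, 12
    return 3, 20         # long distance: strong reverberation, larger K

# delay, order = dereverb_params(d); the multi-channel linear prediction then uses these values
```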
  • the de-reverberation model can be updated based on the distance between the sound source and the radar or the microphone array, so that the de-reverberation model is adapted to the environment where the sound source is currently located, so as to combine the speech separation model, Output the voice signal of the sound source more accurately.
  • the beam separation network includes the speech separation model and the de-reverberation model
  • the data collected by the microphone array can be used as the input of the beam separation network, and the voice data of the sound source and the background data can be output.
  • the background data is the data other than the data of the sound source in the data collected by the microphone array.
  • data can be collected through a microphone array, and the user's voice data and background data generated in the user's environment can be separated from the data through a beam separation network.
  • the speech separation model is updated from the dimension of speed, and the reverberation model is updated from the dimension of distance.
  • the parameters of the beam separation network can thus be adjusted to adapt to the state of the sound source, so as to separate speech data that better matches the sound source.
  • step 809 Determine whether the voice data meets the preset condition, if yes, continue to step 801, and if not, execute step 810.
  • after the voice data of the sound source is separated from the data collected by the microphone array based on the beam separation network, it can also be determined whether the voice data meets the preset condition. If the voice data does not meet the preset condition, the beam processing the sound source can be closed, that is, step 810 is executed. If the voice data meets the preset condition, the voice data of the sound source can continue to be tracked, that is, steps 801-809 are continued.
  • the preset condition can be adjusted according to the actual scene.
  • the preset condition can include that the voice data picked up by the beam is smaller than a preset value, that the picked-up voice data is a signal of a non-voice category, that the picked-up voice data is generated by a device, or that the user specifies that a specific direction or a specific type of sound source is to be blocked, and so on.
  • for example, the preset condition may include that the sound pressure is less than 43 dB, that the picked-up voice data is ambient sound or noise, that the picked-up voice data is the sound produced by the loudspeaker of a TV, audio system, PC, or the like, or that the user specifies that a certain direction or a certain type of sound source is to be blocked, such as blocking the sound of a dog, blocking the voice of a child, or blocking the voice of the user opposite.
  • one sound source corresponds to one beam separation model. If there are multiple sound sources, multiple beam separation models can be updated based on the information of each sound source, and the user extracts the speech data of each sound source.
  • the beam separation model can be understood as using a beam to extract data in a certain direction from the data collected by the microphone array, so as to directionally collect speech from a sound source in a certain direction from the microphone array.
  • the type of the sound source can also be detected through the voice data of the sound source, and the type of the sound source is displayed on the display interface.
  • features can be extracted from the speech data through a feature extraction network to obtain the acoustic features of the sound source, and the first probability that the sound source is a living body can then be identified according to the acoustic features; the second probability that the sound source is a living body can also be determined according to the radar echo data. The first probability and the second probability are then fused to obtain a fusion result indicating whether the sound source is a living body.
  • the specific fusion method may include weighted summation, multiplication, or logarithmic summation.
  • if the probability value after fusion is greater than the preset probability value, it can be determined that the sound source is a living body; for example, if the fused probability value is greater than 80%, the sound source is determined to be a living body. For example, as shown in FIG. 12, during a multi-person conference, after it is recognized that the object currently emitting the voice is a speaker, if the voice of the object is not shielded, the type of the current sounding object can be displayed as a speaker in the display interface, thereby improving the user experience.
  • the microphone array obtains multiple incident directions through sound source localization, uses a beam separation model to enhance each sound signal, and uses a voice activity detector to exclude non-voice sources and retain voice source signals.
  • suppose the above voice source directions are (θ 1 , θ 2 , ..., θ n ); for each channel of enhanced speech signal, the acoustic features are extracted and sent to a live-speech detector (such as a trained neural network), which outputs, for each channel, the posterior probability (p a (θ 1 ), p a (θ 2 ), ..., p a (θ n )) that the signal is live speech.
  • the radar tracks the trajectory information of the living body in the above-mentioned multiple directions.
  • if the radar detects a moving living body in the θ direction, the voice in the θ direction tends to be regarded as live speech, and the prior probability of live speech in this direction is set to p r (θ) > 0.5; otherwise, the prior probability is set to a value less than 0.5 (a sketch of the fusion is given below).
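  • a minimal sketch of fusing the acoustic posterior with the radar-derived prior by a product is given below; the prior value and the decision threshold are assumptions for illustration only.

```python
def live_voice_probability(p_acoustic, radar_sees_motion, prior_if_motion=0.8):
    """Fuse the acoustic live-voice posterior with a radar-derived prior by a product."""
    p_radar = prior_if_motion if radar_sees_motion else 1.0 - prior_if_motion
    return p_acoustic * p_radar

def is_live_voice(p_acoustic, radar_sees_motion, threshold=0.4):
    """Decide whether a source is live speech; if the radar prior is ~0 the product stays ~0."""
    return live_voice_probability(p_acoustic, radar_sees_motion) > threshold
```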
  • the feature extraction network and the feature recognition network can be, for example, deep convolutional neural networks (DCNN), recurrent neural networks (RNN), and so on.
  • the neural networks mentioned in this application may be of various types, such as a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), a residual network, or other neural networks.
  • the acoustic feature of the sound source is extracted from the speech data through the feature extraction network, and then the probability that the sound source corresponding to the acoustic feature is a living body is output through the identification network, and the first probability value is obtained, for example, the probability of the sound source being a living body is 80%.
  • if the probability that the sound source is a living body determined from the radar echo data is 0, then even if the probability of identifying the sound source as a living body through the recognition network is very high, such as higher than 95%, the sound source is determined to be non-living at this time.
  • if the sound source is not identified as having the characteristics of a living body through the voice data, it may be that the living body is not emitting sound at this time, and it can be determined that the sound source is not live speech. It can be understood that when the value of the first probability is lower than the third threshold, the weight set for the first probability in the weighted fusion is much higher than the weight set for the second probability, so that the fusion result leans toward the result represented by the first probability value.
  • similarly, when the value of the second probability is low, the weight set for the second probability is much higher than the weight set for the first probability, so that the fusion result leans toward the result represented by the second probability value. Therefore, in the embodiment of the present application, the acoustic features and the motion detected by the radar can be combined effectively to determine whether the sound source is a living body, and a more accurate result can be obtained.
  • the aforementioned living body can be replaced by a human body, and features can be extracted from the voice data through a feature extraction network, and then the voice features can be identified to identify the first probability that the sound source issuing the voice is a human body, and whether the sound source is moving according to radar detection. , obtain the second probability that the sound source is a human body, and then perform weighted fusion of the first probability and the second probability to obtain a probability value of whether the sound source is a human body, so as to determine whether the sound source is a human body according to the probability value. Therefore, it can be combined with radar and sound source features to identify whether the object is a human body, and a very accurate identification result can be obtained.
  • the present application accurately identifies whether the current sound source is a living body by combining radar and acoustic features.
  • when a silent person and a loudspeaker are present at the same time, the traditional radar motion detection mode is prone to misjudgment, whereas the acoustic features can distinguish the two; and since the loudspeaker remains static for a long time, the long-term motion features can also be used to exclude it from live speech.
  • a beam can be understood as an algorithm or vector for extracting speech data in a certain direction from the microphone array. Closing a beam means not extracting speech data in that direction through the beam, such as closing the aforementioned beam separation network.
  • for example, if the voice data of a sound source does not meet the preset condition, the beam for that sound source is turned off.
  • for example, if the sound source is identified as a loudspeaker, the beam for the loudspeaker is turned off.
  • a beam in a certain direction may be designated by the user to be turned off.
  • the position of the sound source can be accurately determined by combining the microphone array and the radar. Regardless of whether the sounding object is still or moving, the specific position of the sound source can be detected, the tracking of the sound source can be realized, it can adapt to more scenes, and the generalization ability is strong. In addition, beam management can also be performed by identifying the type of sound source, thereby avoiding picking up invalid speech, improving work efficiency and reducing load.
  • the radar positioning information is obtained through the radar 1301, and the incident angle at which the voice signal is incident on the microphone array is obtained through the microphone array 1302.
  • the radar positioning information may include the movement of the object within the radiation range of the radar over a period of time, such as the object's movement trajectory, acceleration, relative speed to the radar, or relative distance to the radar, and other information within the radiation range.
  • a radar can emit modulated waves within the radiation range, and the modulated waves are reflected by an object and then received by the radar to form an echo signal.
  • the echo data includes information generated when the detected one or more objects move within the detection range of the radar, such as information about the change trajectory generated when the user's hand moves within the radiation range.
  • the specific structure of the radar can refer to the aforementioned FIG. 1D , which will not be repeated here.
  • millimeter-wave radars can be used as radars, such as radars with operating frequencies in the 60GHz and 77GHz frequency bands, available bandwidths greater than 4GHz, and range resolutions up to centimeters.
  • the millimeter-wave radar can have a multi-receive and multi-transmit antenna array, and can realize the estimation of the horizontal azimuth and vertical azimuth of the moving object.
  • the radar positioning information can include the distance or angle of the object relative to the radar.
  • the distance information is contained in the frequency of each echo pulse.
  • the distance information of the object in the current pulse time can be obtained by performing fast Fourier transform on a single pulse at a fast time. By integrating the distance information of each pulse, the overall distance change information of the object can be obtained.
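  • a minimal sketch of this fast-time processing for an FMCW-style pulse is given below; the chirp-slope parameter and the windowing are assumptions for illustration only.

```python
import numpy as np

def range_profile(beat_samples, fs, slope_hz_per_s, c=3e8):
    """Fast-time FFT of one chirp's beat signal; returns (ranges_m, magnitude spectrum).

    beat_samples: complex baseband samples of a single pulse/chirp.
    fs: ADC sampling rate (Hz); slope_hz_per_s: chirp slope (assumed known from the radar).
    The range follows from the beat frequency: R = c * f_beat / (2 * slope).
    """
    n = len(beat_samples)
    spectrum = np.abs(np.fft.rfft(beat_samples * np.hanning(n)))
    f_beat = np.fft.rfftfreq(n, d=1.0 / fs)
    ranges = c * f_beat / (2.0 * slope_hz_per_s)
    return ranges, spectrum

# the strongest bins of `spectrum` give candidate object distances for the current pulse;
# stacking the profiles of successive pulses gives the overall distance change of the object
```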
  • the angle can include the azimuth angle and the elevation angle, and the acquisition of the angle is realized by measuring the phase difference of each received echo based on the multi-receiving antenna of the radar.
  • This angle can be calculated by calculation, so that the specific position of the reflecting object can be known, and then the position change of the object can be known.
  • There are many ways to calculate the angle such as establishing a coordinate system with the radar as the center, and calculating the position of the object in the coordinate system based on the echo data, so as to obtain the pitch angle or the azimuth angle.
  • the Multiple Signal Classification (MUSIC) algorithm can be used to calculate the angle, including the pitch angle or the azimuth angle, etc.
  • the four-receiving antenna array of the radar can be used to measure the angle change of the object.
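  • a rough sketch of MUSIC-based angle estimation is given below; it assumes a uniform linear receive array with half-wavelength spacing and a known number of sources, which are simplifications for illustration and not details of this application.

```python
import numpy as np

def music_spectrum(snapshots, n_sources, n_angles=181):
    """MUSIC pseudo-spectrum over [-90, 90] degrees for a half-wavelength uniform linear array.

    snapshots: (M, T) complex array, M receive antennas, T snapshots.
    """
    M, T = snapshots.shape
    R = snapshots @ snapshots.conj().T / T                         # sample covariance
    eigvals, eigvecs = np.linalg.eigh(R)                           # ascending eigenvalues
    En = eigvecs[:, : M - n_sources]                               # noise subspace
    angles = np.linspace(-90.0, 90.0, n_angles)
    p = np.empty(n_angles)
    for i, ang in enumerate(angles):
        a = np.exp(-1j * np.pi * np.arange(M) * np.sin(np.deg2rad(ang)))  # steering vector
        denom = np.linalg.norm(En.conj().T @ a) ** 2
        p[i] = 1.0 / max(denom, 1e-12)
    return angles, p

# peaks of p indicate the azimuth (or, with a vertical array, the elevation) of the reflectors
```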
  • the sound source localization 1303 is performed to locate the actual position of the sound source relative to the radar or the microphone array.
  • the angle and the incident angle included in the radar positioning information can be weighted and fused to obtain the fusion angle of the sound source relative to the microphone array or radar, thereby determining the actual position of the sound source relative to the radar or the microphone array.
  • the microphone array locates multiple candidate angles, the angle closest to the angle detected by the radar can be selected as the incident angle.
  • the microphone array detects multiple candidate angles within a period of time, and an angle farther from the angle detected by the radar can be selected as the new incident angle.
  • the relevant introduction in the foregoing step 805 please refer to the relevant introduction in the foregoing step 805 .
  • the speech separation model 1304 is updated based on the motion speed of the sound source
  • the de-reverberation model 1305 is updated based on the relative distance between the sound source and the radar.
  • the updated speech separation model and the updated de-reverberation model form a beam separation network, and perform signal separation 1306 on the data collected by the microphone array 1302 to separate the speech data of the sound source.
  • the speech separation model and the de-reverberation model are included in the beam separation network, which can be understood as forming a beam for the sound source through the beam separation network, so as to realize the separation of the data collected by the microphone array, and extract the speech data of the sound source and Speech data generated by background objects.
  • voice detection 1307 is performed on the voice data of the sound source to identify whether the sound source is a living body.
  • the speech data is the sound produced by a living body.
  • it can also combine the motion characteristics detected by the radar (such as the movement generated by the object's walking when speaking or the characteristics generated by other periodic motions, etc.) to further determine the sound source. Whether the source is a living body can accurately detect whether the sound source is a living body.
  • for example, the acoustic features of sound source A are extracted for detection, and the probability that A is live speech is identified; according to whether the radar detects that the object is moving, the probability that a living body exists in the scene is obtained. The two modal detection results can then be fused in the form of a product, and the existence of a living body is judged according to the fused probability. Usually, when the radar determines that the probability of existence of a living body is zero, even if the existence probability given by the acoustic modality is very high, the fused probability is close to zero, and it is judged that there is no live speech in the scene.
  • Dual-modal live speech detection effectively overcomes two difficult problems that traditional methods cannot overcome. First of all, it is difficult to distinguish between high-fidelity loudspeaker sound and live speech. The spectral characteristics between the two are almost identical, but the motion detection of radar can easily distinguish the two.
  • second, when a silent person and a loudspeaker are present at the same time, the traditional radar motion detection mode is prone to misjudgment, whereas the acoustic features can distinguish the two; and since the loudspeaker remains static for a long time, the long-term motion features detected from the radar echo can also be used to exclude it from live speech.
  • Beam management 1308 is then performed based on the detection results to determine whether to reserve the beam for the sound source.
  • whether to turn off the beam for the sound source may be determined according to the result of the sound source detection.
  • in addition, empirical rules can be used, for example: a living body usually moves, and the speech signal is sometimes strong and sometimes weak, so even if the sound source localization device misses some weak syllables, this may not cause semantic misunderstanding. Therefore, by combining these rules, it is possible to accurately identify whether the sound source is live, and based on the identification result it is determined whether to close the beam used for extracting speech from the sound source. Thus, in the embodiment of the present application, the radar and the microphone array are combined to locate the sound source, the beam for extracting the voice of the sound source is determined based on the positioning, and the voice data of the sound source is extracted accurately.
  • the present application provides a sound source localization device for performing the steps of the aforementioned methods in FIGS. 2-13 , the sound source localization device may include:
  • a radar positioning module configured to obtain first position information through radar echo data, where the first position information includes position information of an object relative to the radar;
  • a microphone array positioning module configured to obtain an incident angle from the voice signal collected by the microphone array, where the incident angle is the angle at which the voice signal is incident on the microphone array;
  • a sound source localization module configured to fuse based on the first position information and the incident angle to obtain a second position if the first position information includes a first angle of the object relative to the radar information, and the second location information includes location information of a sound source that generates the speech signal.
  • the device further includes:
  • a voice separation module configured to extract voice data of the sound source from the voice signals collected by the microphone array based on the second position information.
  • the speech separation module is specifically configured to use the data collected by the microphone array as the input of a preset beam separation network, and output the speech data of the sound source.
  • the beam separation network includes a speech separation model, and the speech separation model is used to separate speech data and background data of a sound source in the input data, and the apparatus further includes:
  • an update module configured to determine the moving speed of the sound source according to the echo data before the voice signal collected by the microphone array is used as the input of the preset beam separation network; update the sound source according to the moving speed the voice separation model to obtain the updated voice separation model.
  • the updating module is specifically configured to determine a parameter set of the speech separation model according to the moving speed, and obtain the updated speech separation model, wherein the parameter set and the The rate of change of the parameters of the speech separation model is correlated, and the moving speed and the rate of change are positively correlated.
  • the beam separation network further includes a de-reverberation model, and the de-reverberation model is used to filter out the reverberation signal in the input data;
  • the updating module is further configured to update the de-reverberation model according to the distance between the object and the radar before the voice signal collected by the microphone array is used as the input of the preset beam separation network, to obtain the updated de-reverberation model.
  • the updating module is specifically configured to update the delay parameter and the prediction order in the de-reverberation model according to the distance between the object and the radar, to obtain the updated de-reverberation model.
  • the delay parameter represents the length of time that the reverberation signal lags behind the speech data of the sound source
  • the prediction order represents the duration of the reverberation
  • both the delay parameter and the prediction order are positively correlated with the distance.
  • the voice separation module is further configured to, if the voice data of the sound source does not meet a preset condition, remove the beam used for processing the data corresponding to the sound source in the data collected by the microphone array.
  • the apparatus further includes a living body detection unit configured to: extract features from the speech data to obtain the acoustic features of the sound source; identify, according to the acoustic features, a first probability that the sound source is a living body; determine, according to the echo data of the radar, a second probability that the sound source is a living body; and fuse the first probability and the second probability to obtain a fusion result, where the fusion result is used to indicate whether the sound source is a living body.
  • the first angle and the incident angle are in the same coordinate system
  • the sound source localization module is specifically configured to determine a first weight corresponding to the first angle and a second weight corresponding to the incident angle, where the first weight is positively correlated with the moving speed of the object relative to the radar and the second weight is negatively correlated with the moving speed of the object relative to the radar, and to perform weighted fusion of the first angle and the incident angle according to the first weight and the second weight to obtain a fusion angle, where the second position information includes the fusion angle.
  • the microphone array positioning module is specifically configured to obtain a plurality of second angles through the voice signals collected by the microphone array, the first angle and the plurality of second angles are at the same In the coordinate system, the angle with the smallest difference from the first angle or the difference within the first preset range is selected from the plurality of second angles as the incident angle.
  • the microphone array positioning module is specifically configured to, after acquiring the incident angle of the speech signal collected by the microphone array, if a plurality of third angles are detected within a period of time, select an angle from the plurality of third angles as the new incident angle based on the moving speed of the object.
  • the microphone array positioning module is specifically configured to: if the moving speed of the object is greater than a preset speed, select from the plurality of third angles an angle whose difference from the first angle is within the second preset range as the new incident angle; and if the moving speed of the object is not greater than the preset speed, select from the plurality of third angles an angle whose difference from the first angle is within the third preset range as the new incident angle, where the third preset range covers and is larger than the second preset range.
  • the sound source localization module is further configured to, if the first position information does not include the first angle, use the incident angle as the angle of the sound source relative to the microphone array.
  • the sound source localization module is further configured to, before the incident angle of the speech signal collected through the microphone array is acquired, if position information of an object moving within the detection range of the radar is determined from the echo data and the object does not emit sound, adjust the sound source detection threshold of the microphone array for the object, where the microphone array is used to collect signals whose sound pressure is higher than the sound source detection threshold.
  • the first position information further includes a first relative distance between the object and the radar, and the sound source localization module is further configured to, if a second relative distance between the object and the microphone array is also obtained through the voice signal collected by the microphone array, fuse the first relative distance and the second relative distance to obtain a fusion distance, where the fusion distance indicates the distance of the sound source relative to the microphone array, and the second location information also includes the fusion distance.
  • FIG. 15 is a schematic structural diagram of another sound source localization device provided by the present application, as described below.
  • the sound source localization apparatus may include a processor 1501 and a memory 1502 .
  • the processor 1501 and the memory 1502 are interconnected by lines; the memory 1502 stores program instructions and data corresponding to the steps in the foregoing FIG. 2 to FIG. 13.
  • the processor 1501 is configured to execute the method steps executed by the sound source localization apparatus shown in any of the foregoing embodiments in FIG. 2 to FIG. 13 .
  • the sound source localization apparatus may further include a transceiver 1503 for receiving or transmitting data.
  • the sound source localization device may further include a radar and/or a microphone array (not shown in FIG. 15), or may be connected to a radar and/or a microphone array; for the radar and/or the microphone array, reference may be made to the radar and/or microphone array mentioned above in FIG. 2 to FIG. 13, and details are not repeated here.
  • Embodiments of the present application also provide a computer-readable storage medium storing a program; when the program runs on a computer, the computer is caused to execute the steps of the methods described in the embodiments shown in the foregoing FIG. 2 to FIG. 13.
  • optionally, the sound source localization device shown in FIG. 15 may be a chip.
  • the embodiment of the present application also provides a sound source localization device, which may also be called a digital processing chip or a chip.
  • the chip includes a processing unit and a communication interface.
  • the processing unit obtains program instructions through the communication interface; the program instructions are executed by the processing unit, and the processing unit is configured to execute the method steps performed by the sound source localization apparatus in any of the foregoing embodiments in FIG. 2 to FIG. 13.
  • the embodiments of the present application also provide a digital processing chip.
  • the digital processing chip integrates circuits and one or more interfaces for implementing the functions of the foregoing processor 1501.
  • when a memory is integrated into the digital processing chip, the chip can complete the method steps of any one or more of the foregoing embodiments; when no memory is integrated, the chip can be connected to an external memory through a communication interface and implements, according to the program code stored in the external memory, the actions performed by the sound source localization apparatus in the foregoing embodiments.
  • the embodiments of the present application also provide a computer program product which, when run on a computer, causes the computer to execute the steps performed by the sound source localization apparatus in the methods described in the embodiments shown in the foregoing FIG. 2 to FIG. 13.
  • the sound source localization device may be a chip, and the chip includes a processing unit and a communication unit; the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit.
  • the processing unit can execute the computer-executable instructions stored in the storage unit, so that the chip executes the methods described in the embodiments shown in FIG. 2 to FIG. 13.
  • the storage unit is a storage unit in the chip, such as a register, a cache, etc.
  • the storage unit may also be a storage unit that is located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).
  • the aforementioned processing unit or processor may be a central processing unit (CPU), a neural-network processing unit (NPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • a general purpose processor may be a microprocessor or it may be any conventional processor or the like.
  • FIG. 16 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • the chip may be implemented as a neural network processing unit (NPU) 160; the NPU 160 is mounted on a host CPU as a coprocessor, and the host CPU allocates tasks to it.
  • the core part of the NPU is the arithmetic circuit 1603; the controller 1604 controls the arithmetic circuit 1603 to fetch matrix data from the memory and perform multiplication operations.
  • the arithmetic circuit 1603 includes multiple processing units (process engines, PEs). In some implementations, arithmetic circuit 1603 is a two-dimensional systolic array. The arithmetic circuit 1603 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, arithmetic circuit 1603 is a general-purpose matrix processor.
  • for example, suppose there are an input matrix A, a weight matrix B, and an output matrix C; the arithmetic circuit fetches the data corresponding to matrix B from the weight memory 1602 and buffers it on each PE in the arithmetic circuit.
  • the arithmetic circuit fetches the data of matrix A and matrix B from the input memory 1601 to perform matrix operation, and stores the partial result or final result of the matrix in the accumulator 1608 .
  • Unified memory 1606 is used to store input data and output data.
  • the weight data is transferred to the weight memory 1602 through a direct memory access controller (DMAC) 1605.
  • Input data is also moved into unified memory 1606 via the DMAC.
  • a bus interface unit (BIU) 1610 is used for interaction among the AXI bus, the DMAC, and an instruction fetch buffer (IFB) 1609; it is used by the instruction fetch buffer 1609 to obtain instructions from the external memory, and by the storage unit access controller 1605 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 1606 , the weight data to the weight memory 1602 , or the input data to the input memory 1601 .
  • the vector calculation unit 1607 includes a plurality of operation processing units, and further processes the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc., if necessary. It is mainly used for non-convolutional/fully connected layer network computations in neural networks, such as batch normalization, pixel-level summation, and upsampling of feature planes.
  • vector computation unit 1607 can store the processed output vectors to unified memory 1606 .
  • the vector calculation unit 1607 may apply a linear function and/or a nonlinear function to the output of the arithmetic circuit 1603, for example performing linear interpolation on the feature plane extracted by a convolutional layer, or applying the function to a vector of accumulated values to generate activation values.
  • the vector computation unit 1607 generates normalized values, pixel-level summed values, or both.
  • the vector of processed outputs can be used as activation input to the arithmetic circuit 1603, eg, for use in subsequent layers in a neural network.
  • the instruction fetch buffer (instruction fetch buffer) 1609 connected to the controller 1604 is used to store the instructions used by the controller 1604;
  • the unified memory 1606, the input memory 1601, the weight memory 1602 and the instruction fetch memory 1609 are all On-Chip memories. External memory is private to the NPU hardware architecture.
  • each layer in the recurrent neural network can be performed by the operation circuit 1603 or the vector calculation unit 1607 .
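For illustration only, a behavioural Python/NumPy sketch of the data flow described above (matrix multiplication with accumulation followed by element-wise post-processing in the vector calculation unit); this is not a model of the hardware itself, and the activation choice is an assumption.

```python
import numpy as np

def npu_layer_sketch(A, B, activation=np.tanh):
    """Behavioural sketch of the described data flow.

    The arithmetic circuit multiplies input matrix A by weight matrix B and
    accumulates partial results; the vector calculation unit then applies an
    element-wise function to the accumulated output to produce activations.
    """
    accumulated = A @ B             # arithmetic circuit + accumulator 1608
    return activation(accumulated)  # vector calculation unit 1607

out = npu_layer_sketch(np.ones((2, 3)), np.ones((3, 4)))
```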
  • the processor mentioned in any one of the above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the programs of the above-mentioned methods in FIGS. 2-13 .
  • the device embodiments described above are only illustrative; the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • the connection relationship between the modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
  • the technical solutions of the present application may be embodied in the form of a software product stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc, including several instructions for instructing a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the various embodiments of the present application.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, over coaxial cable, optical fiber, or a digital subscriber line (DSL)) or a wireless manner (for example, over infrared, radio, or microwave).
  • the computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or a data center, that integrates one or more available media.
  • the usable media may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)), or the like.

Abstract

一种声源定位方法、装置、计算机可读存储介质以及拾音装置,用于结合麦克风阵列和雷达对声源进行准确地定位。该方法包括:通过雷达回波数据获取第一位置信息(201),该第一位置信息中包括对象相对于雷达的第一角度;通过麦克风阵列采集到的语音信号获取入射角(202),该入射角为语音信号入射至麦克风阵列的角度;对第一角度和入射角进行融合,以得到第二位置信息(205),该第二位置信息用于表示产生语音信号的声源的位置。

Description

一种声源定位方法以及装置
本申请要求于2020年12月31日提交中国专利局、申请号为CN202011637064.4、申请名称为“一种声源定位方法以及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能领域,尤其涉及一种声源定位方法以及装置。
背景技术
语音交互广泛应用于智能会议和家居产品,其首要问题在于嘈杂环境下的语音信号拾音,防止环境噪音和室内混响对目标语音信号的干扰。基于麦克风阵列的波束形成能够准确拾取语音信号,被广泛应用于各类语音交互产品,它能够有效抑制环境噪音,压制室内混响,而不明显损伤语音。波束形成有赖于语音源方位的准确估计,尤其是自适应波束形成技术,对声源方位及其敏感,几度的位置偏差容易导致拾音性能的大幅下降,因此,如何实现对声源的准确定位,成为亟待解决的问题。
发明内容
本申请提供一种声源定位方法以及装置,用于结合麦克风阵列和雷达对声源进行准确地定位。
第一方面,本申请提供一种声源定位方法,包括:通过雷达回波数据获取第一位置信息,所述第一位置信息中包括对象相对于所述雷达的第一角度;通过麦克风阵列采集到的语音信号获取入射角,所述入射角为语音信号入射至所述麦克风阵列的角度;融合所述第一角度和所述入射角,以得到第二位置信息,所述第二位置信息用于表示产生所述语音信号的声源的位置。
因此,在本申请实施方式中,可以结合雷达检测到的对象的位置和麦克风阵列检测到的入射角,得到声源相对于麦克风阵列的位置,从而通过该位置控制用于分离声源的语音的波束的开启,从而准确地从麦克风阵列采集到的数据中提取到声源的语音数据。并且,无论发声对象处于静止或者运动状态,都可以准确地确定出声源的位置,可以更准确地提取到声源的语音数据。
在一种可能的实施方式中,所述融合第一角度和入射角,可以包括:分别确定所述第一角度对应的第一权重和所述入射角对应第二权重,其中,所述第一权重和所述对象相对于所述雷达的移动速度呈正相关关系,所述第二权重和所述对象相对于所述雷达的移动速度呈负相关关系;根据所述第一权重和所述第二权重对所述第一角度和所述入射角进行加权融合,得到融合角度,所述第二位置信息中包括所述融合角度。
因此,在本申请实施方式中,在对第一角度和入射角进行加权融合时,可以考虑对象的移动速度来确定权重,从而可以使用多种对象运动的情况,提高融合角度的准确度。
在一种可能的实施方式中,所述方法还包括:基于所述第二位置信息从所述麦克风阵列采集到的语音信号中提取所述声源的语音数据。
因此,在声源定位之后,即可基于准确的声源的位置,从而麦克风阵列采集到的数据中准确提取到声源的语音数据。
在一种可能的实施方式中,所述基于所述第二位置信息从所述麦克风阵列采集到的语音信号中提取所述声源的语音数据,包括:将所述麦克风阵列采集到的数据作为预设的波束分离网络的输入,输出所述声源的所述语音数据。
因此,在本申请实施方式中,可以通过波束分离网络来从麦克风阵列采集到的数据中分离出声源的语音数据,即通过波束形成的方式,提取到与声源对应的方向上的语音数据,从而得到更准确的声源内的语音数据。
在一种可能的实施方式中,所述波束分离网络包括语音分离模型,所述语音分离模型用于分离输入数据中的声源的语音数据和背景数据,所述方法还包括:根据所述回波数据确定所述声源的移动速度;根据所述移动速度更新所述语音分离模型,得到更新后的所述语音分离模型。
因此,在本申请实施方式中,可以结合声源的运动速度,适应性地更新语音分离模型,使语音分离模型与声源的运动情况匹配,能够适应声源快速移动的场景,以便从麦克风阵列采集到的数据中分离出声源的语音数据。
在一种可能的实施方式中,所述根据所述移动速度更新所述语音分离模型,包括:根据所述移动速度确定所述语音分离模型的参数集,得到更新后的所述语音分离模型,其中,所述参数集和所述语音分离模型的参数的变化速率相关,所述移动速度和所述变化速率呈正相关关系。
因此,本申请实施方式中,参数慢变可以提高模型的稳定性,减少模型的抖动;快变则有利于快速适应环境的变化,因此可以根据目标运动速度来选择模型参数变化的速率,从而影响语音分离模型的参数集,得到更新后的语音分离模型。
在一种可能的实施方式中,所述波束分离网络还包括解混响模型,所述解混响模型用于滤除输入的数据中的混响信号;所述方法还包括:根据所述对象和所述雷达之间的距离,更新所述解混响模型,得到更新后的所述解混响模型。
因此,在本申请实施方式中,可以通过解混响模型来接触麦克风阵列采集到的数据的混响,从而使语音分离模型分离出的声源的语音数据更准确。
在一种可能的实施方式中,所述根据所述对象和所述雷达之间的距离,更新所述解混响模型,包括:根据所述对象和所述雷达之间的距离,更新所述解混响模型中的延迟参数和预测阶数,得到更新后的所述解混响模型,所述延迟参数表示所述混响信号滞后于所述声源的语音数据的时长,所述预测阶数表示混响的持续时长,所述延迟参数和所述预测阶数都与所述距离呈正相关关系。
通常,声源和麦克风阵列的距离显著影响麦克风接收到的信号的混响。当距离较大时,声源发出的语音信号传播距离较远,衰减较大,而室内混响保持不变,混响对于语音信号的干扰较大,混响持续时间较长;而距离越近时,声源发出的语音信号传播距离较近,衰减较小,混响的影响减弱。因此,解混响模型的参数可以基于声源和麦克风阵列的距离来进行调整。当距离较远时,加大解混响的程度;当距离较近时,减少解混响的程度,防止 过度解混响而干扰语音信号。甚至在距离非常小的情况下,如小于预设最小值,则可以停止解混响,以提高得到的语音数据的质量。
在一种可能的实施方式中,所述方法还包括:若所述声源的语音数据不符合预设条件,则去除对所述麦克风阵列采集到的语音信号进行处理所使用的波束。
因此,在本申请实施方式中,当声源的语音数据不符合预设条件,如声源不是活体,或者声源的位置改变等,则去除对所述麦克风阵列采集到的语音信号进行处理所使用的波束,避免采集到无意义的数据。
在一种可能的实施方式中,所述方法还包括:从所述语音数据中提取特征,得到所述声源的声学特征;根据所述声学特征识别所述声源为活体的第一概率;根据所述雷达的回波数据,确定所述声源为活体的第二概率;对所述第一概率和所述第二概率进行融合,得到融合结果,所述融合结果用于表示所述声源是否为活体。
因此,在本申请实施方式中,还可以检测声源是否为活体,从而可以是用户清楚地获知当前发声的对象的类型是否为活体,提高用户体验。
在一种可能的实施方式中,所述通过麦克风阵列采集到的语音信号获取入射角,包括:若通过麦克风阵列采集到的语音信号得到多个第二角度,所述第一角度和所述多个第二角度处于同一坐标系中,则从所述多个第二角度中选取与所述第一角度之间的差值最小或者所述差值在第一预设范围内的角度作为所述入射角。
因此,在本申请实施方式中,可以通过麦克风阵列采集到多个角度,此时可以结合雷达采集到角度,选择出与声源最接近的角度作为入射角,提高得到入射角的准确率。
在一种可能的实施方式中,在所述通过麦克风阵列采集到的语音信号获取入射角之后,所述方法还包括:若基于所述麦克风阵列再次采集到的数据得到多个第三角度,则基于所述对象的移动速度,从所述多个第三角度中选取角度作为新的所述入射角。
因此,在本申请实施方式中,当通过麦克风阵列得到多个角度之后,可以基于对象的移动速度从该多个角度中选择新的入射角,从而可以适应声源的位置不断改变的情况。
在一种可能的实施方式中,所述基于所述对象的移动速度,从所述多个角度中选取第三角度作为新的所述入射角,包括:若所述对象的移动速度大于预设速度,则从所述多个第三角度中筛选出,与所述第一角度之间的差值在第二预设范围内的角度作为新的所述入射角;若所述对象的移动速度不大于所述预设速度,则从所述多个第三角度中筛选出,与所述第一角度之间的差值在第三预设范围内的角度作为新的所述入射角,所述第三预设范围覆盖且大于所述第二预设范围。
因此,在本申请实施方式中,当对象的移动速度过大时,则可以从较远的位置选择新的角度作为入射角,当速度较慢时,则可以从较近的位置选择新的角度作为入射角,适应对象的位置不断变化的情况,泛化能力强。
在一种可能的实施方式中,所述方法还包括:若所述第一位置信息中不包括所述第一角度,则将所述入射角作为所述声源相对于所述麦克风阵列的角度。
因此,在本申请实施方式中,若对象未移动,通过雷达可能不能检测到对象相对于雷达的角度,此时可以直接将通过麦克风阵列得到的入射角作为声源相对于麦克风阵列的角 度,即使对象未移动,也可以准确检测到声源,提高声源的位置检测准确性。
在一种可能的实施方式中,在所述通过麦克风阵列采集到的语音信号获取入射角之前,所述方法还包括:若通过所述回波数据确定对象处于运动状态,且所述对象未发声,则调整所述麦克风阵列针对所述对象的声源检测阈值,所述麦克风阵列用于采集声压高于所述声源检测阈值的信号。
在本申请实施方式中,若检测到对象在移动而未发声时,则可以降低声源检测阈值,相当于关注该声源是否发声,提高对该声源的关注度,进而可以快速检测到该声源是否发声。
在一种可能的实施方式中,所述第一位置信息中还包括对象和所述雷达的第一相对距离,所述方法还包括:还通过麦克风阵列采集到的语音信号,获取到对象和所述麦克风阵列的第二相对距离,对所述第一相对距离和所述第二相对距离进行融合,得到融合距离,所述融合距离表示所述声源相对于所述麦克风阵列的距离,所述第二位置信息中还包括所述融合距离。
在本申请实施方式中,若通过麦克风阵列采集到对象和麦克风之间的距离,则可以对该距离和雷达采集到的距离进行融合,从而得到声源相对于麦克风阵列或者雷达的距离,以便进行后续操作,如更新波束分离网络,提高分离出声源的语音数据的准确度。
第二方面,本申请提供一种声源定位装置,包括:
雷达定位模块,用于通过雷达回波数据获取第一位置信息,所述第一位置信息中包括对象相对于所述雷达的第一角度;
麦阵定位模块,用于通过麦克风阵列采集到的语音信号获取入射角,所述入射角为语音信号入射至所述麦克风阵列的角度;
声源定位模块,用于融合所述第一角度和所述入射角,以得到第二位置信息,所述第二位置信息用于标识产生所述语音信号的声源的位置。
在一种可能的实施方式中,所述声源定位模块,具体用于分别确定所述第一角度对应的第一权重和所述入射角对应第二权重,其中,所述第一权重和所述对象相对于所述雷达的移动速度呈正相关关系,所述第二权重和所述对象相对于所述雷达的移动速度呈负相关关系;根据所述第一权重和所述第二权重对所述角度和所述入射角进行加权融合,得到融合角度,所述第二位置信息中包括所述融合角度。
在一种可能的实施方式中,所述装置还包括:
语音分离模块,用于基于所述第二位置信息从所述麦克风阵列采集到的语音信号中提取所述声源的语音数据。
在一种可能的实施方式中,语音分离模块,具体用于将所述麦克风阵列采集到的数据作为预设的波束分离网络的输入,输出所述声源的所述语音数据。
在一种可能的实施方式中,所述波束分离网络包括语音分离模型,所述语音分离模型用于分离输入数据中的声源的语音数据和背景数据,所述装置还包括:
更新模块,用于根据所述回波数据确定所述声源的移动速度;根据所述移动速度更新所述语音分离模型,得到更新后的所述语音分离模型。
在一种可能的实施方式中,所述更新模块,具体用于根据所述移动速度确定所述语音分离模型的参数集,得到更新后的所述语音分离模型,其中,所述参数集和所述语音分离模型的参数的变化速率相关,所述移动速度和所述变化速率呈正相关关系。
在一种可能的实施方式中,所述波束分离网络还包括解混响模型,所述解混响模型用于滤除输入的数据中的混响信号;
所述更新模块,还用于根据所述对象和所述雷达之间的距离,更新所述解混响模型,得到更新后的所述解混响模型。
在一种可能的实施方式中,所述更新模块,具体用于根据所述对象和所述雷达之间的距离,更新所述解混响模型中的延迟参数和预测阶数,得到更新后的所述解混响模型,所述延迟参数表示所述混响信号滞后于所述声源的语音数据的时长,所述预测阶数表示混响的持续时长,所述延迟参数和所述预测阶数都与所述距离呈正相关关系。
在一种可能的实施方式中,所述语音分离模块,还用于若所述声源的语音数据不符合预设条件,则去除针对所述麦克风阵列采集到的数据中所述声源对应的数据进行处理所使用的波束。
在一种可能的实施方式中,所述装置还包括活体检测单元,用于:从所述语音数据中提取特征,得到所述声源的声学特征;根据所述声学特征识别所述声源为活体的第一概率;根据所述雷达的回波数据,确定所述声源为活体的第二概率;对所述第一概率和所述第二概率进行融合,得到融合结果,所述融合结果用于表示所述声源是否为活体。
在一种可能的实施方式中,所述麦阵定位模块,具体用于若通过麦克风阵列采集到的语音信号得到多个第二角度,所述第一角度和所述多个第二角度处于同一坐标系中,则从所述多个第二角度中选取与所述第一角度之间的差值最小或者所述差值在第一预设范围内的角度作为所述入射角。
在一种可能的实施方式中,所述麦阵定位模块,具体用于在所述通过麦克风阵列采集到的语音信号获取入射角之后,若基于所述麦克风阵列再次采集到的数据得到多个第三角度,则基于所述对象的移动速度,从所述多个第三角度中选取角度作为新的所述入射角。
在一种可能的实施方式中,所述麦阵定位模块,具体用于:若所述对象的移动速度大于预设速度,则从所述多个第三角度中筛选出,与所述第一角度之间的差值在第二预设范围内的角度作为新的所述入射角;若所述对象的移动速度不大于所述预设速度,则从所述多个第三角度中筛选出,与所述第一角度之间的差值在第三预设范围内的角度作为新的所述入射角,所述第三预设范围覆盖且大于所述第二预设范围。
在一种可能的实施方式中,所述声源定位模块,还用于若所述第一位置信息中不包括所述第一角度,则将所述入射角作为所述声源相对于所述麦克风阵列的角度。
在一种可能的实施方式中,所述声源定位模块,还用于在所述通过麦克风阵列采集到的语音信号获取入射角之前,若通过所述回波数据确定对象处于运动状态,即对象在移动,且所述对象未发声,则调整所述麦克风阵列针对所述对象的声源检测阈值,所述麦克风阵列用于采集声压高于所述声源检测阈值的信号。
在一种可能的实施方式中,所述第一位置信息中还包括对象和所述雷达的第一相对距 离,所述声源定位模块,还用于若还通过麦克风阵列采集到的语音信号,获取到对象和所述麦克风阵列的第二相对距离,对所述第一相对距离和所述第二相对距离进行融合,得到融合距离,所述融合距离表示所述声源相对于所述麦克风阵列的距离,所述第二位置信息中还包括所述融合距离。
第三方面,本申请实施例提供一种声源定位装置,该声源定位装置具有实现上述第一方面声源定位方法的功能。该功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。该硬件或软件包括一个或多个与上述功能相对应的模块。
第四方面,本申请实施例提供一种声源定位装置,包括:处理器和存储器,其中,处理器和存储器通过线路互联,处理器调用存储器中的程序代码用于执行上述第一方面任一项所示的声源定位方法中与处理相关的功能。可选地,该声源定位装置可以是芯片。
第五方面,本申请实施例提供了一种声源定位装置,该声源定位装置也可以称为数字处理芯片或者芯片,芯片包括处理单元和通信接口,处理单元通过通信接口获取程序指令,程序指令被处理单元执行,处理单元用于执行如上述第一方面或第一方面任一可选实施方式中与处理相关的功能。
第六方面,本申请实施例提供了一种计算机可读存储介质,包括指令,当其在计算机上运行时,使得计算机执行上述第一方面或第一方面任一可选实施方式中的方法。
第七方面,本申请实施例提供了一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行上述第一方面或第一方面任一可选实施方式中的方法。
第八方面,本申请提供一种终端,该终端中包括雷达和处理器,该雷达和处理器之间连接,该处理器可以用于执行上述第一方面或第一方面任一可选实施方式中的方法,该雷达用于采集回波数据。
第九方面,本申请提供一种拾音装置,该拾音装置包括雷达、麦克风阵列和处理器,该雷达可以是前述第一方面所提及的雷达,该麦克风阵列可以是前述第一方面提及的麦克风阵列,该处理器可以用于执行述第一方面或第一方面任一可选实施方式中的方法。
可选地,该拾音装置可以包括八爪鱼会议设备、物联网(internet of things,IoT)或智能机器人等设备。
附图说明
图1A是本申请提供一种声源定位装置的结构示意图;
图1B是本申请提供一种雷达的结构示意图;
图1C是本申请提供一种应用场景示意图;
图1D是本申请提供另一种应用场景示意图;
图1E是本申请提供另一种应用场景示意图;
图2是本申请提供一种声源定位方法的流程示意图;
图3是本申请提供的一种角度示意图;
图4是本申请提供另一种声源定位装置的结构示意图;
图5是本申请提供的另一种角度示意图;
图6是本申请提供的另一种角度示意图;
图7A是本申请提供另一种应用场景示意图;
图7B是本申请提供另一种应用场景示意图;
图8是本申请提供另一种声源定位方法的流程示意图;
图9是本申请提供另一种应用场景示意图;
图10A是本申请提供另一种应用场景示意图;
图10B是本申请提供另一种应用场景示意图;
图11是本申请提供另一种声源定位方法的流程示意图;
图12是本申请提供另一种应用场景示意图;
图13是本申请提供另一种声源定位方法的流程示意图;
图14是本申请提供另一种声源定位装置的结构示意图;
图15是本申请提供另一种声源定位装置的结构示意图;
图16是本申请提供一种芯片的结构示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本申请提供的声源定位方法可以由拾音设备执行,应用于各种需要进行拾音的场景,例如,视频通话、语音通话、多人会议、录音或者录视频等场景。
首先介绍本申请提供的声源定位装置,该声源定位装置可以包括多种可以进行拾音的终端,该终端可以包括智能移动电话、电视、平板电脑、手环、头戴显示设备(Head Mount Display,HMD)、增强现实(augmented reality,AR)设备,混合现实(mixed reality,MR)设备、蜂窝电话(cellular phone)、智能电话(smart phone)、个人数字助理(personal digital assistant,PDA)、平板型电脑、车载电子设备、膝上型电脑(laptop computer)、个人电脑(personal computer,PC)、监控设备、机器人、车载终端、穿戴设备或者自动驾驶车辆等。当然,在以下实施例中,对该终端的具体形式不作任何限制。
示例性地,声源定位装置(或者也可以称为拾音装置)的结构可以如图1A所示,该声源定位装置10可以包括雷达101、麦克风阵列102和处理器103。
该雷达101可以包括激光雷达或者24GHz以上电磁波的毫米波雷达等,其天线可以是多发多收天线,当然也可以是单天线。在本申请的以下实施方式中,以毫米波雷达进行示例性说明,本申请以下所提及的毫米波雷达也可以替换为激光雷达。例如,该雷达可以是工作频率为60GHz的毫米波雷达,如调频连续波(frequency modulated continuous Wave,FMCW)或单频连续波雷达等。
麦克风阵列102可以包括多个麦克风组成的阵列,用于采集语音信号。该多个麦克风组成的结构可以包括集中式阵列结构,也可以包括分布式阵列结构。例如,当用户发出的 语音声压超过音源检测阈值,则通过麦克风阵列来采集语音信号,每个麦克风可以形成一路语音信号,多路语音信号融合后形成当前环境下采集到的数据。集中式阵列结构例如,如图1B所示,多个麦克风按照一定距离排列成一定几何形状的结构,如每个麦克风的距离为10cm,组成圆形阵列。分布式阵列结构例如,如图1B所示,可以在会议桌上的多个不同的位置设置麦克风。
处理器103可以用于对雷达回波数据或者麦克风阵列采集到的数据进行处理,从而提取出声源对应的语音数据。可以理解为,可以通过该处理器103执行本申请提供的声源定位方法的步骤。
可选地,该声源定位装置可以包括八爪鱼会议设备、物联网(internet of things,IoT)或智能机器人等设备。
示例性地,雷达101的结构可以如图1C所示,该雷达具体可以包括发射机1014、接收机1015、功率放大器1013、功分耦合器1012、混频器1016、波形发生器1011、模数转换器(analogue-to-digital conversion,AD)1017及信号处理器1018等模块。
其中,波形发生器1011产生所需要的频率调制信号。该频率调制信号经功分耦合器1012后分为两路信号,一路信号经功率放大器1013放大后,经发射机1014产生发射信号,并通过发射天线进行辐射。另一路信号部分作为本振,与接收机1015通过接收天线接收到的回波信号于混频器1016中产生中频信号。然后经AD转换器1017转换为数字信号,信号处理器1018的主要目标为于中频信号中提取频率信息,并通过进一步处理得到距离、速度等目标基本信息,用于后续声源定位。
具体地,雷达的发射机可以不断发射调制信号,调制信号遇到目标物后反射被雷达接收机接收,在进行手势这段时间,雷达信号携带手势的距离、角度(方位角或俯仰角等)、多普勒信息、微多普勒信息等被捕获,形成当前手势的数据再进行后续处理。
例如,毫米波雷达采用FMCW雷达,该雷达具有多种优点,如硬件处理相对简单,容易实现、结构相对简单、尺寸小、重量轻以及成本低,适合数据采集并进行数字信号处理;理论上不存在FMCW雷达所存在的测距盲区,并且发射信号的平均功率等于峰值功率,因此只需要小功率的器件,从而降低了被截获和干扰的概率。
下面对本申请提供的声源定位装置的应用场景进行示例性说明。
例如,如图1D所示,可以在会议桌设置声源定位装置10,用户可以使用该声源定位装置10进行视频会议,可以通过声源定位装置10来对讲话的用户进行跟踪,从而提取出讲话的用户的语音,以便接收端可以准确地区分出讲话的对象。
还例如,如图1E所示,在智能电视场景中,用户可以通过设置在智慧屏上的声源定位装置10来实现对显示屏的控制或者实现对其他智能设备的控制,通过声源定位装置可以实现对声源的准确跟踪,从而准确地提取到用户的语音数据。
通常,基于麦克风阵列的波束形成能够准确拾取到语音信号,被广泛应用于各类语音交互的场景中,能够有效抑制环境噪声,压制室内混响,且不明显损伤语音。波束形成依赖于声源的位置的准确估计,尤其针对自适应波束形成,对声源及其定位机器敏感,几度的位置偏差就会导致拾音性能的大幅下降。并且,麦克风阵列能够解决单声源定位,但不 能有效定位多个时间上交叠的声源,尤其在多声源移动的场景下,几乎不能正常工作。然而,日常声学环境中声源交叠与移动频繁发生,麦克风阵列不能有效拾音,若以限制交互应用范围和牺牲用户体验为代价,采取“唤醒词”的方式简化为单目标源场景。但在某些诸如智能会议场景下,多个与会成员对会议系统发起会话的时候,很难将其简化为单源场景,系统不能并行拾取多人说话的声音。
雷达可以做到准确的定位和跟踪运动/微动的多个目标源,本申请将雷达的定位能力引入到拾音中,和麦克风阵列定位技术形成强互补关系,提高多声源场景的定位与追踪的准确性和鲁棒性,提升麦克风阵列的拾音能力。
在一种情况中,若采用毫米波雷达定位人体目标方位,然后采用波束形成技术驱动波束指向人体方位,麦克风阵列不参与声源检测与定位。雷达检测到人体目标,波束开启;一对象对应一波束,麦克风阵列拾取语音信号。此情况下,完全依赖雷达检测声源的位置,但是雷达不能确定静止人体的位置,会遗漏静止发声声源,在人体密集场景下,雷达检测到的目标过多,易形成过多波束,导致声源定位装置计算过载。
并且实际上,发声声源才是设备感兴趣的目标;活体语音的检测即区分扬声器语音和发音器官直接产生的语音对于语音交互具有重要意义,当前的活体语音检测技术依赖于单模态语音,只能满足近距离(如1米内)的检测需求,难以鉴别嘈杂环境下的远距离语音源。容易将扬声器产生的语音作为活体发出的语音,引起误判。
因此,本申请提供一种声源定位方法,结合了雷达和麦克风阵列,对声源进行准确定位,进而针对声源进行精准的波束控制,提取到声源对应的语音数据,且进一步可以提取到由活体发出的语音,下面对本申请提供的方法进行详细介绍。
参阅图2,本申请提供的一种声源定位方法的流程示意图,如下所述。
201、通过雷达回波数据确定第一位置信息。
其中,该第一位置信息可以包括对象相对于雷达的距离、角度或者速度等信息。
具体地,雷达可以向辐射范围发射调制波,该调制波经对象反射后被雷达接收,形成回波信号,从而得到回波数据。该回波数据包括了检测到的一个或者多个对象在雷达的检测范围内进行运动时产生的信息,如用户在辐射范围内进行移动时产生的轨迹的信息。
更具体地,该回波数据中可以包括声源在雷达的辐射范围内时,相对于雷达的速度、相对于雷达的距离、角度、声源运动幅度、声源运动的周期、雷达的回波相对于发射信号的频移、雷达的回波相对于发射信号相位或声源运动的加速度等中的一项或者多项。该角度可以包括俯仰角或方位角。
例如,雷达定位信息中可以包括对象相对于雷达的距离或角度,距离信息蕴含于各回波脉冲的频率中,可通过对单个脉冲进行快速傅立叶变换,获得对象于当前脉冲时间内的距离信息,对各脉冲距离信息进行整合,即可得到对象的整体距离变化信息。该角度可以包括方位角和俯仰角,角度的获取基于雷达的多接收天线,通过测量各接收回波的相位差实现。回波信号与接收天线之间可能因反射对象的位置而存在一定角度,可以通过计算的计算出该角度,从而可以获知到反射对象的具体位置,进而获知对象的位置变化情况。计算角度的方式可以包括多种,如以雷达为中心建立坐标系,基于回波数据计算对象在该坐 标系内的位置,从而得到俯仰角或方位角。
例如,当声源在辐射范围内移动时,即可基于雷达接收到的一段时长内的回波信号,得到声源在辐射范围内移动的速度、相对于雷达的距离、运动幅度或者相对于雷达的角度等信息。
示例性地,如图3所示,可以建立三维坐标系,(x,y)对应H-plane平面,(y,z)对应E-plane平面。在H-plane平面中,以雷达所在的位置为原点,以x轴为极轴,在该平面中对象的坐标可以表示为(r 1,α),α表示方位角。在E-plane平面中,可以以雷达所在的位置为原点,z轴为极轴,对象的坐标可以表示为(r 2,β),β表示俯仰角。
在一种可能的实施方式中,若先通过回波数据确定雷达的检测范围内运动的对象的位置信息,即通过雷达检测到对象在运动,且对象未发声,即未检测出该声源的入射角,则调整麦克风阵列针对对象的声源检测阈值,如减少该声源检测阈值,以提高麦克风阵列采集语音信号的灵敏度,麦克风阵列用于采集声压高于声源检测阈值的信号。
通常,麦克风阵列通常可以采集声压超过一定阈值的语音信号,如语音的声压超过阈值,以下将该阈值统称为音源检测阈值,未超过阈值的语音信号通常丢弃。当通过雷达回波数据确定辐射范围内存在运动的对象时,可以通过控制麦克风阵列的音源检测阈值,来提高针对运动的对象的拾音灵敏度。通常,音源检测阈值越高,拾音灵敏度越低,声源检测阈值越低,拾音灵敏度越高。例如,可以将雷达检测到的具有移动的对象的位置区域作为候选位置,麦克风阵列在对候选位置进行拾音时,所设置的音源检测阈值更低,从而使麦克风阵列可以对移动的对象进行准确拾音。具体例如,在雷达未检测到移动对象的区域,声源检测阈值设置为μ1,当通过雷达回波数据检测到某一方向存在运动对象时,则将对该方向上进行拾音所使用的声源检测阈值设置为μ2,且μ2<μ1,从而提高麦克风阵列对该方向上进行拾音时的灵敏度,减少声源漏检的情况。
可以理解为,本实施例中,可以通过雷达为麦阵指示声源的候选位置,从而降低声源检测阈值,从而提高提高麦阵对候选区域的检测灵敏度,防止声源漏检,还可以提高候选区域外的检测阈值,降低检测灵敏度,防止声源误检。
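For illustration only, a minimal Python sketch of the direction-dependent detection threshold described above: when the radar reports a moving but silent object in some direction, the detection threshold for that direction is lowered and all other directions keep the base threshold. The helper name, the threshold values (for example 43 dB and 38 dB) and the angular tolerance are assumptions not specified by the disclosure.

```python
def update_detection_thresholds(radar_moving_dirs_deg, all_dirs_deg,
                                base_thresh_db=43.0, focus_thresh_db=38.0,
                                tol_deg=10.0):
    """Per-direction sound-pressure detection threshold (in dB).

    Directions close to a radar-detected moving (but silent) object get the
    lower threshold, making the microphone array more sensitive there.
    """
    def ang_diff(a, b):
        return abs((a - b + 180.0) % 360.0 - 180.0)

    thresholds = {}
    for d in all_dirs_deg:
        near_mover = any(ang_diff(d, m) <= tol_deg for m in radar_moving_dirs_deg)
        thresholds[d] = focus_thresh_db if near_mover else base_thresh_db
    return thresholds

# Example: a silent mover detected around 90 degrees lowers the threshold there.
print(update_detection_thresholds([90.0], list(range(0, 360, 30))))
```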
202、通过麦克风阵列采集到的语音信号获取入射角。
其中,麦克风阵列中包括了多个麦克风,用于将声波信号转换为数字信号。可以利用麦阵获得的信号进行声源检测以及声源定位,获得声源相对于麦克风阵列的入射角。
本申请实施方式中,以下麦克风阵列检测到的入射角和雷达检测到的角度,通常可以是同一坐标系下的角度。例如,如图4所示,麦克风阵列401可以是多个麦克风排列而成的阵列,雷达402的中心点可以与麦克风阵列的中心点重合。
此外,若麦克风检测到的角度和雷达检测到的角度不在同一坐标系中,则可以对麦克风检测到的角度和雷达检测到的角度进行对齐,使麦克风检测到的角度和雷达检测到的角度在同一坐标系中。
例如,若麦克风阵列为分布式阵列,则可以将麦克风阵列中的其中一个麦克风作为参考麦克风,在得到每个麦克风的入射角之后,对每个麦克风的入射角进对齐融合,将每个麦克风得到的入射角转换为参考麦克风的角度。然后将参考麦克风上的入射角和雷达检测 到的就角度进行对齐,使之处于相同的坐标系中。
具体地,入射方向可以通过方位角或者仰角来表示。
示例性地,以集中式麦克风阵列的平面阵列为例,如图5所示,以阵列的中心作为坐标系的原点,建立了三维坐标系,声源发出的声音信号的传播方向可以如图5中所示的γ,α表示方位角,β表示俯仰角。
为便于理解,下面示例性地,结合前述图5,对本申请中获取入射角的方式进行示例性说明。
在每个时刻,将每个麦克收到的连续语音信号按照预设时长切割为多帧,相邻帧之间可能存在交叠。例如,可以将麦克收到的语音信号切割为32ms的帧,前后帧之间交叠的长度为50%,从而保持帧信号的连续。对帧信号进行傅里叶变换,输出傅里叶变换的复系数,然后判断哪些方向存在声源。
具体可以通过假设测试,来确定哪些方向存在声源。例如,可以采取格点搜索的方式来实施假设测试,将所有可能的入射方向均匀划分为多个离散的方向,如方位角[0°,360°]区间按照间隔1度划分为360个方向,仰角区间[0,90]可按照3度划分为30个区间,用30*360来表示空间内的所有方向。
在每个格点上,根据假设方向来判断声波到达各个麦克的传播距离的差值,各个麦克的距离是已知的,选取任意麦克风阵列中的任意一个麦克作为参考麦克,然后逐个选择其余麦克作为考察对象。
该考察麦克到参考麦克之间的向量记为平面内的空间单位矢量：g_m=[g_{m,1}, g_{m,2}, 0]^T，将格点对应的方向标记为γ=[cosα·cosβ, sinα·cosβ, sinβ]^T，则声波传播在考察麦克和参考麦克之间的时间差为
τ_m = (g_m^T·γ)/c（c为声速），
如图6所示。在每个频率分量ω上获得对应的延迟因子
e^{-jωτ_m}，
并乘上该声源信号的频率分量,则将该考察麦克上的信号与参考麦克对齐时间,从而消除了时间差的影响。通常,若得到的格点和真实入射方向接近或者吻合,则在消除了时间差之后,各个麦克之间的信号的相似性置信度最高。
通常,可以引入相干系数来衡量信号之间的相似性,在得到所有考察麦克与参考麦克之间的相干性系数之后,汇聚所有考察麦克和参考麦克之间的相干系数,得到阵列总体的信号相似度。例如,考察麦克和参考麦克之间的整体相似度度量可以表示为:
Figure PCTCN2021132081-appb-000003
Figure PCTCN2021132081-appb-000004
w_m表示第m个麦克信号所占权重，s_m表示该麦克接收到的信号，M表示麦克的总数。
随后,可以选取相干系数最大的几个极点作为声源入射的候选方向。通常,当每个候选方向对应的极值超过预设值时,即可将该候选方向作为声源的入射方向。
因此,本申请实施方式中,可以通过用于衡量信号之间的相似度的相干系数,来选取声源的候选方向,并基于极值来选择与声源更匹配的方向,可以更准确地确定出声源的位置。
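For illustration only, a simplified Python/NumPy sketch of the grid-search idea above: for each candidate azimuth, the expected inter-microphone delays are compensated in the frequency domain and a coherence-style score is accumulated over microphone pairs; directions whose score peaks above a threshold are kept as candidate incidence angles. The array geometry handling, the scoring expression and the speed-of-sound value are assumptions for this sketch rather than the exact formulation of the disclosure.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value

def srp_scores(frames_fft, freqs, mic_pos, azimuths_deg):
    """Coherence-style score per candidate azimuth (elevation omitted for brevity).

    frames_fft: (M, F) complex FFT of the current frame for each microphone.
    freqs:      (F,) frequency axis in Hz.
    mic_pos:    (M, 2) microphone coordinates in metres (mic 0 is the reference).
    """
    ref = frames_fft[0]
    scores = []
    for az in np.deg2rad(np.asarray(azimuths_deg, dtype=float)):
        direction = np.array([np.cos(az), np.sin(az)])
        score = 0.0
        for m in range(1, frames_fft.shape[0]):
            # expected propagation delay between examined mic m and the reference
            tau = (mic_pos[m] - mic_pos[0]) @ direction / SPEED_OF_SOUND
            # compensate the delay in the frequency domain, then measure similarity
            aligned = frames_fft[m] * np.exp(-1j * 2 * np.pi * freqs * tau)
            num = np.abs(np.vdot(ref, aligned))
            den = np.linalg.norm(ref) * np.linalg.norm(aligned) + 1e-12
            score += num / den
        scores.append(score)
    return np.array(scores)  # peaks above a threshold give candidate incidence angles
```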
在一种可能的场景中,通过麦克风阵列可能检测出多个角度,且该多个角度对应一个声源,此时需要对该多个角度进行筛选,滤除无效的角度。此处将筛选之前的角度称为候选角,筛选之后的角度称为入射角。
具体筛选入射角的方式可以包括:若通过雷达回波数据检测到运动的对象,得到的位置信息中包括该对象相对于雷达的第一角度。此时,可以对多个入射角分别与第一角度进行对比,选择出与第一角度的差值最小或者差值在第一预设范围内的角度作为入射角。
此外,若通过雷达回波数据确定了多个移动的对象,得到了多个角度,且通过麦克风阵列获取到入射角,则可以比较该多个角度和入射角,将与入射角的差值最小的角度作为第一角度,并对该第一角度和入射角进行加权融合,得到声源的角度。或者,若通过麦克风阵列确定了多个入射角,但仅有一个对象发声,且通过雷达回波数据确定了移动的对象的位置,则可以将该多个入射角和对象的角度最近的角度作为入射角。
在另一种可能的场景中,可能通过麦克风阵列获取到多个角度,即多个第三角度,则可以基于对象的移动速度从该多个第三角度中选择其中一个角度作为入射角。
具体地,若对象的移动速度大于第一预设值,则从多个第三角度中筛选出与第一角度之间的差值在第二预设范围内的第三角度作为新的入射角;若对象的移动速度不大于第一预设值,则从多个第三角度中筛选出,与第一角度之间的差值在第三预设范围内的第三角度作为新的入射角,第三预设范围覆盖且大于第二预设范围。
例如,第二预设范围可以是大于第一阈值的范围,第二范围可以是大于第二阈值的范围,第二阈值小于第一阈值,因此第三预设范围覆盖且大于第二预设范围。在某一个时刻检测到用户的方位角或者俯仰角为第一角度,在此后的一段时间内,用户处于运动状态,用户的发声位置改变,通过麦克风采集到多个入射角。若用户的移动速度较快,在此过程中用户的发声位置可能改变较快,此时可以选择与第一角度之间的角度在第二预设范围内的角度作为新的入射角。若用户移动速度较慢或者接近静止,在此过程中用户的发声位置可能较慢改变,通过麦克风阵列可能检测到用户在多个位置产生的语音信号,对应多个不同的入射角,此时可以选择与第一角度之间的差值在第三预设范围内的角度作为新的入射角。
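For illustration only, a minimal Python sketch of selecting a new incidence angle from several candidate angles based on the object's moving speed, following the passage above; the speed threshold and the two angular ranges standing in for the second and third preset ranges are assumed values.

```python
def pick_new_incident_angle(candidate_angles_deg, first_angle_deg, speed,
                            speed_thresh=0.5, range_fast_deg=15.0,
                            range_slow_deg=45.0):
    """Select a new incidence angle from candidates obtained by the array.

    speed > speed_thresh  -> keep candidates within range_fast_deg of the
                             radar angle (standing in for the second preset range);
    otherwise             -> keep candidates within the wider range_slow_deg
                             (the third preset range, which covers the second).
    """
    limit = range_fast_deg if speed > speed_thresh else range_slow_deg

    def diff(a):
        return abs((a - first_angle_deg + 180.0) % 360.0 - 180.0)

    in_range = [a for a in candidate_angles_deg if diff(a) <= limit]
    if not in_range:
        return None  # no suitable candidate; keep the previous incidence angle
    return min(in_range, key=diff)
```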
在一种可能的实施方式中,若麦克风阵列采用分布式阵列,则除了可以得到语音信号的入射角之外,还可以得到发出语音信号的对象与麦克风阵列的距离。例如,已知麦克风阵列中各个麦克风之间的距离,可以根据语音信号到达各个麦克风的时刻,结合各个麦克风之间的距离,计算出对象和每个麦克风之间的距离,然后将对象和参考麦克风之间的距离作为对象和麦克风阵列的距离。
需要说明的是,本申请对步骤201和步骤202的执行顺序不作限定,可以先执行步骤201,也可以先执行步骤202,还可以同时执行步骤201和202,具体可以根据实际应用场景进行调整。
203、判断第一位置信息中是否包括第一角度,若是,则执行步骤205,若否,则执行步骤204。
在得到第一位置信息之后,可以判断该第一位置信息中是否包括声源相对于雷达的角度。若该第一位置信息中包括该角度,则可以基于该第一位置信息和入射角进行融合,得到声源的第二位置信息,即执行步骤205。若该第一位置信息中不包括声源相对于雷达的角度,则可以直接将通过麦克风阵列中检测到的入射角作为声源相对于麦克风阵列或者雷 达的角度,即执行步骤204。
通常,可以通过雷达来检测与雷达之间存在相对运动的对象的位置,当通过雷达回波数据不能检测到对象的角度,表示声源与雷达之间不存在相对运动,此时不能通过雷达回波数据确定发声的对象,可以仅参考入射角来确定声源的位置。
此外,本步骤针对雷达为毫米波雷达的情况,通常,毫米波雷达根据人体运动产生的多普勒效应确定人体所在的空间位置,既表达了人体相对于雷达的方向,也表达了人体距离毫米波雷达的距离,但不能检测静止目标。因此,在使用毫米波雷达进行定位时,若辐射范围内并不存在运动对象,则通过回波数据可能并不能得到静止对象的角度、速度等位置信息。而当本申请所提及的雷达替换为激光雷达时,则可以对辐射范围内的对象直接进行定位,而无需判断定位得到的位置信息中是否包括角度,即无需执行步骤203。因此,是否执行步骤203可以结合实际应用场景进行判断,此处仅以执行步骤203为例进行示例性说明,并不作为限定。
在本申请实施方式中,若采用毫米波雷达进行定位,即声源定位装置通过两个模态定位来实现声源定位,即通过雷达定位和麦阵定位来实现声源定位,毫米波雷达对动态对象进行定位,麦阵定位语音信号的入射方向,由于两个模态定位的原理不同,优缺点差异显著,存在强互补关系,融合能够产生“1+1>2”效果,从而实现稳定而准确的声源方位估计。
204、将入射角作为声源相对于麦克风阵列的角度。
其中,当雷达检测到的信息中不包括声源的入射角,即通过雷达回波数据不能确定哪个对象作为声源,则可以直接将入射角作为声源相对于麦克风阵列或者雷达的角度,从而得到声源的位置。
例如,当雷达辐射范围内有用户在讲话而并未移动,则可以通过麦克风阵列采集到声源的入射角,而雷达回波数据中无法判断出移动的对象,即通过雷达回波数据仅能检测出辐射范围内的对象与雷达之间的距离,而不能确定是哪个位置的对象在发声,此时可以直接将麦克风阵列的入射角作为声源相对于麦克风阵列或者雷达的角度,从而确定出发声源的位置。
此外,若还通过麦克风阵列采集到的数据检测到对象相对于麦克风阵列的距离,则可以直接将该距离作为声源相对于麦克风阵列的距离,以用于后续进行波束分离。
205、对第一角度和入射角进行加权融合,以得到第二位置信息。
若通过雷达回波数据检测到声源相对于雷达的第一角度,则可以对该第一角度和入射角进行加权融合,得到融合角度,从而得到声源的位置信息,即第二位置信息。该第二位置信息中包括该融合角度。
具体地,可以分别确定第一角度对应的第一权重和入射角对应第二权重,其中,第一权重和对象相对于雷达的移动速度呈正相关关系,第二权重和对象相对于雷达的移动速度呈负相关关系;根据第一权重和第二权重对角度和入射角进行加权融合,得到融合角度,第二位置信息中包括融合角度。通常,运动对象的移动速度超过预设移动速度值时,可以提高第一角度的权重,降低入射角的权重,运动对象的速度不超过预设移动速度值时,则 可以降低第一角度的权重,提高入射角的权重,从而可以适用不同状态下的声源,提高声源定位的准确性。可以理解为,通常,若声源移动速度较快,此时语音信号入射至麦克风阵列的变化也就越快,入射角的变化也就越快,可能不能确定出哪一个角度是与声源匹配的角度,而雷达可以准确地对移动的对象进行定位。因此,对象的移动速度越快,则可以给予雷达检测到的角度越高的权重,可以使得到的融合角度更准确。因此,结合雷达定位和麦克风阵列定位,可以准确地定位出声源的位置。
例如，第一角度可以表示为θ_r，对应权重值为c_1，入射角可以表示为θ_m，对应权重值为c_2，融合后的角度表示为：θ_fusion = c_1·θ_r + c_2·θ_m。
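For illustration only, the weighted fusion above can be sketched in Python as follows; the specific mapping from the moving speed to the weights c_1 and c_2 is an assumption (the text only fixes that c_1 rises and c_2 falls with the speed), and the function name is hypothetical.

```python
def fuse_angles(theta_radar_deg, theta_mic_deg, speed, speed_ref=1.0):
    """theta_fusion = c1 * theta_radar + c2 * theta_mic, with c1 + c2 = 1.

    c1 (radar weight) increases with the object's speed relative to the radar,
    c2 (microphone weight) decreases with it.
    """
    c1 = min(speed / (speed + speed_ref), 0.95)  # rises with speed, capped
    c2 = 1.0 - c1
    return c1 * theta_radar_deg + c2 * theta_mic_deg

# A slow (nearly static) talker: the microphone incidence angle dominates.
print(fuse_angles(theta_radar_deg=30.0, theta_mic_deg=34.0, speed=0.2))
```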
该第二位置信息中除了可以包括融合角度之外,还可以包括声源相对于雷达或者麦克风阵列的距离、声源的运动速度、声源的加速度等。若该第二位置信息还包括距离、声源的运动速度、声源的加速度等信息,则该距离、声源的运动速度、声源的加速度等信息可以是通过雷达回波数据得到。
此外,还可以通过麦克风阵列采集到的数据检测到对象相对于麦克风阵列的距离,对通过麦克风采集到的距离和雷达采集到的距离进行融合,得到声源相对于麦克风阵列的距离,以便于进行后续的波束分离。例如,对第一相对距离和第二相对距离进行融合,得到融合距离,融合距离表示所述声源相对于所述麦克风阵列的距离,第二位置信息中还包括该融合距离,该第一相对距离是对象相对于雷达的距离,第二相对距离是对象相对于麦克风阵列的距离。
例如,用户在发声的同时也在进行移动,此时可以通过雷达回波数据检测到移动的用户的位置信息,该位置信息中包括了对象相对于雷达的角度,如方位角或者俯仰角等,则可以对该对象相对于雷达的角度和入射角进行加权融合,从而得到声源相对于雷达或者麦克风阵列的角度,若通过麦克风阵列采集到距离,在可以对该距离和雷达采集到的距离进行融合,从而得到声源相对于麦克风的位置。
通常,麦克风阵列的定位只能获取声源发出声源的入射角度,且声源定位的准确度要低于雷达定位,因此雷达用于跟踪声源具有更好的效果,但是雷达对静止目标不敏感,会忽视静止发声声源,此时需要依靠麦克风进行声源检测和声源定位,确定发声目标的入射角。融合麦克风阵列和雷达定位信息,可以得到更准确地声源的位置,实现目标的持续跟踪拾音。尤其在对象运动速度较低或者静止的情况,可以给予入射角度更高的权重,从而使融合角度更准确。
更具体地,声源发声的情况分为多种,声源发声前运动;声源发声后运动;声源发声的同时运动,下面结合具体场景进行示例性说明。
当声源在发声前即运动时,如图7A中所示用户从位置S 1运动至位置S r,可以是雷达先捕获到运动的对象的位置信息,将声源的初始位置设置为雷达检测到的对象的位置,表示为雷达源S rr)。例如,对象可以在雷达的辐射范围内运动,但未发声,则可以将通过雷达检测到的位置作为声源的初始位置,并对对象进行持续跟踪,跟踪对象的位置变化,以便后续可以快速准确地检测到该对象是否发声。此外,在此场景下,还可以降低针对该对象所在其余的音源检测阈值,从而提高对该对象发出的语音的关注度。
当声源先发声后运动或者不运动时,如图7B所示,可能是麦克风阵列先获取到声源发出的声音的入射角,将声源的初始设置为麦克风阵列检测到的对象的位置,表示为发声源CS mm)。例如,对象可以先发声,但未在雷达的辐射范围内运动,则通过雷达回波数据,并不能确定是哪一个对象在发声,此时将麦克风阵列检测到的位置作为声源的初始位置,以便在后续对象运动时,通过雷达回波数据得到对象更准确的位置。
当声源在发声的同时开始运动,则可以将雷达源S rr)作为声源的初始位置,也可以将发声源CS mm)作为声源的初始位置,具体可以根据实际应用场景来确定。
在融合对象的位置时,若对象处于运动状态,则可以给予从雷达回波数据中得到的角度更高的权重,而若对象处于静止状态时,则可以给予通过麦克风阵列检测到的入射角更高的权重。可以理解为,若对象处于运动状态,则可以通过雷达更准确地检测到对象的位置,此时提高第一角度的权重,使最终得到的声源的角度更准确。而在对象处于静止状态时,通过雷达回波数据可能不能准确地识别出发声对象,此时可以提高入射角的权重,使最终得到的声源的角度更准确。
206、基于第二位置信息从麦克风阵列采集到的语音信号中提取声源的语音数据。
在得到第二位置信息之后,即可获知声源的具体位置,从而可以根据该第二位置从麦克风阵列采集到的语音信号中提取到该声源的语音数据。
例如,在获知声源相对于麦克风阵列的方向之后,即可开启该方向上的波束,从而提取到声源的语音数据。
在一种可能的实施方式中,可以通过波束分离网络来输出声源的语音数据。例如,可以将麦克风阵列采集到的数据作为该波束分离网络的输入,输出声源的语音数据和背景数据,该背景数据即输入数据中除了声源的语音数据外的其他数据。
因此,在本申请实施方式中,可以结合雷达检测到的对象的位置和麦克风阵列检测到的入射角,得到声源相对于麦克风阵列的位置,从而通过该位置控制用于分离声源的语音的波束的开启,从而准确地从麦克风阵列采集到的数据中提取到声源的语音数据。并且,无论发声对象处于静止或者运动状态,都可以准确地确定出声源的位置,可以更准确地提取到声源的语音数据。
前述对本申请提供的声源定位方法进行了详细介绍,下面结合更具体的应用场景,对本申请提供的声源定位方法进行更详细的介绍。
参阅图8,本申请提供的另一种声源定位方法的流程示意图,如下所述。
801、通过麦克风阵列采集到的语音信号获取入射角。
802、通过雷达接收到的回波数据确定第一位置信息。
803、判断第一位置信息中是否包括第一角度,若否,则执行步骤804,若是,则执行步骤805。
804、将入射角作为声源相对于麦克风阵列的角度。
805、对第一角度和入射角进行加权融合,以得到第二位置信息。
其中,步骤801-805可以参阅前述步骤201-205中的相关描述,对于类似之处此处不再赘述,本实施例仅对存在区别的步骤或者更详细的应用场景进行介绍。
场景一、首次检测到声源
在一种可能的实施方式中,在声源的初始位置为雷达源S rr)的场景下时,接下来可能发生多种情况,示例性地,以通过雷达回波数据得到的声源位置信息中包括方位角为例,对一些场景进行介绍,以下所提及的方位角在不同的场景中也可以替换为俯仰角。
1、若通过雷达回波数据确定辐射范围内存在运动的对象(表示为雷达源S rr)),但该对象未发声,此时通过麦克风阵列不能检测到准确的该对象的入射角。但接下来可能出现以下情况:
(1)、该对象持续未发声。
此时,该对象的位置仅能由雷达跟踪得到,若此场景下出现多个发声源,则为降低设备的负载,此时可以忽略该对象。例如,若通过雷达回波数据确定辐射范围内存在多个对象运动并发声,发声的对象的数量超过预设数量,此时可以忽略持续运动但未发声的对象,从而降低设备的负载,降低设备功耗。
(2)、该对象发声。
此时,可以理解为通过雷达检测到的发声的方向与通过麦克风阵列检测到的发声的方向接近,可以将麦克风检测到的角度作为声源的入射角。例如,若在S rr)±θ thd0的范围内检测到入射角度,则可以将该角度作为与发声对象匹配的入射角,若在S rr)±θ thd0的范围内检测到多个角度,则可以选择与方位角最接近的角度作为声源的入射角。
如图9所示,对象相对于声源定位装置的方向为a方向,检测到两个候选角b和c,其中,a和b之间的角度差为θ 1,c和a之间的角度差为θ 2,且θ 21,则可以将候选声源b对应的角度作为入射角,丢弃候选声源c或者将声源c对应角度作为新声源的入射角等。
2、对象可能先发声,因此由麦克风阵列先检测到语音信号的入射角,在此场景下,示例性地,接下来可能出现以下多种情况。
(1)、对象静止
此时,通过雷达不能检测到发声对象的位置,可以直接将通过麦克风阵列检测到的候选声源CS mm)作为实际声源S m(θ)。
(2)、对象运动。
此时，可以通过雷达回波数据检测到对象运动的方向、角度或者距离等，得到雷达源S_r(θ_r)，然后关联该雷达源S_r(θ_r)和声源S_m(θ_m)，得到实际声源，实际声源相对于雷达或者麦克风阵列的角度可以表示为：θ_fusion = c_1·θ_r + c_2·θ_m，c_1、c_2为权重值，θ_r为通过雷达回波数据得到的第一角度，可以包括方位角或者俯仰角，θ_m为语音信号相对于麦克风阵列的入射角，可以包括方位角或者俯仰角。
因此,在本申请实施方式中,可以结合麦克风阵列采集到的入射角和雷达采集到的数据,在各种场景下都可以对声源进行准确定位,泛化能力强,提高后续得到声源的语音数据的准确性。
场景二、持续检测到声源
在对声源的发声进行持续跟踪的过程中,声源可能处于运动状态,因声源的位置变化可能导致麦克风阵列检测到多个信号入射角度,此时需要从多个入射角度中筛选出与声源匹配的角度作为入射角,或者筛选出新声源的入射角。
1、从多个入射角度中筛选出与声源匹配的角度作为入射角,可以包括:若该多个入射角度中存在与方位角之间差值在S rr)±θ thd0的范围内,则可以选择与方位角最接近的入射角度作为声源的入射角。
2、筛选出新声源的入射角的方式可以包括:基于对象的运动速度对该多个入射角度进行筛选,选择出新声源的入射角。例如,因对象在运动的过程中发声,通过麦克风阵列可能得到多个候选位置,如表示为:(CS m1m1),CS m2m2),…,CS mkmk)),且都不在S rr)±θ thd0的范围内,则根据雷达源S rr)的方位角,对候选角进行筛选,筛选出新的入射角。
筛选候选角的方式可以包括:
当对象的速度小于预设速度时,筛选出在雷达源S rr)的±θ thd1范围外(即第二预设范围)的候选角作为新的入射角。
例如,如图10A所示,t1时刻至tn时刻对象处于运动中,速度为v1,在这个过程中,可以选择与雷达源S r(θ)的±θ thd1范围外的候选角作为新的入射角,丢弃雷达源S rr)的±θ thd2范围内的候选角。
当对象的移动速度不小于预设速度时,筛选出在雷达源S rr)±θ thd2范围外(即第三预设范围)的候选角作为新的入射角。
例如,如图10B所示,t 1时刻至t n时刻对象处于运动中,速度为v 2,v 2>v 1,在这个过程中,可以选择与雷达源S rr)的±θ thd2范围外的候选角作为新的入射角,θ thd2thd1,丢弃雷达源S rr)±θ thd2范围内的候选角。
为便于理解,下面结合图11对具体的应用场景进行示例性介绍。
声源定位装置包括雷达1101和麦阵1102(即麦克风阵列)。
通过雷达1101接收到的回波数据定位到对象的位置S rr)1103,或称为雷达源,θ r即对象相对于雷达的方位角或者俯仰角等角度。
通过麦阵1102定位到候选声源CS mm)1104,或者称为发声源,θ m即语音信号相对于麦克风阵列的角度,具体也可以包括方位角或者俯仰角。
然后执行步骤1105,判断θ r和θ m之间的差值是否小于θ thd0。即判断θ r和θ m是否接近。
若θ r和θ m之间的差值小于θ thd0,则表示存在与θ r接近的入射角,则执行步骤1106,融合对象θ r和θ m,即得到融合角度θ fusion=c 1θ r+c 2θ m,c 1、c 2为权重值。
若θ_r和θ_m之间的差值不小于θ_thd0，则表示不存在与θ_r接近的入射角，然后可以执行步骤1107，判断对象的运动速度是否大于预设速度。具体地，根据雷达回波数据可以得到对象的位置随时间变化的趋势，即可估计出对象的运动速度，例如，对象在T时间段内的轨迹位置为([x_1,y_1],[x_2,y_2],…,[x_t,y_t])，则
v = (1/T)·Σ_{i=2}^{t} ‖[x_i, y_i] − [x_{i−1}, y_{i−1}]‖，
然后判断v是否大于v_thd。
若对象的运动速度大于预设速度,即v>v thd,则执行步骤1108,即判断θ r和θ m之间 的差值是否小于θ thd1,θ thd1thd0。若θ r和θ m的差值小于θ thd1,则屏蔽CS mm)(即步骤1110),若θ r和θ m的差值不小于θ thd1,则将结合CS mm)和S rr)得到新声源(即步骤1111)。
若对象的运动速度不大于预设速度v≤v thd,则执行步骤1108,即判断θ r和θ m之间的差值是否小于θ thd1,θ thd2thd1。若θ r和θ m的差值小于θ thd2,则屏蔽CS mm)(即步骤1110),若θ r和θ m的差值不小于θ thd2,则将结合CS mm)和S rr)得到新声源(即步骤1111)。
因此,在本申请实施方式中,当声源在移动时,可以根据声源的移动速度,确定与声源匹配的入射角或者新的入射角,从而可以适应声源不同移动状态,泛化能力强。
在对声源进行定位之后,即可更新波束分离网络,从而可以将麦克风阵列采集到数据中作为更新后的波束分离网络的输入,分离出声源的语音数据。
具体地,波束分离网络可以包括语音分离模型和解混响模型,语音分离模型用于提取声源的语音数据,解混响模型用于对输入的数据进行解混响,从而对部分背景数据进行过滤。在使用波束分离网络输出声源的语音数据之前,还可以对波束分离网络进行更新,从而使波束分离模型可以适应不同的场景,分离出与声源匹配的语音数据,下面对更新波束分离网络的具体步骤进行示例性说明。
806、根据移动速度更新语音分离模型,得到更新后的语音分离模型。
其中,该语音分离模型通常用于分离声源的语音数据和环境噪声。
该移动速度可以是声源相对于雷达或者麦克风阵列的移动速度,具体可以是通过雷达回波数据得到,当雷达未检测到运动的对象时,该移动速度可以默认设置为0。
通常,语音和环境噪声的分离依赖于语音分离模型,其分离语音的方式依赖于语音的入射方向或声源的位置等,尤其在声源运动的情况下,模型中对方向一来的参数需要适应性地不断变化的位置,从而输出与声源的位置匹配的语音数据。
具体地,可以根据声源的移动速度来更新语音分离模型的参数集,该移动速度和语音分离模型的参数变化速率呈正相关关系,该语音分离模型的参数变化速率与参数集相关,从而得到更新后的语音分离模型。
通常,参数慢变可以提高模型的稳定性,减少模型的抖动;快变则有利于快速适应环境的变化,因此可以根据目标运动速度来选择模型参数变化的速率,从而影响语音分离模型的参数集,得到更新后的语音分离模型。
例如，假设x_t为声源在t时刻的位置，F为根据当前位置和局部观察值生成的模型特征参数。局部参数数量过少，生成的模型不够稳定，同时上下时刻位置差异较小，因而参数集在时间上存在相关性。此处可以采用一阶回归的形式描述参数在时间上的相关性，回归平滑后的参数集具体表述为：π_t = K_t×π_{t-1} + (1−K_t)×F(x_t)，其中K_t为忘记因子，影响模型更新速度，且接近1但小于1，K_t通常由声源的运动速度决定，即
Figure PCTCN2021132081-appb-000006
当前速度较大时,忘记因子较小,模型更新加快,反之当前速度较小时,忘记因子较大,模型更新变慢。
具体地,可以预先将忘记因子和速度划分为多个对应的档位,在确定了速度所在的档位的范围之后,即可确定忘记因子的值,从而从速度的维度更新语音分离模型。通常,速 度越低,忘记因子越接近1,模型更新缓慢,增加了模型的稳定性。速度越快,则忘记因子越小,模型的更新速度也就越快,能够适应声源快速移动的场景,以便从麦克风阵列采集到的数据中分离出声源的语音数据。
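For illustration only, a minimal Python sketch of the first-order recursive smoothing pi_t = K_t * pi_{t-1} + (1 - K_t) * F(x_t) with a speed-dependent forgetting factor; the particular mapping from speed to K_t and the bounds on K_t are assumptions, since the text only requires K_t to decrease as the source moves faster.

```python
def smooth_params(prev_params, local_params, speed,
                  k_min=0.90, k_max=0.999, speed_ref=1.0):
    """pi_t = K_t * pi_{t-1} + (1 - K_t) * F(x_t).

    prev_params / local_params may be floats or NumPy arrays. The forgetting
    factor K_t stays close to (but below) 1 for a slow source and drops toward
    k_min as the source speeds up, so the model adapts faster.
    """
    k_t = k_max - (k_max - k_min) * min(speed / speed_ref, 1.0)
    return k_t * prev_params + (1.0 - k_t) * local_params, k_t
```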
本申请实施例中的语音分离模型可以用于分离麦克风阵列采集到的语音数据中声源的语音和环境噪声，该语音分离模型可以包括通过广义旁瓣抵消波束分离方式或者多通道维纳滤波方式进行语音分离的模型。具体例如，语音分离模型导出目标源的权重系数w_{f,t}，使得分离后的目标在t时刻和第f个频率上的复信号表示为：
ŝ_{f,t} = w_{f,t}^H·y_{f,t}，
其中，y_{f,t}=[y_{f,t,1}, y_{f,t,2}, ……, y_{f,t,M}]，y_{f,t,m}为第m个麦克接收信号的频域复信号，(·)^H表示复矩阵的共轭转置。以最小方差无畸变响应(MVDR)分离算法为例，权重系数向量可以表达为：
w_{f,t} = R_{f,t}^{-1}·r_{f,t} / (r_{f,t}^H·R_{f,t}^{-1}·r_{f,t})，
权重系数向量即可以理解为语音分离模型。
其中，麦克接收信号的协方差矩阵R_{f,t}，可采用如下连续递归的方式求取：
R_{f,t} = K_t·R_{f,t-1} + (1−K_t)·y_{f,t}·y_{f,t}^H，
其中，K_t为忘记因子，决定参数随着时间更新快慢；r_{f,t}为声源的导向矢量，
Figure PCTCN2021132081-appb-000011
这里随着声源位置变化的参数集为π_t = {R_{f,t} | f=1,2,…,F}，F为最大频率分量的索引。
在声源运动的情况下，速度越低，忘记因子K_t越接近1，模型更新缓慢，增加了模型的稳定性。速度越快，则忘记因子K_t越小，模型的更新速度也就越快，能够适应声源快速移动的场景，以便从麦克风阵列采集到的数据中分离出声源的语音数据。
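For illustration only, a minimal Python/NumPy sketch of one recursive MVDR update per frequency bin, combining the covariance recursion with the forgetting factor K_t and the MVDR weight formula above; the diagonal loading term is an added numerical-stability assumption, not part of the disclosure.

```python
import numpy as np

def mvdr_step(R_prev, y_ft, r_ft, k_t, loading=1e-6):
    """One recursive MVDR update for a single frequency bin f at time t.

    R_{f,t} = K_t * R_{f,t-1} + (1 - K_t) * y y^H
    w_{f,t} = R^{-1} r / (r^H R^{-1} r)
    s_hat   = w^H y   (separated complex signal at (f, t))
    """
    R = k_t * R_prev + (1.0 - k_t) * np.outer(y_ft, y_ft.conj())
    R_inv = np.linalg.inv(R + loading * np.eye(R.shape[0]))
    w = R_inv @ r_ft / (r_ft.conj() @ R_inv @ r_ft)
    s_hat = w.conj() @ y_ft
    return R, w, s_hat
```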
因此,本申请实施方式中,可以结合声源的运动速度,适应性地更新语音分离模型,使语音分离模型与声源的运动情况匹配,提高语音分离模型的输出准确性。
807、根据对象和雷达之间的距离,更新解混响模型,得到更新后的解混响模型。
其中,解混响模型可以用于接触语音信号中的混响,结合语音分离模型从麦克风阵列采集到的数据中准确地输出声源的语音数据。
通常,声源和麦克风阵列的距离显著影响麦克风接收到的信号的混响。当距离较大时,声源发出的语音信号传播距离较远,衰减较大,而室内混响保持不变,混响对于语音信号的干扰较大,混响持续时间较长;而距离越近时,声源发出的语音信号传播距离较近,衰减较小,混响的影响减弱。因此,解混响模型的参数可以基于声源和麦克风阵列的距离来进行调整。当距离较远时,加大解混响的程度;当距离较近时,减少解混响的程度,防止过度解混响而干扰语音信号。甚至在距离非常小的情况下,如小于预设最小值,则可以停止解混响,以提高得到的语音数据的质量。
具体地,可以根据声源和麦克风阵列或者雷达之间的距离,更新解混响模型的延迟参数和预测阶数,从而得到更新后的解混响模型。其中,延迟参数表示混响信号滞后于声源的语音数据的时长,预测阶数表示混响的持续时长,延迟参数和预测阶数都与距离呈正相关关系,因此,在确定了距离之后,即可基于该距离确定延迟参数和预测阶数的值,得到新的解混响模型。
解混响模型具体可以包括基于盲系统辨识和均衡的语音去混响算法的模型，基于源模型的语音去混响算法的模型或者基于房间混响模型和谱增强的语音去混响算法的模型等。例如，本实施例中的解混响模型可以采用多通道线性预测模型，如表示为：
x_{t,f,m} = y_{t,f,m} − Σ_{k=Δ}^{Δ+K−1} g_{k,f,m}^H·y_{t−k,f}，
其中，y_{t,f,m}为第m个麦克在t时刻的第f个频率分量上的可观察信号，g_{k,f,m}为跨越多个通道且针对第m个通道的线性预测系数；这里Δ表示晚期混响迟滞于直达信号的时间；K表示线性预测模型的阶数，也表示晚期混响持续时长，线性预测系数g可通过自回归建模得到。但模型的阶数K选择至关重要，K值过大导致过度解混响，K值过小导致解混响不足。预测阶数K根据声源的位置确定，延迟参数和预测阶数与距离为正相关关系，从而在得到距离之后，即可确定延迟参数和预测阶数，从而得到与声源匹配的解混响模型。
在本申请实施方式中,通过对象到麦克的距离来决定K的取值。距离较大时,混响相对直达信号较强,因此需要选择较大的K值进行足够的解混响;距离较近时,较小的K值进行轻度解混响即可。如
Figure PCTCN2021132081-appb-000014
d表示声源和麦克风阵列之间的距离,δ 0、δ 1、δ 2的值可以根据实际应用场景调整,此处不作限定。
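For illustration only, a minimal Python sketch mapping the source-to-array distance to a delay parameter and a prediction order K; the breakpoints and returned values stand in for the unspecified δ_0, δ_1, δ_2 thresholds and are assumptions.

```python
def dereverb_params(distance_m, d0=0.5, d1=2.0, d2=4.0):
    """Map the source-to-array distance to (delay, prediction order K).

    Both values grow with distance; below d0 dereverberation is skipped so a
    close, weakly reverberated source is not over-processed.
    """
    if distance_m < d0:
        return 0, 0        # very close source: no dereverberation
    if distance_m < d1:
        return 2, 8        # light dereverberation
    if distance_m < d2:
        return 3, 15       # moderate dereverberation
    return 4, 25           # far source: strong late reverberation
```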
因此,在本申请实施方式中,可以基于声源和雷达或者麦克风阵列的距离,来更新解混响模型,使解混响模型和声源当前所处的环境适配,从而结合语音分离模型,更准确地输出声源的语音信号。
808、将麦克风阵列采集到的数据作为波束分离网络的输入,输出声源的语音数据和背景数据。
因波束分离网络包括了语音分离模型和解混响模型,在更新了语音分离模型和解混响模型之后,即可将麦克风阵列采集到的数据作为波束分离网络的输入,输出声源的语音数据和背景数据。
其中,背景数据即麦克风阵列采集到的数据中,除声源的数据之外的数据。例如,在用户讲话的场景中,可以通过麦克风阵列采集数据,并通过波束分离网络从该数据中分离出用户的语音数据和用户所处环境中产生的背景数据。
因此,在本申请实施方式中,从速度的维度更新了语音分离模型,从距离的维度更新了解混响模型,无论声源运动或者静止,都可以通过调整波束分离网络的参数来适应声源的状态,从而分离出与声源更适配的语音数据。
809、判断语音数据是否符合预设条件,若是,则继续执行步骤801,若否,则执行步骤810。
在基于波束分离网络从麦克风阵列采集到的数据中分离出声源的语音数据之后,还可以判断该语音数据是否符合预设条件。若该语音数据不符合预设条件,则可以关闭针对该声源进行处理的波束,即执行步骤810,若该语音数据符合预设条件,则可以持续对该声源的语音数据进行跟踪,即继续执行步骤801-809。
该预设条件可以根据实际场景调整,如该预设条件可以包括通过波束拾取的语音数据小于预设值、或者拾取到的语音数据为非语音类别的信号、或者拾取到的语音数据为设备产生的语音、或者由用户指定屏蔽特定方向或者特定类型的声源等。例如,该预设条件可 以包括声压小于43dB,或者拾取到的语音数据为环境声或噪声等,或者拾取到的语音数据为电视机、音响、PC等的扬声器产生的语音等,或者由用户指定屏蔽某个方向或者某种类型的声源等,如屏蔽狗的声音、屏蔽儿童的声音或者拼比用户对面的声音等。
通常,一个声源对应一个波束分离模型,若存在多个声源,则可以基于每个声源的信息更新得到多个波束分离模型,用户对每个声源的语音数据进行提取。该波束分离模型可以理解为,使用波束对麦克风阵列采集到的数据中的某个方向的数据进行提取,从而有指向性地从麦克风阵列中采集到某个方向上的声源发出的语音。
此外,在一种可能的场景中,还可以通过声源的语音数据检测声源的类型,并在显示界面中展示声源的类型。具体地,可以通过特征提取网络从语音数据中提取特征,得到声源的声学特征,然后根据该声学特征识别声源为活体的第一概率;还根据雷达回波数据确定声源为活体的第二概率,然后对该第一概率和第二概率进行融合,得到该声源是否为活体的融合结果。具体的融合方式可以包括加权求和、乘积、或者取对数求和的方式等进行融合,当融合后的概率值大于预设概率值,即可确定声源为活体。例如,若融合后的概率值大于80%,即可确定声源为活体。例如,如图12所示,在进行多人会议时,识别出当前发出语音的对象为扬声器之后,若不屏蔽该对象的语音,则可以在显示界面中显示当前发声对象的类型为扬声器,从而提高用户体验。
具体例如，麦阵通过声源定位得到多个入射方向，采用波束分离模型增强每一路声信号，采用语音活动性检测器排除非语音源，保留语音源信号。设定上述语音源方向为(α_1, α_2, …, α_n)，对于每一路增强语音信号，提取声学特征，送入活体语音检测器（如训练好的神经网络），输出每一路声信号为活体语音的后验概率(p_a(α_1), p_a(α_2), …, p_a(α_n))。通过雷达跟踪上述多个方向上的活体运动轨迹信息，假设α方向上存在运动信息（轨迹），则倾向于认定α方向上的语音为活体语音，设定该方向上活体语音先验概率p_r(α)>0.5；反之，设定先验概率为小于0.5的值。非活体语音的先验概率为1−p_r(α)；采用乘积方式，计算α方向为活体语音的概率p_true(α)=p_a(α)×p_r(α)，为非活体语音的概率p_false(α)=(1−p_a(α))×(1−p_r(α))。如果p_true(α)>p_false(α)，则认为α方向上的发声源为活体语音。
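For illustration only, a minimal Python sketch of the product fusion above; the prior value used when radar motion is present is an assumption (the text only requires it to exceed 0.5), and the function name is hypothetical.

```python
def fuse_liveness(p_acoustic, radar_has_motion, prior_motion=0.7):
    """Product fusion of the acoustic posterior and the radar-based prior.

    p_true  = p_a * p_r
    p_false = (1 - p_a) * (1 - p_r)
    The direction is declared live speech when p_true > p_false.
    """
    p_r = prior_motion if radar_has_motion else 1.0 - prior_motion
    p_true = p_acoustic * p_r
    p_false = (1.0 - p_acoustic) * (1.0 - p_r)
    return p_true > p_false, p_true, p_false

# Example: strong acoustic evidence plus a radar motion track -> live speech.
print(fuse_liveness(p_acoustic=0.9, radar_has_motion=True))
```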
例如,特征提取网络和特征识别网络可以选取深度卷积神经网络(deep convolutional neural networks,DCNN),循环神经网络(recurrent neural network,RNNS)等等。本申请提及的神经网络可以包括多种类型,如深度神经网络(deep neural network,DNN)、卷积神经网络(convolutional neural network,CNN)、循环神经网络(recurrent neural networks,RNN)或残差网络其他神经网络等。通过特征提取网络从语音数据中提取出声源的声学特征,然后通过识别网络输出该声学特征对应的声源是活体的概率,得到第一概率值,如声源是活体的概率为80%。其次,还通过雷达回波数据确定声源的运动幅度、运动速度、运动周期等信息,判断声源为活体的概率,即判断声源的是否运动,得到第二概率,如声源是否为活体的概率是75%,则可以对80%和75%进行加权融合,如分别确定权重为0.6和0.4,则融合概率为85%*0.6+75%*0.4=81%,即声源为活体的概率为81%。
通常,在一些场景中,当通过雷达回波数据确定声源为活体的概率为0时,即使通过识别网络识别出声源为活体的概率很高,如高于95%,此时也判定声源为非活体。
在另一些场景中,若通过雷达回波数据确定当前场景中存在活体,但通过语音数据未识别出声源具有活体的特征,则此时可能是活体未发声的情况,可以判定声源为非活体。可以理解为,当第一概率的值低于第三阈值时,在进行加权融合时,为该第一概率设置的权重值远高于为第二概率设置的权重值,从而使融合结果更倾向于第一概率值所表示的结果。相应地,当第二概率的值低于第四阈值时,在进行加权融合时,为该第二概率设置的权重值远高于为第一概率设置的权重值,从而使融合结果更倾向于第二概率值所表示的结果。因此,本申请实施方式中,可以有效结果声学特征和雷达检测到的运动情况,来判断声源是否为活体,得到更准确的结果。
进一步地,前述的活体可以替换为人体,可以通过特征提取网络从语音数据中提取特征,然后识别该语音特征识别发出该语音的声源为人体的第一概率,并根据雷达检测声源是否运动,得到声源为人体的第二概率,然后对第一概率和第二概率进行加权融合,得到声源是否为人体的概率值,从而根据该概率值判断声源确定声源是否为人体。因此,可以结合雷达以及声源特征识别出发生对象是否为人体,得到非常准确的识别结果。
因此,对于现有方案难以识别的扬声器发声与活体发声的问题,本申请通过结合雷达和声学特征准确识别出当前发声的声源是否为活体。其次,当无声的人与发声的扬声器同时存在时,传统的雷达的运动检测模式容易产生误判;而声学特征可以将二者鉴别,且扬声器处于长时静止状态,长时运动特征也可将其排除为活体语音。
810、关闭波束。
其中,为便于理解,波束可以理解为从麦克风阵列中提取某一个方向的语音数据的算法或者向量,关闭波束即不通过该波束提取该方向上的语音数据,如关闭前述的波束分离网络。
例如,当通过波束拾取的语音信号逐渐消失,如声压低于43dB,即关闭针对声源的波束。又例如,当确定语音数据为扬声器产生的数据,则关闭针对该扬声器的波束。还例如,可以由用户指定关闭某个方向的波束。
因此,在本申请实施方式中,可以结合麦克风阵列和雷达,准确地确定出声源所在的位置。无论发声对象静止或者运动,都可以检测出声源的具体位置,实现对声源的跟踪,可以适应更多的场景,泛化能力强。并且,还可以通过识别声源的类型来进行波束管理,从而避免拾取无效的语音,提高工作效率,减少负载。
为进一步便于理解,参阅图13,下面对本申请提供的声源定位方法的应用场景进行示例性介绍。
首先,通过雷达1301得到雷达定位信息,通过麦阵1302,即麦克风阵列得到语音信号入射至麦阵的入射角。
其中,雷达定位信息可以包括对象在一段时间内在雷达的辐射范围内的运动情况,如对象在辐射范围内的运动轨迹、加速度、与雷达的相对速度或者与雷达的相对距离等信息。例如,雷达可以在辐射范围内发射调制波,该调制波经对象反射后被雷达接收,形成回波 信号。该回波数据包括了检测到的一个或者多个对象在雷达的检测范围内进行运动时产生的信息,如用户的手部在辐射范围内进行移动时产生的变化轨迹的信息。雷达的具体结构可以参阅前述图1D,此处不再赘述。
例如,雷达可以采用毫米波雷达,如工作频率在60GHz、77GHz频段,可用带宽大于4GHz,距离分辨率达厘米级的雷达。该毫米波雷达可以具有多收多发的天线阵列,可以实现移动对象的水平方位角和垂直方位角的估计。雷达定位信息中可以包括对象相对于雷达的距离或角度,距离信息蕴含于各回波脉冲的频率中,可通过在快时间对单个脉冲进行快速傅立叶变换,获得对象于当前脉冲时间内的距离信息,对各脉冲距离信息进行整合,即可得到对象的整体距离变化信息。该角度可以包括方位角和俯仰角,角度的获取基于雷达的多接收天线,通过测量各接收回波的相位差实现。回波信号与接收天线之间可能因反射对象的位置而存在一定角度,可以通过计算的计算出该角度,从而可以获知到反射对象的具体位置,进而获知对象的位置变化情况。计算角度的方式可以包括多种,如以雷达为中心建立坐标系,基于回波数据计算对象在该坐标系内的位置,从而得到俯仰角或方位角。具体例如,可以采用多信号分类算法(Multiple Signal classification,MUSIC)算法来计算角度,包括俯仰角或者方位角等,利用雷达的四接收天线阵列,对对象的角度变化进行测量。
然后基于雷达定位信息和入射角进行声源定位1303,定位出声源相对于雷达或者麦阵的实际位置。
具体地,可以对雷达定位信息中包括的角度和入射角进行加权融合,得到声源相对于麦阵或者雷达的融合角度,从而确定出声源相对于雷达或者麦阵的实际位置。
在确定融合角度的过程中,存在多种选择,若麦阵定位出多个候选角度,则可以选择与雷达检测到的角度最接近的角度作为入射角。或者,当对象处于较快速度运动时,麦阵在一段时间内检测到多个候选角,可以选择与雷达检测到的角度较远的角度作为新的入射角。具体可以参阅前述步骤805中的相关介绍。
随后,基于声源的运动速度更新语音分离模型1304,以及基于声源和雷达之间的相对距离更新解混响模型1305。
更新后的语音分离模型和更新后的解混响模型组成波束分离网络,对麦阵1302采集到的数据进行信号分离1306,分离出声源的语音数据。
其中,语音分离模型和解混响模型包括于波束分离网络,可以理解为通过波束分离网络,形成针对声源的波束,从而实现对麦阵采集到的数据的分离,提取出声源的语音数据和背景对象产生的语音数据。
然后对声源的语音数据进行语音检测1307,识别出声源是否为活体。
可以理解为,通过对语音数据的声学特征进行识别,确定语音数据是否是活体发出的声音。此外,除了可以通过声源的声学特征来识别声源是否为活体,还可以结合雷达检测到的运动特征(如对象讲话时走动产生的运动或者其他周期运动产生的特征等),来进一步判定声源是否为活体,从而可以准确地检测出声源是否为活体。
例如,提取声源A的声学特征进行检测,识别出A为活体语音的概率。根据雷达检测 对象是否运动,得到场景中存在活体的概率。然后可以采用乘积的形式对两个模态检测结果进行融合,根据融合后的概率判断活体的存在性。通常,当雷达判断存在活体的概率为零时,即便声学模态给出的存在概率很高,但融合概率接近于零,判断场景中不存在活体语音。当场景中的目标活体没有发声,即便雷达模态判断活体存在概率很高,但声学模态会给出较低的语音存在概率,仍然判断不存在活体。双模态活体语音检测有效克服传统方法难以克服的两个难题。首先,高保真扬声器发声与活体语音的鉴别困难,两者之间的频谱特性几乎完全相同,但雷达的运动检测很容易将二者鉴别。其次,无声的人与发声的扬声器同时存在,传统的雷达运动检测模式容易产生误判;而声学特征可以将二者鉴别,且扬声器处于长时静止状态,通过雷达回波检测到的长时的运动特征也可将其排除为活体语音。
然后基于检测结果进行波束管理1308,确定是否保留针对声源的波束。
其中,在进行声源检测之后,可以根据声源检测的结果确定是否关闭针对声源的波束。通常,在家居场景下存在一些基本规则:如(1)通常只有人体运动,雷达检测到的动体很可能是人体,即便人体当前时刻没有发声,但在未来有很高的发声概率;(2)扬声器发声装置,如电视、音响等,通常处于静止状态,当然,在一些场景下也可能运动,但其具有一定的运动规律;(3)人有时在静止状态下说话,有时边走动边说话;(4)活体通常是会运动的;(5)语音信号时强时弱,声源定位装置即便漏掉个别弱音节,也可能不会产生语义误解。因此,结合这些规则,即可准确识别出声源是否活体,并基于识别结果确定是否关闭针对声源进行语音提取的波束。因此,在本申请实施方式中,结合了雷达和麦阵对声源进行定位,从而基于定位确定针对声源的语音进行提取的波束,准确提取到声源的语音数据。
前述对本申请提供的方法的流程进行了详细介绍,下面结合前述的方法流程,对本申请提供的装置的结构进行详细介绍。
首先,本申请提供一种声源定位装置,用于执行前述图2-13的方法的步骤,该声源定位装置可以包括:
雷达定位模块,用于通过雷达回波数据获取第一位置信息,所述第一位置信息中包括对象相对于所述雷达的位置信息;
麦阵定位模块,用于通过麦克风阵列采集到的语音信号获取入射角,所述入射角为语音信号入射至所述麦克风阵列的角度;
声源定位模块,用于若所述第一位置信息中包括所述对象相对于所述雷达的第一角度,则基于所述第一位置信息和所述入射角进行融合,以得到第二位置信息,所述第二位置信息包括产生所述语音信号的声源的位置信息。
在一种可能的实施方式中,所述装置还包括:
语音分离模块,用于基于所述第二位置信息从所述麦克风阵列采集到的语音信号中提取所述声源的语音数据。
在一种可能的实施方式中,语音分离模块,具体用于将所述麦克风阵列采集到的数据作为预设的波束分离网络的输入,输出所述声源的所述语音数据。
在一种可能的实施方式中,所述波束分离网络包括语音分离模型,所述语音分离模型用于分离输入数据中的声源的语音数据和背景数据,所述装置还包括:
更新模块,用于在所述将所述麦克风阵列采集到的语音信号作为预设的波束分离网络的输入之前,根据所述回波数据确定所述声源的移动速度;根据所述移动速度更新所述语音分离模型,得到更新后的所述语音分离模型。
在一种可能的实施方式中,所述更新模块,具体用于根据所述移动速度确定所述语音分离模型的参数集,得到更新后的所述语音分离模型,其中,所述参数集和所述语音分离模型的参数的变化速率相关,所述移动速度和所述变化速率呈正相关关系。
在一种可能的实施方式中,所述波束分离网络还包括解混响模型,所述解混响模型用于滤除输入的数据中的混响信号;
所述更新模块,还用于在所述将所述麦克风阵列采集到的语音信号作为预设的波束分离网络的输入之前,根据所述对象和所述雷达之间的距离,更新所述解混响模型,得到更新后的所述解混响模型。
在一种可能的实施方式中,所述更新模块,具体用于根据所述对象和所述雷达之间的距离,更新所述解混响模型中的延迟参数和预测阶数,得到更新后的所述解混响模型,所述延迟参数表示所述混响信号滞后于所述声源的语音数据的时长,所述预测阶数表示混响的持续时长,所述延迟参数和所述预测阶数都与所述距离呈正相关关系。
在一种可能的实施方式中,所述语音分离模块,还用于若所述声源的语音数据不符合预设条件,则去除针对所述麦克风阵列采集到的数据中所述声源对应的数据进行处理所使用的波束。
在一种可能的实施方式中,所述装置还包括活体检测单元,用于:从所述语音数据中提取特征,得到所述声源的声学特征;根据所述声学特征识别所述声源为活体的第一概率;根据所述雷达的回波数据,确定所述声源为活体的第二概率;对所述第一概率和所述第二概率进行融合,得到融合结果,所述融合结果用于表示所述声源是否为活体。
在一种可能的实施方式中,所述第一角度和所述入射角处于同一坐标系中,所述声源定位模块,具体用于分别确定所述第一角度对应的第一权重和所述入射角对应第二权重,其中,所述第一权重和所述对象相对于所述雷达的移动速度呈正相关关系,所述第二权重和所述对象相对于所述雷达的移动速度呈负相关关系;根据所述第一权重和所述第二权重对所述角度和所述入射角进行加权融合,得到融合角度,所述第二位置信息中包括所述融合角度。
在一种可能的实施方式中,所述麦阵定位模块,具体用于若通过麦克风阵列采集到的语音信号得到多个第二角度,所述第一角度和所述多个第二角度处于同一坐标系中,则从所述多个第二角度中选取与所述第一角度之间的差值最小或者所述差值在第一预设范围内的角度作为所述入射角。
在一种可能的实施方式中,所述麦阵定位模块,具体用于在所述通过麦克风阵列采集到的语音信号获取入射角之后,若基于所述麦克风阵列再次采集到的数据得到多个第三角度,则基于所述对象的移动速度,从所述多个第三角度中选取角度作为新的所述入射角。
在一种可能的实施方式中,所述麦阵定位模块,具体用于:若所述对象的移动速度大于预设速度,则从所述多个第三角度中筛选出,与所述第一角度之间的差值在第二预设范围内的角度作为新的所述入射角;若所述对象的移动速度不大于所述预设速度,则从所述多个第三角度中筛选出,与所述第一角度之间的差值在第三预设范围内的角度作为新的所述入射角,所述第三预设范围覆盖且大于所述第二预设范围。
在一种可能的实施方式中,所述声源定位模块,还用于若所述第一位置信息中不包括所述第一角度,则将所述入射角作为所述声源相对于所述麦克风阵列的角度。
在一种可能的实施方式中,所述声源定位模块,还用于在所述通过麦克风阵列采集到的语音信号获取入射角之前,若通过所述回波数据确定所述雷达的检测范围内运动的对象的位置信息,且所述对象未发声,则调整所述麦克风阵列针对所述对象的声源检测阈值,所述麦克风阵列用于采集声压高于所述声源检测阈值的信号。
在一种可能的实施方式中,所述第一位置信息中还包括对象和所述雷达的第一相对距离,所述声源定位模块,还用于若还通过麦克风阵列采集到的语音信号,获取到对象和所述麦克风阵列的第二相对距离,对所述第一相对距离和所述第二相对距离进行融合,得到融合距离,所述融合距离表示所述声源相对于所述麦克风阵列的距离,所述第二位置信息中还包括所述融合距离。
请参阅图15,本申请提供的另一种声源定位装置的结构示意图,如下所述。
该声源定位装置可以包括处理器1501和存储器1502。该处理器1501和存储器1502通过线路互联。其中,存储器1502中存储有程序指令和数据。
存储器1502中存储了前述图2-图13中的步骤对应的程序指令以及数据。
处理器1501用于执行前述图2-图13中任一实施例所示的声源定位装置执行的方法步骤。
可选地,该声源定位装置还可以包括收发器1503,用于接收或者发送数据。
可选地,该声源定位装置还可以包括雷达和/或麦克风阵列(图15中未示出),或者与雷达和/或麦克风阵列建立了连接(图15中未示出),该雷达和/或麦克风阵列可以参阅前述图2-图13中所提及的雷达和/或麦克风阵列,此处不再赘述。
本申请实施例中还提供一种计算机可读存储介质,该计算机可读存储介质中存储有用于生成车辆行驶速度的程序,当其在计算机上行驶时,使得计算机执行如前述图2-图13所示实施例描述的方法中的步骤。
可选地,前述的图15中所示的声源定位装置为芯片。
本申请实施例还提供了一种声源定位装置,该声源定位装置也可以称为数字处理芯片或者芯片,芯片包括处理单元和通信接口,处理单元通过通信接口获取程序指令,程序指令被处理单元执行,处理单元用于执行前述图2-图13中任一实施例所示的声源定位装置执行的方法步骤。
本申请实施例还提供一种数字处理芯片。该数字处理芯片中集成了用于实现上述处理器1501,或者处理器1501的功能的电路和一个或者多个接口。当该数字处理芯片中集成了存储器时,该数字处理芯片可以完成前述实施例中的任一个或多个实施例的方法步骤。 当该数字处理芯片中未集成存储器时,可以通过通信接口与外置的存储器连接。该数字处理芯片根据外置的存储器中存储的程序代码来实现上述实施例中声源定位装置执行的动作。
本申请实施例中还提供一种包括计算机程序产品,当其在计算机上行驶时,使得计算机执行如前述图2-图13所示实施例描述的方法中声源定位装置所执行的步骤。
本申请实施例提供的声源定位装置可以为芯片,芯片包括:处理单元和通信单元,所述处理单元例如可以是处理器,所述通信单元例如可以是输入/输出接口、管脚或电路等。该处理单元可执行存储单元存储的计算机执行指令,以使芯片执行上述图2-图13所示实施例描述的行驶决策选择方法。可选地,所述存储单元为所述芯片内的存储单元,如寄存器、缓存等,所述存储单元还可以是所述无线接入设备端内的位于所述芯片外部的存储单元,如只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)等。
具体地,前述的处理单元或者处理器可以是中央处理器(central processing unit,CPU)、网络处理器(neural-network processing unit,NPU)、图形处理器(graphics processing unit,GPU)、数字信号处理器(digital signal processor,DSP)、专用集成电路(application specific integrated circuit,ASIC)或现场可编程逻辑门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者也可以是任何常规的处理器等。
示例性地,请参阅图16,图16为本申请实施例提供的芯片的一种结构示意图,所述芯片可以表现为神经网络处理器NPU 160,NPU 160作为协处理器挂载到主CPU(Host CPU)上,由Host CPU分配任务。NPU的核心部分为运算电路1603,通过控制器1604控制运算电路1603提取存储器中的矩阵数据并进行乘法运算。
在一些实现中,运算电路1603内部包括多个处理单元(process engine,PE)。在一些实现中,运算电路1603是二维脉动阵列。运算电路1603还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路1603是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器1602中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器1601中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)1608中。
统一存储器1606用于存放输入数据以及输出数据。权重数据直接通过存储单元访问控制器(direct memory access controller,DMAC)1605,DMAC被搬运到权重存储器1602中。输入数据也通过DMAC被搬运到统一存储器1606中。
总线接口单元(bus interface unit,BIU)1610,用于AXI总线与DMAC和取指存储器(instruction fetch buffer,IFB)1609的交互。
总线接口单元1610(bus interface unit,BIU),用于取指存储器1609从外部存储器获取指令,还用于存储单元访问控制器1605从外部存储器获取输入矩阵A或者权重矩阵B 的原数据。
DMAC主要用于将外部存储器DDR中的输入数据搬运到统一存储器1606或将权重数据搬运到权重存储器1602中或将输入数据数据搬运到输入存储器1601中。
向量计算单元1607包括多个运算处理单元,在需要的情况下,对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。主要用于神经网络中非卷积/全连接层网络计算,如批归一化(batch normalization),像素级求和,对特征平面进行上采样等。
在一些实现中,向量计算单元1607能将经处理的输出的向量存储到统一存储器1606。例如,向量计算单元1607可以将线性函数和/或非线性函数应用到运算电路1603的输出,例如对卷积层提取的特征平面进行线性插值,再例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元1607生成归一化的值、像素级求和的值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路1603的激活输入,例如用于在神经网络中的后续层中的使用。
控制器1604连接的取指存储器(instruction fetch buffer)1609,用于存储控制器1604使用的指令;
统一存储器1606,输入存储器1601,权重存储器1602以及取指存储器1609均为On-Chip存储器。外部存储器私有于该NPU硬件架构。
其中,循环神经网络中各层的运算可以由运算电路1603或向量计算单元1607执行。
其中,上述任一处提到的处理器,可以是一个通用中央处理器,微处理器,ASIC,或一个或多个用于控制上述图2-图13的方法的程序执行的集成电路。
另外需说明的是,以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。另外,本申请提供的装置实施例附图中,模块之间的连接关系表示它们之间具有通信连接,具体可以实现为一条或多条通信总线或信号线。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到本申请可借助软件加必需的通用硬件的方式来实现,当然也可以通过专用硬件包括专用集成电路、专用CPU、专用存储器、专用元器件等来实现。一般情况下,凡由计算机程序完成的功能都可以很容易地用相应的硬件来实现,而且,用来实现同一功能的具体硬件结构也可以是多种多样的,例如模拟电路、数字电路或专用电路等。但是,对本申请而言更多情况下软件程序实现是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在可读取的存储介质中,如计算机的软盘、U盘、移动硬盘、只读存储器(read only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述的方法。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。
所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存储的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘(solid state disk,SSD))等。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”、“第三”、“第四”等(如果存在)是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的实施例能够以除了在这里图示或描述的内容以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。

Claims (34)

  1. 一种声源定位方法,其特征在于,包括:
    通过雷达回波数据获取第一位置信息,所述第一位置信息中包括对象相对于所述雷达的第一角度;
    通过麦克风阵列采集到的语音信号获取入射角,所述入射角为所述语音信号入射至所述麦克风阵列的角度;
    融合所述第一角度和所述入射角,以得到第二位置信息,所述第二位置信息用于表示产生所述语音信号的声源的位置。
  2. 根据权利要求1所述的方法,其特征在于,所述融合所述第一角度和所述入射角,包括:
    分别确定所述第一角度对应的第一权重和所述入射角对应第二权重,其中,所述第一权重和所述对象相对于所述雷达的移动速度呈正相关关系,所述第二权重和所述对象相对于所述雷达的移动速度呈负相关关系;
    根据所述第一权重和所述第二权重对所述第一角度和所述入射角进行加权融合,得到融合角度,所述第二位置信息中包括所述融合角度。
  3. 根据权利要求2所述的方法,其特征在于,所述方法还包括:
    基于所述第二位置信息从所述麦克风阵列采集到的语音信号中提取所述声源的语音数据。
  4. 根据权利要求3所述的方法,其特征在于,所述基于所述第二位置信息从所述麦克风阵列采集到的语音信号中提取所述声源的语音数据,包括:
    将所述麦克风阵列采集到的数据作为预设的波束分离网络的输入,输出所述声源的所述语音数据。
  5. 根据权利要求4所述的方法,其特征在于,所述波束分离网络包括语音分离模型,所述语音分离模型用于分离输入数据中的声源的语音数据和背景数据,所述方法还包括:
    根据所述回波数据确定所述声源的移动速度;
    根据所述移动速度更新所述语音分离模型,得到更新后的所述语音分离模型。
  6. 根据权利要求5所述的方法,其特征在于,所述根据所述移动速度更新所述语音分离模型,包括:
    根据所述移动速度确定所述语音分离模型的参数集,得到更新后的所述语音分离模型,其中,所述参数集和所述语音分离模型的参数的变化速率相关,所述移动速度和所述变化速率呈正相关关系。
  7. 根据权利要求5或6所述的方法,其特征在于,所述波束分离网络还包括解混响模型,所述解混响模型用于滤除输入的数据中的混响信号;
    所述方法还包括:
    根据所述对象和所述雷达之间的距离,更新所述解混响模型,得到更新后的所述解混响模型。
  8. 根据权利要求7所述的方法,其特征在于,所述根据所述对象和所述雷达之间的距 离,更新所述解混响模型,包括:
    根据所述对象和所述雷达之间的距离,更新所述解混响模型中的延迟参数和预测阶数,得到更新后的所述解混响模型,所述延迟参数表示所述混响信号滞后于所述声源的语音数据的时长,所述预测阶数表示混响的持续时长,所述延迟参数和所述预测阶数都与所述距离呈正相关关系。
  9. 根据权利要求3-8中任一项所述的方法,其特征在于,所述方法还包括:
    若所述声源的语音数据不符合预设条件,则去除针对所述麦克风阵列采集到的数据中所述声源对应的数据进行处理所使用的波束。
  10. 根据权利要求3-9中任一项所述的方法,其特征在于,所述方法还包括:
    从所述语音数据中提取特征,得到所述声源的声学特征;
    根据所述声学特征识别所述声源为活体的第一概率;
    根据所述雷达的回波数据,确定所述声源为活体的第二概率;
    对所述第一概率和所述第二概率进行融合,得到融合结果,所述融合结果用于表示所述声源是否为活体。
  11. 根据权利要求1-10中任一项所述的方法,其特征在于,所述通过麦克风阵列采集到的语音信号获取入射角,包括:
    若通过麦克风阵列采集到的语音信号得到多个第二角度,所述第一角度和所述多个第二角度处于同一坐标系中,则从所述多个第二角度中选取与所述第一角度之间的差值最小或者所述差值在第一预设范围内的角度作为所述入射角。
  12. 根据权利要求1-11中任一项所述的方法,其特征在于,在所述通过麦克风阵列采集到的语音信号获取入射角之后,所述方法还包括:
    若基于所述麦克风阵列再次采集到的数据得到多个第三角度,则基于所述对象的移动速度,从所述多个第三角度中选取角度作为新的所述入射角。
  13. 根据权利要求12所述的方法,其特征在于,所述基于所述对象的移动速度,从所述多个角度中选取第三角度作为新的所述入射角,包括:
    若所述对象的移动速度大于预设速度,则从所述多个第三角度中筛选出,与所述第一角度之间的差值在第二预设范围内的角度作为新的所述入射角;
    若所述对象的移动速度不大于所述预设速度,则从所述多个第三角度中筛选出,与所述第一角度之间的差值在第三预设范围内的角度作为新的所述入射角,所述第三预设范围覆盖且大于所述第二预设范围。
  14. 根据权利要求1-13中任一项所述的方法,其特征在于,在所述通过麦克风阵列采集到的语音信号获取入射角之前,所述方法还包括:
    若通过所述回波数据确定所述对象处于运动状态,且所述对象未发声,则调整所述麦克风阵列针对所述对象的声源检测阈值,所述麦克风阵列用于采集声压高于所述声源检测阈值的语音信号。
  15. 根据权利要求1-14中任一项所述的方法,其特征在于,所述第一位置信息中还包括所述对象和所述雷达的第一相对距离,所述方法还包括:
    通过所述麦克风阵列采集到的语音信号,获取到所述对象和所述麦克风阵列的第二相对距离;
    对所述第一相对距离和所述第二相对距离进行融合,得到融合距离,所述融合距离表示所述声源相对于所述麦克风阵列的距离,所述第二位置信息中还包括所述融合距离。
  16. 一种声源定位装置,其特征在于,包括:
    雷达定位模块,用于通过雷达回波数据获取第一位置信息,所述第一位置信息中包括对象相对于所述雷达的第一角度;
    麦阵定位模块,用于通过麦克风阵列采集到的语音信号获取入射角,所述入射角为语音信号入射至所述麦克风阵列的角度;
    声源定位模块,用于融合所述第一角度和所述入射角,以得到第二位置信息,所述第二位置信息用于表示产生所述语音信号的声源的位置。
  17. 根据权利要求16所述的装置,其特征在于,
    所述声源定位模块,具体用于分别确定所述第一角度对应的第一权重和所述入射角对应第二权重,其中,所述第一权重和所述对象相对于所述雷达的移动速度呈正相关关系,所述第二权重和所述对象相对于所述雷达的移动速度呈负相关关系;
    根据所述第一权重和所述第二权重对所述角度和所述入射角进行加权融合,得到融合角度,所述第二位置信息中包括所述融合角度。
  18. 根据权利要求17所述的装置,其特征在于,所述装置还包括:
    语音分离模块,用于基于所述第二位置信息从所述麦克风阵列采集到的语音信号中提取所述声源的语音数据。
  19. 根据权利要求18所述的装置,其特征在于,
    语音分离模块,具体用于将所述麦克风阵列采集到的数据作为预设的波束分离网络的输入,输出所述声源的所述语音数据。
  20. 根据权利要求19所述的装置,其特征在于,所述波束分离网络包括语音分离模型,所述语音分离模型用于分离输入数据中的声源的语音数据和背景数据,所述装置还包括:
    更新模块,用于根据所述回波数据确定所述声源的移动速度;根据所述移动速度更新所述语音分离模型,得到更新后的所述语音分离模型。
  21. 根据权利要求20所述的装置,其特征在于,
    所述更新模块,具体用于根据所述移动速度确定所述语音分离模型的参数集,得到更新后的所述语音分离模型,其中,所述参数集和所述语音分离模型的参数的变化速率相关,所述移动速度和所述变化速率呈正相关关系。
  22. 根据权利要求20或21所述的装置,其特征在于,所述波束分离网络还包括解混响模型,所述解混响模型用于滤除输入的数据中的混响信号;
    所述更新模块,还用于根据所述对象和所述雷达之间的距离,更新所述解混响模型,得到更新后的所述解混响模型。
  23. 根据权利要求22所述的装置,其特征在于,
    所述更新模块,具体用于根据所述对象和所述雷达之间的距离,更新所述解混响模型 中的延迟参数和预测阶数,得到更新后的所述解混响模型,所述延迟参数表示所述混响信号滞后于所述声源的语音数据的时长,所述预测阶数表示混响的持续时长,所述延迟参数和所述预测阶数都与所述距离呈正相关关系。
  24. 根据权利要求18-23中任一项所述的装置,其特征在于,
    所述语音分离模块,还用于若所述声源的语音数据不符合预设条件,则去除针对所述麦克风阵列采集到的数据中所述声源对应的数据进行处理所使用的波束。
  25. 根据权利要求18-24中任一项所述的装置,其特征在于,所述装置还包括活体检测单元,用于:
    从所述语音数据中提取特征,得到所述声源的声学特征;
    根据所述声学特征识别所述声源为活体的第一概率;
    根据所述雷达的回波数据,确定所述声源为活体的第二概率;
    对所述第一概率和所述第二概率进行融合,得到融合结果,所述融合结果用于表示所述声源是否为活体。
  26. 根据权利要求16-25中任一项所述的装置,其特征在于,
    所述麦阵定位模块,具体用于若通过麦克风阵列采集到的语音信号得到多个第二角度,所述第一角度和所述多个第二角度处于同一坐标系中,则从所述多个第二角度中选取与所述第一角度之间的差值最小或者所述差值在第一预设范围内的角度作为所述入射角。
  27. 根据权利要求16-26中任一项所述的装置,其特征在于,
    所述麦阵定位模块,具体用于在所述通过麦克风阵列采集到的语音信号获取入射角之后,若基于所述麦克风阵列再次采集到的数据得到多个第三角度,则基于所述对象的移动速度,从所述多个第三角度中选取角度作为新的所述入射角。
  28. 根据权利要求27所述的装置,其特征在于,所述麦阵定位模块,具体用于:
    若所述对象的移动速度大于预设速度,则从所述多个第三角度中筛选出,与所述第一角度之间的差值在第二预设范围内的角度作为新的所述入射角;
    若所述对象的移动速度不大于所述预设速度,则从所述多个第三角度中筛选出,与所述第一角度之间的差值在第三预设范围内的角度作为新的所述入射角,所述第三预设范围覆盖且大于所述第二预设范围。
  29. 根据权利要求16-28中任一项所述的装置,其特征在于,
    所述声源定位模块,还用于在所述通过麦克风阵列采集到的语音信号获取入射角之前,若通过所述回波数据确定所述对象处于运动状态,且所述对象未发声,则调整所述麦克风阵列针对所述对象的声源检测阈值,所述麦克风阵列用于采集声压高于所述声源检测阈值的信号。
  30. 根据权利要求16-29中任一项所述的装置,其特征在于,所述第一位置信息中还包括对象和所述雷达的第一相对距离,
    所述声源定位模块,还用于通过所述麦克风阵列采集到的语音信号,获取到所述对象和所述麦克风阵列的第二相对距离,对所述第一相对距离和所述第二相对距离进行融合,得到融合距离,所述融合距离表示所述声源相对于所述麦克风阵列的距离,所述第二位置信 息中还包括所述融合距离。
  31. 一种声源定位装置,其特征在于,包括处理器,所述处理器和存储器耦合,所述存储器存储有程序,当所述存储器存储的程序指令被所述处理器执行时实现权利要求1至15中任一项所述的方法。
  32. 一种计算机可读存储介质,包括程序,当其被处理单元所执行时,执行如权利要求1至15中任一项所述的方法。
  33. 一种声源定位装置,其特征在于,包括处理单元和通信接口,所述处理单元通过所述通信接口获取程序指令,当所述程序指令被所述处理单元执行时实现权利要求1至15中任一项所述的方法。
  34. 一种拾音装置,其特征在于,包括:雷达、麦克风阵列和处理器,所述雷达、所述麦克风阵列和所述处理器之间连接;
    所述雷达用于发射调制波并接收回波数据;
    所述麦克风阵列包括至少一个麦克风,所述麦克风阵列用于采集声源发出的语音信号;
    所述处理器,用于执行如权利要求1至15中任一项所述的方法。
PCT/CN2021/132081 2020-12-31 2021-11-22 一种声源定位方法以及装置 WO2022142853A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/215,486 US20230333205A1 (en) 2020-12-31 2023-06-28 Sound source positioning method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011637064.4A CN112859000B (zh) 2020-12-31 2020-12-31 一种声源定位方法以及装置
CN202011637064.4 2020-12-31

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/215,486 Continuation US20230333205A1 (en) 2020-12-31 2023-06-28 Sound source positioning method and apparatus

Publications (1)

Publication Number Publication Date
WO2022142853A1 true WO2022142853A1 (zh) 2022-07-07

Family

ID=76000448

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/132081 WO2022142853A1 (zh) 2020-12-31 2021-11-22 一种声源定位方法以及装置

Country Status (3)

Country Link
US (1) US20230333205A1 (zh)
CN (1) CN112859000B (zh)
WO (1) WO2022142853A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112859000B (zh) * 2020-12-31 2023-09-12 华为技术有限公司 一种声源定位方法以及装置
CN114173273B (zh) * 2021-12-27 2024-02-13 科大讯飞股份有限公司 麦克风阵列检测方法、相关设备及可读存储介质

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109506568A (zh) * 2018-12-29 2019-03-22 苏州思必驰信息科技有限公司 一种基于图像识别和语音识别的声源定位方法及装置
US20190208318A1 (en) * 2018-01-04 2019-07-04 Stmicroelectronics, Inc. Microphone array auto-directive adaptive wideband beamforming using orientation information from mems sensors
CN110085258A (zh) * 2019-04-02 2019-08-02 深圳Tcl新技术有限公司 一种提高远场语音识别率的方法、系统及可读存储介质
CN110223686A (zh) * 2019-05-31 2019-09-10 联想(北京)有限公司 语音识别方法、语音识别装置和电子设备
CN110716180A (zh) * 2019-10-17 2020-01-21 北京华捷艾米科技有限公司 一种基于人脸检测的音频定位方法及装置
CN110970049A (zh) * 2019-12-06 2020-04-07 广州国音智能科技有限公司 多人声识别方法、装置、设备及可读存储介质
CN111445920A (zh) * 2020-03-19 2020-07-24 西安声联科技有限公司 一种多声源的语音信号实时分离方法、装置和拾音器
CN112859000A (zh) * 2020-12-31 2021-05-28 华为技术有限公司 一种声源定位方法以及装置

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190208318A1 (en) * 2018-01-04 2019-07-04 Stmicroelectronics, Inc. Microphone array auto-directive adaptive wideband beamforming using orientation information from mems sensors
CN109506568A (zh) * 2018-12-29 2019-03-22 苏州思必驰信息科技有限公司 一种基于图像识别和语音识别的声源定位方法及装置
CN110085258A (zh) * 2019-04-02 2019-08-02 深圳Tcl新技术有限公司 一种提高远场语音识别率的方法、系统及可读存储介质
CN110223686A (zh) * 2019-05-31 2019-09-10 联想(北京)有限公司 语音识别方法、语音识别装置和电子设备
CN110716180A (zh) * 2019-10-17 2020-01-21 北京华捷艾米科技有限公司 一种基于人脸检测的音频定位方法及装置
CN110970049A (zh) * 2019-12-06 2020-04-07 广州国音智能科技有限公司 多人声识别方法、装置、设备及可读存储介质
CN111445920A (zh) * 2020-03-19 2020-07-24 西安声联科技有限公司 一种多声源的语音信号实时分离方法、装置和拾音器
CN112859000A (zh) * 2020-12-31 2021-05-28 华为技术有限公司 一种声源定位方法以及装置

Also Published As

Publication number Publication date
CN112859000A (zh) 2021-05-28
US20230333205A1 (en) 2023-10-19
CN112859000B (zh) 2023-09-12

Similar Documents

Publication Publication Date Title
CN107577449B (zh) 唤醒语音的拾取方法、装置、设备及存储介质
JP5710792B2 (ja) 可聴音と超音波とを用いたソース特定のためのシステム、方法、装置、およびコンピュータ可読媒体
US20230333205A1 (en) Sound source positioning method and apparatus
WO2020108614A1 (zh) 音频识别方法、定位目标音频的方法、装置和设备
US11158333B2 (en) Multi-stream target-speech detection and channel fusion
US20210314701A1 (en) Multiple-source tracking and voice activity detections for planar microphone arrays
KR101520554B1 (ko) 연속파 초음파 신호들을 이용한 무접촉식 감지 및 제스쳐 인식
TWI711035B (zh) 方位角估計的方法、設備、語音交互系統及儲存介質
US11435429B2 (en) Method and system of acoustic angle of arrival detection
US10535361B2 (en) Speech enhancement using clustering of cues
JP2015516093A (ja) オーディオユーザ対話認識および文脈精製
US20190355373A1 (en) 360-degree multi-source location detection, tracking and enhancement
WO2021017950A1 (zh) 超声波处理方法、装置、电子设备及计算机可读介质
US11264017B2 (en) Robust speaker localization in presence of strong noise interference systems and methods
Sewtz et al. Robust MUSIC-based sound source localization in reverberant and echoic environments
CN113223552B (zh) 语音增强方法、装置、设备、存储介质及程序
CN114038452A (zh) 一种语音分离方法和设备
JP2023545981A (ja) 動的分類器を使用したユーザ音声アクティビティ検出
Taj et al. Audio-assisted trajectory estimation in non-overlapping multi-camera networks
Berghi et al. Leveraging Visual Supervision for Array-based Active Speaker Detection and Localization
CN109212480B (zh) 一种基于分布式辅助粒子滤波的声源跟踪方法
US20240022869A1 (en) Automatic localization of audio devices
US20230204744A1 (en) On-device user presence detection using low power acoustics in the presence of multi-path sound propagation
Pandey et al. Sound Localisation of an Acoustic Source Using Time Delay and Distance Estimation
WO2023102089A1 (en) Multi-sensor systems and methods for providing immersive virtual environments

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21913595

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21913595

Country of ref document: EP

Kind code of ref document: A1