WO2022012206A1 - Audio signal processing method, apparatus, device, and storage medium - Google Patents

Audio signal processing method, apparatus, device, and storage medium

Info

Publication number
WO2022012206A1
WO2022012206A1 · PCT/CN2021/098085 · CN2021098085W
Authority
WO
WIPO (PCT)
Prior art keywords
filter
target
audio
audio signal
interference
Prior art date
Application number
PCT/CN2021/098085
Other languages
English (en)
French (fr)
Inventor
陈日林
姜开宇
黎韦伟
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Priority to JP2022538830A (patent document JP7326627B2)
Priority to EP21842054.5A (patent document EP4092672A4)
Publication of WO2022012206A1
Priority to US17/741,285 (patent document US12009006B2)

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 - Changing voice quality, e.g. pitch or formants
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 - Processing in the frequency domain
    • G10L21/0316 - Speech enhancement by changing the amplitude
    • G10L21/0364 - Speech enhancement by changing the amplitude for improving intelligibility
    • G10L2021/02161 - Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 - Microphone arrays; Beamforming
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00 - Details of transducers, loudspeakers or microphones
    • H04R1/20 - Arrangements for obtaining desired frequency or directional characteristics
    • H04R1/32 - Arrangements for obtaining desired directional characteristic only
    • H04R1/40 - Arrangements for obtaining desired directional characteristic only by combining a number of identical transducers
    • H04R1/406 - Arrangements for obtaining desired directional characteristic only by combining a number of identical microphones
    • H04R3/00 - Circuits for transducers, loudspeakers or microphones
    • H04R3/005 - Circuits for combining the signals of two or more microphones
    • H04R25/00 - Deaf-aid sets, i.e. electro-acoustic or electro-mechanical hearing aids; Electric tinnitus maskers providing an auditory perception
    • H04R25/40 - Arrangements for obtaining a desired directivity characteristic
    • H04R25/407 - Circuits for combining signals of a plurality of transducers
    • H04R2201/40 - Details of arrangements for obtaining desired directional characteristic by combining a number of identical transducers covered by H04R1/40 but not provided for in any of its subgroups
    • H04R2430/20 - Processing of the output signals of the acoustic transducers of an array for obtaining a desired directivity characteristic

Definitions

  • the present application relates to the field of speech processing, in particular to audio signal processing technology.
  • Speech enhancement technology is an important branch of speech signal processing. It is widely used for noise suppression, speech compression coding, and speech recognition in noisy environments, and it plays an increasingly important role in improving speech recognition rates.
  • GSC: Generalized Sidelobe Canceller
  • the method in the related art uses a pre-designed filter without considering the influence of the movement of the interfering sound source on the processing result, which leads to a poor sound source separation effect.
  • the present application provides an audio signal processing method, apparatus, device and storage medium, which can reduce interference leakage in the case of interference movement.
  • the technical solution is as follows:
  • an audio signal processing method is provided, the method is performed by an audio signal processing device, and the method includes:
  • audio signals collected by different microphones in a microphone array are acquired;
  • a first target beam is obtained by filtering the audio signal through a first filter, where the first filter is used to suppress interfering speech in the audio signal and enhance target speech in the audio signal;
  • a first interference beam is obtained by filtering the audio signal through a second filter, where the second filter is used to suppress the target speech and enhance the interference speech;
  • a second interference beam is obtained from the first interference beam through a third filter, where the third filter is used to weight and adjust the first interference beam;
  • the difference between the first target beam and the second interference beam is determined as a first audio processing output;
  • at least one of the second filter and the third filter is adaptively updated, and after the update is completed, the first filter is updated according to the second filter and the third filter.
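The structure just described can be sketched numerically. The following is a minimal, illustrative NumPy sketch of one processing step at a single frequency bin; the names `gsc_output`, `w1`, `wb`, and `h`, and the choice of a single complex tap for the third filter, are assumptions for illustration and are not taken from the application.

```python
import numpy as np

def gsc_output(x, w1, wb, h):
    """One frequency-bin snapshot of the described structure (illustrative).

    x  : (n_mics,) complex audio snapshot from the microphone array
    w1 : (n_mics,) first filter, enhances target speech, suppresses interference
    wb : (n_mics,) second filter, suppresses target speech, enhances interference
    h  : complex scalar third filter, weights and adjusts the interference beam
    """
    target_beam = np.vdot(w1, x)                   # first target beam
    interference_beam = np.vdot(wb, x)             # first interference beam
    weighted_interference = h * interference_beam  # second interference beam
    # first audio processing output = target beam minus weighted interference
    return target_beam - weighted_interference
```

With an ideal blocking filter, subtracting the weighted interference beam removes the interference component while leaving the target beam intact.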
  • an audio signal processing method is provided, the method is performed by an audio signal processing device, and the method includes:
  • audio signals collected by different microphones in the microphone array are acquired, where the microphone array includes n target directions, each target direction corresponds to a filter bank, the filter bank processes the audio signal using the method described above, and n is a positive integer greater than 1;
  • for the audio signals corresponding to the n target directions, the corresponding filter banks are used to filter the audio signals, to obtain n first audio processing outputs corresponding to the n target directions;
  • an audio signal processing apparatus is provided, the apparatus is deployed on an audio signal processing device, and the apparatus includes:
  • a first acquisition module, configured to acquire audio signals collected by different microphones in the microphone array;
  • a first filtering module, configured to filter the audio signal through a first filter to obtain a first target beam, where the first filter is used to suppress interfering speech in the audio signal and enhance the target speech in the audio signal;
  • a second filtering module, configured to filter the audio signal through a second filter to obtain a first interference beam, where the second filter is used to suppress the target speech and enhance the interference speech;
  • a third filtering module, configured to obtain a second interference beam from the first interference beam through a third filter, where the third filter is used to weight and adjust the first interference beam;
  • a first determining module, configured to determine the difference between the first target beam and the second interference beam as a first audio processing output;
  • a first update module, configured to adaptively update at least one of the second filter and the third filter, and update the first filter according to the second filter and the third filter after the update is completed.
  • an audio signal processing apparatus is provided, the apparatus is deployed on an audio signal processing device, and the apparatus includes:
  • a second acquisition module, configured to acquire audio signals collected by different microphones in the microphone array, where the microphone array includes n target directions, each target direction corresponds to a filter bank, and the filter bank processes the audio signal using the first audio signal processing method described above;
  • a filter bank module, configured to filter the audio signals corresponding to the n target directions using the corresponding filter banks, to obtain n first audio processing outputs corresponding to the n target directions;
  • a fourth filtering module, configured to filter the i-th first audio processing output according to the n-1 first audio processing outputs other than the i-th first audio processing output, to obtain the i-th second audio processing output corresponding to the i-th target direction, where i is a positive integer not greater than n; this step is repeated to obtain n second audio processing outputs corresponding to the n target directions respectively.
  • a computer device is provided, including a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the audio signal processing method described in any of the above optional solutions.
  • a computer-readable storage medium is provided, where at least one instruction, at least one program, a code set, or an instruction set is stored in the storage medium, and is loaded and executed by a processor to implement the audio signal processing method described in any of the above optional solutions.
  • a computer program product or computer program where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the audio signal processing method provided in the foregoing optional implementation manner.
  • the first filter, the second filter, and the third filter can track changes in the steering vector of the target sound source in real time and update the filters promptly; the updated filters are then used to process the audio signals collected by the microphones next time, so that the filters adapt their audio processing output to changes in the scene, ensuring the tracking performance of the filters when the interference moves and reducing interference leakage.
  • FIG. 1 is a schematic diagram of an audio signal processing system according to an exemplary embodiment.
  • FIG. 2 shows a schematic diagram of microphone distribution provided by another exemplary embodiment of the present application.
  • FIG. 3 shows a schematic diagram of microphone distribution provided by another exemplary embodiment of the present application.
  • FIG. 4 shows a flowchart of an audio signal processing method provided by another exemplary embodiment of the present application.
  • FIG. 5 shows a schematic diagram of a filter composition provided by another exemplary embodiment of the present application.
  • FIG. 6 shows a schematic diagram of a filter composition provided by another exemplary embodiment of the present application.
  • FIG. 7 shows a flowchart of an audio signal processing method provided by another exemplary embodiment of the present application.
  • FIG. 8 shows a schematic diagram of a filter composition provided by another exemplary embodiment of the present application.
  • FIG. 9 shows a schematic diagram of a filter composition provided by another exemplary embodiment of the present application.
  • FIG. 10 shows a schematic diagram of a filter composition provided by another exemplary embodiment of the present application.
  • FIG. 11 shows a schematic diagram of a filter composition provided by another exemplary embodiment of the present application.
  • FIG. 12 shows a dual-channel spectrogram provided by another exemplary embodiment of the present application.
  • FIG. 13 shows a dual-channel spectrogram provided by another exemplary embodiment of the present application.
  • FIG. 14 shows a block diagram of an audio signal processing apparatus provided by another exemplary embodiment of the present application.
  • FIG. 15 shows a block diagram of an audio signal processing apparatus provided by another exemplary embodiment of the present application.
  • FIG. 16 is a structural block diagram of a computer device according to an exemplary embodiment.
  • artificial intelligence technology has been researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, and drones. It is believed that with the development of technology, artificial intelligence will be applied in more fields and deliver increasingly important value.
  • the present application relates to the field of smart home technology, and in particular, to an audio signal processing method.
  • Artificial intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can respond in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive discipline, involving a wide range of fields, including both hardware-level technology and software-level technology.
  • the basic technologies of artificial intelligence generally include technologies such as sensors, special artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • ASR: Automatic Speech Recognition
  • TTS: Text To Speech
  • Voiceprint recognition technology: making computers able to hear, see, speak, and feel is the future direction of human-computer interaction, and voice will become one of the most promising human-computer interaction methods.
  • A microphone, commonly known as a mic, is the first link in the electroacoustic equipment chain.
  • A microphone is a transducer that converts sound energy into mechanical energy and then converts that mechanical energy into electrical energy.
  • People have built a variety of microphones using various energy-conversion principles; condenser, moving-coil, and aluminum ribbon microphones are commonly used in recording.
  • FIG. 1 is a schematic diagram of an audio signal processing system according to an exemplary embodiment. As shown in FIG. 1 , the audio signal processing system 100 includes a microphone array 101 and an audio signal processing device 102 .
  • the microphone array 101 includes at least two microphones arranged at at least two different positions.
  • the microphone array 101 is used to sample and process the spatial characteristics of the sound field, so as to use the audio signals received by the microphone array 101 to calculate the angle and distance of the target speaker, so as to realize the tracking of the target speaker and subsequent voice directional pickup.
  • the microphone array 101 is set in a vehicle-mounted scenario. When the microphone array includes two microphones, the two microphones are respectively arranged near the main driver's seat and the front passenger seat.
  • the microphone array can be divided into compact and distributed types. For example, as shown in (1) in FIG. 2, a compact microphone array is provided, in which the two microphones are arranged on the inner sides of the main driver's seat 201 and the front passenger seat 202; as another example, as shown in (2) in FIG. 2, a distributed microphone array is provided, in which the two microphones are arranged on the outer sides of the main driver's seat 201 and the front passenger seat 202.
  • when the microphone array includes four microphones, the four microphones are respectively arranged near the main driver's seat, the front passenger seat, and the two rear passenger seats. For example, as shown in (1) in FIG. 3, a compact microphone array is provided, in which the four microphones are arranged on the inner sides of the main driver's seat 201, the front passenger seat 202, and the two rear passenger seats 203.
  • FIG. 3 also shows distributed arrangements; in another distributed arrangement, the four microphones are respectively arranged above the main driver's seat 201, the front passenger seat 202, and the two rear passenger seats 203.
  • the audio signal processing device 102 is connected to the microphone array 101 and is used for processing the audio signals collected by the microphone array.
  • the audio signal processing device includes a processor 103 and a memory 104, where the memory 104 stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded by the processor 103 to execute the audio signal processing method.
  • the audio signal processing device may be implemented as part of an in-vehicle speech recognition system.
  • the audio signal processing device is further configured to, after processing the audio signal collected by the microphones to obtain the audio processing output, perform speech recognition on the audio processing output to obtain a speech recognition result, or respond to the speech recognition result accordingly.
  • the audio signal processing device may further include a motherboard, an external output/input device, a memory, an external interface, a touch control system, and a power supply.
  • processing elements such as a processor and a controller are integrated in the motherboard, and the processor may be an audio processing chip.
  • the external output/input device may include a display component (such as a display screen), a sound playback component (such as a speaker), a sound collection component (such as a microphone), various keys, etc.
  • the sound collection component may be a microphone array.
  • Program codes and data are stored in the memory.
  • the external interface may include a headphone interface, a charging interface, a data interface, and the like.
  • the touch control system may be integrated in the display component or the key of the external output/input device, and the touch control system is used to detect the touch operation performed by the user on the display component or the key.
  • the power supply is used to power various other components in the terminal.
  • the processor on the motherboard can obtain the audio processing output by executing or calling the program code and data stored in the memory, perform speech recognition on the audio processing output to obtain a speech recognition result, play the generated speech recognition result through the external output/input device, or respond to the user instruction in the speech recognition result.
  • the touch system can detect the keys or other operations performed when the user interacts with the touch system.
  • the sound collection component of the voice interaction device may be a microphone array composed of a certain number of acoustic sensors (usually microphones), which is used to sample and process the spatial characteristics of the sound field, so that the audio signals received by the microphone array can be used to calculate the angle and distance of the target speaker, thereby realizing tracking of the target speaker and subsequent directional speech pickup.
  • This embodiment provides a method for processing the collected audio signal to suppress the interference signal in the audio signal and obtain a more accurate target signal. The method is explained below using the audio signal collected by a vehicle-mounted microphone array as an example.
  • FIG. 3 shows a flowchart of an audio signal processing method provided by an exemplary embodiment of the present application. The method can be applied to the audio signal processing system shown in FIG. 1 and is executed by an audio signal processing device.
  • the method may include the following steps:
  • Step 301: Acquire audio signals collected by different microphones in the microphone array.
  • the audio signal is a multi-channel sound source signal, where the number of channels may correspond to the number of microphones included in the microphone array; for example, if the microphone array includes 4 microphones, the microphone array collects four audio signals.
  • the audio signal includes the target speech issued by the object issuing the speech command and the interfering speech of the ambient noise.
  • the sound source content recorded by each audio signal is consistent. For example, at a certain sampling point, if the microphone array includes four microphones, there are four corresponding audio signals, and each audio signal records the content of the sound source signal at that sampling point; however, the position and/or distance of each microphone in the array relative to the sound source differs, so the frequency, strength, and other characteristics of the source signal received by each microphone differ, which makes the audio signals different.
  • Step 302: Filter the audio signal through a first filter to obtain a first target beam, where the first filter is used to suppress interfering speech in the audio signal and enhance the target speech in the audio signal.
  • the first filter is used for filtering the audio signal, enhancing the target speech in the audio signal, and suppressing the interfering speech in the audio signal.
  • the first filter corresponds to a first weight matrix, and the initial value of the first weight matrix can be set by a technician according to experience, or set arbitrarily.
  • the first filter is updated in real time: the first filter is updated along with the adaptive updates of the second filter and the third filter, and the weight matrices of the second and third filters, which enhance the interference speech and suppress the target speech, determine the first filter's suppression of the interference speech and enhancement of the target speech.
  • the target speech is an audio signal received in the target direction
  • the interfering speech is an audio signal received in directions other than the target direction
  • the target voice is a voice signal issued by the object issuing the voice command.
  • the audio signals form an audio signal matrix X_W
  • the first weight matrix corresponding to the first filter 401 is W_2
  • the first target beam obtained by filtering the audio signal through the first filter 401 is X_W·W_2.
  • a pre-filter may also be set before the first filter, and step 302 further includes steps 3021 to 3022 .
  • Step 3021: Perform first filtering on the audio signal through a pre-filter to obtain a pre-target beam, where the pre-filter is a filter calculated using training data and is used to suppress interfering speech and enhance the target speech.
  • Step 3022: Perform second filtering on the pre-target beam through the first filter to obtain the first target beam.
  • the pre-filter is a filter calculated using training data. Pre-filters are also used to enhance target speech in the audio signal and suppress interfering speech.
  • the pre-filter is a filter calculated according to the Linearly Constrained Minimum-Variance (LCMV) criterion, and the pre-filter is a fixed value after calculation and will not be iteratively updated.
  • LCMV: Linearly Constrained Minimum-Variance
  • the audio signals form an audio signal matrix X_W
  • the pre-weight matrix corresponding to the pre-filter 402 is W
  • the first weight matrix corresponding to the first filter 401 is W_2
  • the pre-target beam obtained by filtering the audio signal through the pre-filter 402 is X_W·W
  • the first target beam obtained by filtering the pre-target beam through the first filter 401 is X_W·W·W_2.
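The matrix products above can be made concrete with assumed shapes. In this sketch, X_W is laid out as T frames by n_mics channels for one frequency bin, W maps the channels to a single pre-target beam, and W_2 is a scalar first-filter weight; the layout and the numeric values are illustrative assumptions, not specified by the application.

```python
import numpy as np

# Illustrative shapes for one frequency bin over T frames (assumed layout):
# X_w: (T, n_mics) audio signal matrix, W: (n_mics, 1) pre-filter weights,
# W2: (1, 1) first-filter weight.
rng = np.random.default_rng(0)
T, n_mics = 5, 4
X_w = rng.standard_normal((T, n_mics)) + 1j * rng.standard_normal((T, n_mics))
W = rng.standard_normal((n_mics, 1)) + 0j
W2 = np.array([[0.8 + 0j]])

pre_target = X_w @ W            # pre-target beam  X_W·W
first_target = pre_target @ W2  # first target beam X_W·W·W_2
```

The cascade is just two matrix products, so the first filter refines the beam the fixed pre-filter has already formed.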
  • a method for calculating a pre-filter is given.
  • the application environment is the spatial range in which the microphone array is placed and used.
  • the training data includes sample audio signals collected by different microphones in the microphone array.
  • the training data is calculated according to the linearly constrained minimum-variance (LCMV) criterion to obtain the pre-filter.
  • a pre-calculated pre-filter is set before the first filter, so that the pre-filter processes the audio signal first; this improves the accuracy of target speech separation and the filter's ability to process the audio signal in the initial stage.
  • the pre-filter is calculated according to actual data collected in an actual audio signal collection scene.
  • the audio signal processing method provided in this application uses actual audio data collected in the application environment to train the pre-filter, so that the pre-filter closely matches the actual application scenario; this improves the fit between the pre-filter and the application scene and the pre-filter's suppression of interference.
  • the training data corresponds to a target direction
  • the training data in a certain target direction is used to train the pre-filter corresponding to that target direction, so that the trained pre-filter can enhance the target speech in the target direction and suppress interfering speech from other directions.
  • a pre-filter is obtained by training on the training data collected for the target direction, so that the pre-filter can better identify the audio signal in the target direction and better suppress audio signals from other directions.
  • the time-domain signals collected by the microphones are mic_1, mic_2, mic_3, and mic_4; transforming these microphone signals into the frequency domain yields the frequency-domain signals X_W1, X_W2, X_W3, and X_W4. Taking any microphone as the reference microphone, the relative transfer function StrV_j of each other microphone can be obtained, where j is an integer; if the number of microphones is k, then 0 ≤ j ≤ k-1.
  • the relative transfer functions StrV j of other microphones are:
  • StrV j X Wj /X W1 .
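The relative-transfer-function computation above can be sketched in NumPy (a minimal illustration with synthetic data; the array shapes and the helper name are assumptions, not from the patent):

```python
import numpy as np

def relative_transfer_functions(X, ref=0, eps=1e-12):
    """Per-frequency-bin relative transfer functions StrV_j = X_Wj / X_Wref.

    X: complex array of shape (k_mics, n_bins) holding frequency-domain
    microphone signals; ref selects the reference microphone."""
    return X / (X[ref] + eps)

# Four-microphone example with synthetic frequency-domain data.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8)) + 1j * rng.standard_normal((4, 8))
StrV = relative_transfer_functions(X)
```

Dividing by the reference channel makes the reference's own transfer function identically one, which is a quick sanity check on the computation.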
  • the optimal filter (pre-filter) in the current real application environment is obtained.
  • the calculation formula of the LCMV criterion is:
  • W = R_xx^{-1} C (C^H R_xx^{-1} C)^{-1} f
  • W is the weight matrix of the pre-filter
  • R_xx = E[XX^H]
  • X = [X_W1, X_W2, X_W3, X_W4]^T
  • C is the steering vector (constraint) matrix
  • f = [1, ε_1, ε_2, ε_3] are the limiting conditions; the response is 1 in the desired direction, and ε_1, ε_2, ε_3 are the responses set at the interference null points
  • the interference null points can be set as required to ensure the ability to suppress interference.
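The closed-form LCMV solution above can be sketched as follows (a minimal NumPy illustration; the constraint matrix, diagonal loading, and data are hypothetical, not from the patent):

```python
import numpy as np

def lcmv_weights(R_xx, C, f, diag_load=1e-6):
    """Closed-form LCMV solution W = R^-1 C (C^H R^-1 C)^-1 f.

    R_xx: (k, k) spatial covariance; C: (k, m) constraint/steering matrix;
    f: (m,) desired responses (1 toward the target, small values at the
    interference nulls)."""
    k = R_xx.shape[0]
    R = R_xx + diag_load * np.eye(k)        # diagonal loading for stability
    Rinv_C = np.linalg.solve(R, C)
    return Rinv_C @ np.linalg.solve(C.conj().T @ Rinv_C, f)

# Sketch: 4 mics, constraints = [target direction, one interference null].
rng = np.random.default_rng(1)
X = rng.standard_normal((4, 200))
R_xx = X @ X.conj().T / X.shape[1]          # sample estimate of E[X X^H]
C = np.stack([np.ones(4), rng.standard_normal(4)], axis=1)
f = np.array([1.0, 0.01])                   # response ~1 in the desired direction
W = lcmv_weights(R_xx, C, f)
```

By construction the constraints are satisfied: `C^H W` equals `f` up to solver precision, so the beamformer keeps unit gain toward the target while holding the null response small.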
  • Step 303: Filter the audio signal through a second filter to obtain a first interference beam; the second filter is used to suppress the target speech and enhance the interfering speech.
  • the second filter is used to suppress the target speech in the audio signal and enhance the interfering speech, so as to obtain the beam of the interfering speech as clearly as possible.
  • the second filter corresponds to a second weight matrix, and the initial value of the second weight matrix can be set according to the experience of a technician.
  • at least two audio signals form an audio signal matrix X_W
  • the second weight matrix corresponding to the second filter 403 is W_b
  • the at least two audio signals are filtered by the second filter 403
  • the first interference beam is X_W W_b.
  • Step 304 Obtain a second interference beam of the first interference beam through a third filter, where the third filter is used to weight and adjust the first interference beam.
  • the third filter is used to secondary filter the output of the second filter.
  • the third filter is used to adjust the weights of the target speech and the interference speech in the first interference beam, so that in step 305, the interference beam is subtracted from the target beam, thereby removing the interference beam in the target beam and obtaining accurate audio output. result.
  • the audio signals form an audio signal matrix X_W
  • the second weight matrix corresponding to the second filter 403 is W_b
  • the third weight matrix corresponding to the third filter 404 is W_anc
  • the first interference beam obtained by filtering the at least two audio signals through the second filter 403 is X_W W_b
  • the second interference beam obtained by filtering the first interference beam through the third filter 404 is X_W W_b W_anc.
  • Step 305 Determine the difference between the first target beam and the second interference beam as the first audio processing output.
  • the audio processing output is a filtered beam of target speech.
  • the audio signals form an audio signal matrix X_W
  • the second interference beam X_W W_b W_anc output by the third filter is subtracted from the first target beam X_W W_2 output by the first filter
  • the first audio processing output Y_1 = X_W W_2 - X_W W_b W_anc is obtained.
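A minimal NumPy sketch of this dataflow (matrix shapes are illustrative assumptions; a real implementation would operate per frequency bin on complex spectra):

```python
import numpy as np

rng = np.random.default_rng(2)
k, m = 4, 4                             # mics, beam outputs (illustrative sizes)
X_W = rng.standard_normal((100, k))     # 100 frames of 4-channel data (real-valued sketch)
W2 = rng.standard_normal((k, m))        # first filter: enhances the target speech
Wb = rng.standard_normal((k, m))        # second (blocking) filter: enhances interference
Wanc = rng.standard_normal((m, m))      # third filter: weights the interference beam

target_beam = X_W @ W2                  # first target beam
interference1 = X_W @ Wb                # first interference beam
interference2 = interference1 @ Wanc    # second interference beam
Y1 = target_beam - interference2        # first audio processing output
```

Because matrix multiplication distributes over subtraction, `Y1` equals `X_W @ (W2 - Wb @ Wanc)`, which is the relationship the later filter-update step relies on.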
  • the filter combination shown in FIG. 6 uses a pre-filter to perform initial filtering, the filtering accuracy is high in the initial stage, and therefore, distributed or compact microphone arrays can be filtered in this manner.
  • the filter combination shown in FIG. 5 does not use a pre-filter, and it is not necessary to obtain a pre-filter using the training data collected in the actual operating environment in advance, thereby reducing the dependence of the filter combination on the actual operating environment.
  • Step 306: Adaptively update at least one of the second filter and the third filter, and update the first filter according to the second filter and the third filter after the update is completed.
  • the second filter and the third filter are adjusted according to the filtered beam.
  • the second filter is updated according to the first target beam, and the third filter is updated according to the first audio processing output; or, the second filter and the third filter are both updated according to the first audio processing output; or, only the second filter is updated according to the first target beam; or, only the second filter is updated according to the first audio processing output; or, only the third filter is updated according to the first audio processing output.
  • in the audio signal processing method provided by this application, the first target beam or the first audio processing output is used to update the second filter, and the first audio processing output is used to update the third filter, so that the second filter can extract the interference beam more accurately and the third filter can weight the first interference beam more accurately, thereby improving the accuracy of the audio processing output.
  • the second filter and the third filter may be implemented as adaptive filters, such as the LMS (least mean square) adaptive filter or the NLMS (normalized least mean square) adaptive filter
  • taking the NLMS adaptive filter as an example, the update formulas are:
  • y(k) = w(k)^T x(k)
  • e(k) = d(k) - y(k)
  • w(k+1) = w(k) + μ·e(k)·x(k) / (x(k)^T x(k))
  • w(0) is the initial weight matrix of the filter
  • μ is the update step size
  • y(k) is the estimated noise
  • w(k) is the weight matrix before the filter update
  • w(k+1) is the weight matrix after the update
  • x(k) is the input value
  • e(k) is the denoised speech
  • d(k) is the noisy speech
  • k is the number of iterations.
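A minimal sketch of one NLMS iteration and its use for system identification (the step size, tap count, and data below are illustrative assumptions, not values from the patent):

```python
import numpy as np

def nlms_step(w, x, d, mu=0.5, eps=1e-8):
    """One NLMS iteration: y(k) = w^T x(k) is the estimated noise,
    e(k) = d(k) - y(k) is the denoised output, and
    w(k+1) = w(k) + mu * e(k) * x(k) / (x^T x + eps)."""
    y = w @ x
    e = d - y
    w_next = w + mu * e * x / (x @ x + eps)
    return w_next, e

# Identify a known 4-tap filter from reference/observation pairs.
rng = np.random.default_rng(3)
w_true = np.array([0.5, -0.2, 0.1, 0.05])
w = np.zeros(4)                       # w(0): initial weight vector
for _ in range(2000):
    x = rng.standard_normal(4)        # x(k): input (reference) signal
    d = w_true @ x                    # d(k): observed signal to cancel
    w, e = nlms_step(w, x, d)
```

The normalization by `x^T x` makes the step size insensitive to the input signal's power, which is why NLMS is commonly preferred over plain LMS for speech, whose energy varies strongly over time.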
  • take as an example that the first weight matrix of the first filter is W_2, the second weight matrix of the second filter is W_b, and the third weight matrix of the third filter is W_anc
  • the first filter is updated according to the updated second filter and the third filter.
  • the first filter is obtained by calculation according to the relative relationship among the first filter, the second filter and the third filter.
  • a filter processes the input audio signal with its weight matrix: the filter multiplies the input audio signal by the corresponding weight matrix to obtain the filtered audio signal output.
  • the first weight matrix can be calculated as follows: after the update is completed, the product of the second weight matrix and the third weight matrix is determined as the target matrix, and then the difference between the identity matrix and the target matrix is determined as the first weight matrix, i.e., W_2 = I - W_b W_anc.
  • the first weight matrix is W 2
  • the second weight matrix is W b
  • the third weight matrix is W anc
  • the second filter 403 is adaptively updated using the first target beam output from the first filter 401
  • the third filter 404 is adaptively updated using the output of the first audio processing.
  • the first filter 401 is then updated with the updated second filter 403 and third filter 404 .
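The weight-matrix relationship W_2 = I - W_b·W_anc described above can be sketched as follows (matrix shapes and values are illustrative assumptions):

```python
import numpy as np

def update_first_filter(Wb, Wanc):
    """After the second and third filters have adapted, recompute the first
    filter's weight matrix as the identity minus the product of the second
    and third weight matrices: W2 = I - Wb @ Wanc."""
    k = Wb.shape[0]
    return np.eye(k) - Wb @ Wanc

rng = np.random.default_rng(4)
Wb = rng.standard_normal((4, 4))      # adapted second (blocking) filter
Wanc = rng.standard_normal((4, 4))    # adapted third (cancellation) filter
W2 = update_first_filter(Wb, Wanc)
```

This closes the loop: the overall output X_W W_2 - X_W W_b W_anc stays consistent with the newly adapted blocking and cancellation paths.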
  • by updating in this way, the first filter, the second filter, and the third filter can track changes in the steering vector of the target sound source in real time, and the filters are updated in time
  • the filters updated in real time are used to process the audio signals collected by the microphones next time, so that the filters produce the audio processing output according to scene changes, ensuring sound quality when the interference moves
  • the tracking performance of the filters reduces the problem of interference leakage.
  • the audio signal processing method provided by this application uses the processed data to update the first, second, and third filters in real time, so that the filters can adjust in real time to changes in the steering vector of the target sound source; the filters can thus be applied to scenes where the interfering noise changes constantly, ensuring the tracking performance of the filters when the interference moves and reducing the problem of interference leakage.
  • FIG. 7 shows a flowchart of an audio signal processing method provided by an exemplary embodiment of the present application.
  • the method can be applied to the audio signal processing system shown in FIG. 1, and the method is executed by an audio signal processing device.
  • the method may include the following steps:
  • Step 501 Acquire audio signals collected by different microphones in the microphone array.
  • the microphone array includes n target directions, and each target direction corresponds to a filter bank
  • the filter bank uses any of the above methods to process the audio signal, and n is a positive integer greater than 1.
  • the microphone array may be set with multiple target directions, and the number of target directions may be arbitrary.
  • a filter bank is obtained by training according to each target direction, and the filter uses the method shown in FIG. 4 to process the audio signal.
  • the filter bank may be any one of the filter banks shown in FIG. 5 or FIG. 6 .
  • the filter groups corresponding to different target directions are different.
  • a filter bank corresponding to the target direction is obtained by training the audio signal in the target direction as the target speech.
  • the microphone array is set with four target directions, and the four target directions correspond to four filter banks: GSC_1, GSC_2, GSC_3, and GSC_4.
  • Each target direction corresponds to a filter bank.
  • the filter group includes: a first filter, a second filter, and a third filter; or, a pre-filter, a first filter, a second filter, and a third filter.
  • the pre-filter is trained using the training data in the i-th target direction collected by the microphone array.
  • Step 502: For the audio signals corresponding to the n target directions, filter the audio signals using the corresponding filter banks respectively, to obtain n first audio processing outputs corresponding to the n target directions.
  • the audio signal matrix X_W composed of the audio signals is input into the four filter banks respectively to obtain the first audio processing outputs Y_1, Y_2, Y_3, Y_4 corresponding to the four target directions respectively.
  • the first filter, the second filter, and the third filter in the filter group are updated in real time according to the filtering result.
  • Step 503: Filter the i-th first audio processing output according to the n-1 first audio processing outputs other than the i-th first audio processing output, to obtain the i-th second audio processing output corresponding to the i-th target direction, where i is a positive integer greater than 0 and less than n; repeat this step to obtain the second audio processing outputs corresponding to the n target directions respectively.
  • the ith first audio processing output is the target speech
  • the first audio processing outputs in other target directions are interference speech.
  • the audio signal in the i-th target direction is the target voice
  • the audio signals in other target directions are interference signals
  • the i-th first audio processing output corresponding to the i-th target direction is used as the target beam
  • the n-1 first audio processing outputs corresponding to other target directions are used as interference beams
  • the n-1 first audio processing outputs are filtered by the i-th fourth filter to obtain a third interference beam
  • the third interference beam is used to filter the i-th first audio processing output, improving the accuracy of the output audio processing result for the i-th target direction.
  • the n-1 first audio processing outputs other than the i-th first audio processing output are determined as the i-th interference group, where i is a positive integer greater than 0 and less than n; the i-th interference group is filtered through the i-th fourth filter corresponding to the i-th target direction to obtain the i-th third interference beam, the fourth filter being used to apply weighted adjustment to the interference group; the difference between the i-th first audio processing output and the i-th third interference beam is determined as the i-th second audio processing output; the i-th fourth filter is adaptively updated according to the i-th second audio processing output.
  • the ith fourth filter corresponds to the ith target direction.
  • the first target direction is taken as the direction of the target voice
  • the first audio processing outputs Y_2, Y_3, Y_4 in the second, third, and fourth target directions are used as the first interference group
  • the first interference group is input into the first fourth filter 601 to obtain the first third interference beam, and the first third interference beam is subtracted from the first first audio processing output Y_1 to obtain the first second audio processing output Z_1.
  • the second target direction is taken as the direction of the target voice
  • the first audio processing outputs Y_1, Y_3, Y_4 in the first, third, and fourth target directions are used as the second interference group
  • the second interference group is input into the second fourth filter 602 to obtain the second third interference beam
  • the second third interference beam is subtracted from the second first audio processing output Y_2 to obtain the second second audio processing output Z_2.
  • the second fourth filter 602 is adaptively updated according to the second second audio processing output Z_2.
  • the third target direction is taken as the direction of the target voice
  • the first audio processing outputs Y_1, Y_2, Y_4 in the first, second, and fourth target directions are used as the third interference group
  • the third interference group is input into the third fourth filter 603 to obtain the third third interference beam
  • the third third interference beam is subtracted from the third first audio processing output Y_3 to obtain the third second audio processing output Z_3.
  • the fourth target direction is taken as the direction of the target voice; the first audio processing outputs Y_1, Y_2, Y_3 in the first, second, and third target directions are used as the fourth interference group, which is input into the fourth fourth filter 604 to obtain the fourth third interference beam; the fourth third interference beam is subtracted from the fourth first audio processing output Y_4 to obtain the fourth second audio processing output Z_4.
  • the fourth fourth filter 604 is adaptively updated according to the fourth second audio processing output Z_4.
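The four-direction cross-cancellation above generalizes to n directions; a compact sketch follows (the filter weights and shapes are illustrative assumptions, and the adaptive update of the fourth filters is omitted):

```python
import numpy as np

def second_stage(Y, W4):
    """For each of the n first-stage outputs Y_i, filter the other n-1
    outputs (the i-th interference group) with the i-th fourth filter
    and subtract, yielding the i-th second audio processing output:
    Z_i = Y_i - W4_i @ [Y_1..Y_n without Y_i]."""
    Z = np.empty_like(Y)
    for i in range(Y.shape[0]):
        group = np.delete(Y, i, axis=0)   # i-th interference group
        Z[i] = Y[i] - W4[i] @ group       # weighted interference removed
    return Z

rng = np.random.default_rng(5)
Y = rng.standard_normal((4, 50))          # Y_1..Y_4, 50 frames each
W4 = rng.standard_normal((4, 3)) * 0.1    # one (n-1)-weight filter per direction
Z = second_stage(Y, W4)
```

Each direction thus treats its own first-stage output as the target and the remaining outputs as interference references, mirroring the Z_1..Z_4 construction above.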
  • the audio signal processing method provided by this application performs audio processing on the collected audio signals in multiple target directions to obtain multiple audio processing outputs corresponding to the multiple target directions, and uses the audio processing outputs in the other directions to remove the interference in the audio processing output of a given direction, improving the accuracy of the audio processing output in that direction.
  • microphones are respectively set in the main driver's seat, the co-pilot seat, and the two passenger seats of the vehicle to form a microphone array, which is used to collect the voice interaction commands issued by the driver or passengers.
  • the method of FIG. 4 or FIG. 7 is used to filter the audio signal to obtain the first audio processing output or the second audio processing output, and a speech recognition algorithm performs speech recognition or semantic recognition on the first or second audio processing output to recognize the voice interaction command issued by the driver or a passenger, so that the on-board computer system responds according to the voice interaction command.
  • four target directions are determined; the four target directions are respectively used to receive the voice interaction instructions of the driver in the driver's seat and the voice interaction instructions of the passengers sitting in the front passenger seat and the passenger seats.
  • the method of FIG. 4 or FIG. 7 is used to filter the audio signal, with each target direction in turn treated as the target speech, to obtain the audio processing outputs corresponding to the four target directions respectively
  • each audio processing output strengthens the audio signal in the selected target direction and suppresses interference from the other target directions, thereby improving the accuracy of the audio processing output and making it easier for the speech recognition algorithm to recognize the voice commands in the signal.
  • (1) in Figure 12 is a dual-channel spectrogram collected by microphones placed at the driver's seat and the front passenger seat respectively, where the upper part is the spectrogram of the driver's seat and the lower part is the spectrogram of the front passenger seat.
  • (2) in Figure 12 is the dual-channel spectrogram obtained by filtering the collected audio signals with the pre-filter provided by this application. Comparing (1) and (2) clearly shows that the data-trained pre-filter achieves a spatial filtering effect on the speech, and the interference in both channels is greatly reduced.
  • (3) in Figure 12 is the dual-channel spectrogram obtained by processing the audio signals with the data-trained pre-filter combined with traditional GSC processing; (3) has less interference leakage than (2).
  • (1) in Figure 13 is the dual-channel spectrogram obtained by processing the audio signals with the audio signal processing method (fully blind GSC structure) shown in Figure 7; interference is further reduced compared with (3) in Figure 12. This is because the left channel in the separated sound sources is the moving sound source in the experiment.
  • (3) in Figure 12 shows that the traditional GSC structure cannot track the changes of a moving sound source well, whereas
  • (1) in Figure 13, although it uses no data-dependent pre-filter, tracks the changes of the moving sound source well and therefore suppresses interfering speech better.
  • (2) in Figure 13 is the dual-channel spectrogram obtained by processing the audio signals with the audio signal processing method shown in Figure 4: the audio signals are filtered by a pre-filter combined with the fully blind GSC structure, combining a data-dependent pre-filter with tracking of the moving interfering sound source for the best results.
  • FIG. 14 shows a block diagram of an audio signal processing apparatus provided by an exemplary embodiment of the present application.
  • the apparatus is configured to execute all or part of the steps of the method of the embodiment shown in FIG. 4. As shown in FIG. 14, the apparatus may include:
  • a first acquisition module 701, configured to acquire audio signals collected by different microphones in the microphone array
  • the first filtering module 702 is configured to filter the audio signal through a first filter to obtain a first target beam, and the first filter is used to suppress interfering speech in the audio signal and enhance the audio signal in the audio signal. target voice;
  • a second filtering module 703, configured to filter the audio signal through a second filter to obtain a first interference beam, and the second filter is used to suppress the target speech and enhance the interference speech;
  • a third filtering module 704 configured to obtain a second interference beam of the first interference beam through a third filter, and the third filter is used to weight and adjust the first interference beam;
  • a first determining module 705, configured to determine the difference between the first target beam and the second interference beam as the first audio processing output
  • a first update module 706, configured to adaptively update at least one of the second filter and the third filter, and update the first filter according to the second filter and the third filter after the update is completed.
  • the first filter corresponds to a first weight matrix
  • the second filter corresponds to a second weight matrix
  • the third filter corresponds to a third weight matrix
  • the first update module 706 is further configured to, after the update is completed, calculate and obtain the first weight matrix according to the second weight matrix and the third weight matrix;
  • the first update module 706 is further configured to update the first filter according to the first weight matrix.
  • the first update module 706 is further configured to determine the product of the second weight matrix and the third weight matrix as the target matrix after the update is completed, and to determine the difference between the identity matrix and the target matrix as the first weight matrix.
  • the first update module 706 is further configured to:
  • update the second filter according to the first target beam and update the third filter according to the first audio processing output; or, update the second filter and the third filter according to the first audio processing output; or, update only the second filter according to the first target beam; or, update only the second filter according to the first audio processing output; or, update only the third filter according to the first audio processing output.
  • the apparatus further includes:
  • a pre-filtering module 707 configured to first filter the audio signal through a pre-filter to obtain a pre-target beam, the pre-filter is a filter calculated using training data, and the pre-filter is used to suppress the interference speech and enhance the target speech;
  • the first filtering module 702 is further configured to perform a second filtering on the pre-target beam through the first filter to obtain the first target beam.
  • the apparatus further includes:
  • the first acquisition module 701 is further configured to acquire training data collected by the microphone array in an application environment, where the application environment is the spatial range in which the microphone array is placed and used, and the training data includes sample audio signals collected by the different microphones in the microphone array;
  • the calculation module 708 is configured to calculate the pre-filter from the training data according to the linearly constrained minimum variance (LCMV) criterion.
  • FIG. 15 shows a block diagram of an audio signal processing apparatus provided by an exemplary embodiment of the present application; the apparatus is used to execute all or part of the steps of the method of the embodiment shown in FIG. 7. As shown in FIG. 15,
  • the apparatus may include:
  • the second acquisition module 801 is configured to acquire audio signals collected by different microphones in the microphone array, where the microphone array includes n target directions, each target direction corresponds to a filter bank, the filter bank processes the audio signal using the method of any of the embodiments shown in FIG. 4, and n is a positive integer greater than 1;
  • a filter bank module 802 configured to filter the audio signals using the corresponding filter banks for the audio signals corresponding to the n target directions, to obtain n first audio signals corresponding to the n target directions audio processing output;
  • the fourth filtering module 803 is configured to filter the i-th first audio processing output according to the n-1 first audio processing outputs other than the i-th first audio processing output, to obtain the i-th second audio processing output corresponding to the i-th target direction.
  • the apparatus further includes:
  • the fourth filtering module 803 is further configured to determine the n-1 first audio processing outputs except the ith first audio processing output as the ith interference group;
  • the fourth filtering module 803 is further configured to filter the i-th interference group through the i-th fourth filter corresponding to the i-th target direction to obtain the i-th third interference beam, the i-th fourth filter being used to apply weighted adjustment to the interference group;
  • a second determining module 804 configured to determine the difference between the i-th first audio processing output and the i-th third interference beam as the i-th second audio processing output;
  • the second updating module 805 is configured to adaptively update the i-th fourth filter according to the i-th second audio processing output.
  • the i-th filter bank includes a pre-filter, and the pre-filter is obtained by training using the training data in the i-th target direction collected by the microphone array.
  • Fig. 16 is a structural block diagram of a computer device according to an exemplary embodiment.
  • the computer device can be implemented as the audio signal processing device in the above solution of the application.
  • the computer device 900 includes a central processing unit (CPU) 901, a system memory 904 including a random access memory (RAM) 902 and a read-only memory (ROM) 903, and a system bus 905 that connects the system memory 904 and the central processing unit 901.
  • the computer device 900 also includes a basic input/output system (I/O system) 906 that helps transfer information between the various components within the computer, and a mass storage device 907 for storing an operating system 913, application programs 914, and other program modules 915.
  • the basic input/output system 906 includes a display 908 for displaying information and an input device 909 such as a mouse, keyboard, etc., for the user to input information.
  • the display 908 and the input device 909 are both connected to the central processing unit 901 through the input and output controller 910 connected to the system bus 905 .
  • the basic input/output system 906 may also include an input output controller 910 for receiving and processing input from a number of other devices such as a keyboard, mouse, or electronic stylus.
  • input output controller 910 also provides output to a display screen, printer, or other type of output device.
  • the computer device 900 may also operate by connecting to a remote computer on a network through a network such as the Internet; that is, the computer device 900 can connect to the network 912 through a network interface unit 911 connected to the system bus 905, or the network interface unit 911 can be used to connect to other types of networks or remote computer systems (not shown).
  • the memory also includes one or more programs stored in the memory, and the central processing unit 901 implements all or part of the steps of the methods shown in FIG. 4 or FIG. 7 by executing the one or more programs.
  • Embodiments of the present application further provide a computer-readable storage medium for storing computer software instructions used by the above-mentioned computer device, including a program designed for executing the above-mentioned audio signal processing method.
  • the computer-readable storage medium may be ROM, RAM, CD-ROM, magnetic tape, floppy disk, optical data storage device, and the like.
  • Embodiments of the present application further provide a computer-readable storage medium, where at least one instruction, at least one piece of program, code set or instruction set is stored in the storage medium, the at least one instruction, the at least one piece of program, the code A set or set of instructions is loaded and executed by the processor to implement all or part of the steps of the audio signal processing method as described above.
  • Embodiments of the present application also provide a computer program product or computer program, where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the audio signal processing method provided in the foregoing optional implementation manner.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Otolaryngology (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Obtaining Desirable Characteristics In Audible-Bandwidth Transducers (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)
  • Noise Elimination (AREA)

Abstract

This application discloses an audio signal processing method, apparatus, device, and storage medium, belonging to the field of speech processing. The method includes: acquiring audio signals collected by different microphones in a microphone array; filtering the audio signal through a first filter to obtain a first target beam; filtering the audio signal through a second filter to obtain a first interference beam; obtaining a second interference beam of the first interference beam through a third filter; determining the difference between the first target beam and the second interference beam as a first audio processing output; adaptively updating at least one of the second filter and the third filter, and updating the first filter according to the second filter and the third filter after the update is completed. This method can reduce interference leakage when the interference moves.

Description

Audio signal processing method, apparatus, device, and storage medium
This application claims priority to Chinese patent application No. 202010693891.9, entitled "Audio signal processing method, apparatus, device, and storage medium" and filed with the Chinese Patent Office on July 17, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of speech processing, and in particular to audio signal processing technology.
Background
In voice communication, the speech signal collected by a microphone is always subject to interference from ambient noise. Speech enhancement is an important branch of speech signal processing. It is widely used in noise suppression in noisy environments, speech compression coding, speech recognition, and other fields, and plays an increasingly important role in solving the problem of speech noise pollution, improving voice communication quality, and raising speech intelligibility and speech recognition rates.
In the related art, the Generalized Sidelobe Canceller (GSC) algorithm is used for speech enhancement. GSC designs a filter in advance by means of convex optimization and removes interference through this filter, thereby obtaining better beam performance.
The method in the related art uses a pre-designed filter and does not take into account the influence of a moving interfering sound source on the processing result, so the resulting sound source separation effect is poor.
Summary
This application provides an audio signal processing method, apparatus, device, and storage medium, which can reduce interference leakage when the interference moves. The technical solution is as follows:
According to one aspect of the embodiments of this application, an audio signal processing method is provided, the method being executed by an audio signal processing device, and the method including:
acquiring audio signals collected by different microphones in a microphone array;
filtering the audio signal through a first filter to obtain a first target beam, the first filter being used to suppress interfering speech in the audio signal and enhance target speech in the audio signal;
filtering the audio signal through a second filter to obtain a first interference beam, the second filter being used to suppress the target speech and enhance the interfering speech;
obtaining a second interference beam of the first interference beam through a third filter, the third filter being used to apply weighted adjustment to the first interference beam;
determining the difference between the first target beam and the second interference beam as a first audio processing output;
adaptively updating at least one of the second filter and the third filter, and after the update is completed, updating the first filter according to the second filter and the third filter.
According to another aspect of the embodiments of this application, an audio signal processing method is provided, the method being executed by an audio signal processing device, and the method including:
acquiring audio signals collected by different microphones in a microphone array, the microphone array including n target directions, each target direction corresponding to a filter bank, the filter bank processing the audio signal using the above method, n being a positive integer greater than 1;
for the audio signals corresponding to the n target directions, filtering the audio signals using the corresponding filter banks respectively, to obtain n first audio processing outputs corresponding to the n target directions;
filtering the i-th first audio processing output according to the n-1 first audio processing outputs other than the i-th first audio processing output, to obtain the i-th second audio processing output corresponding to the i-th target direction, i being a positive integer greater than 0 and less than n; repeating this step to obtain second audio processing outputs respectively corresponding to the n target directions.
According to another aspect of the embodiments of this application, an audio signal processing apparatus is provided, deployed on an audio signal processing device, the apparatus including:
a first acquisition module, configured to acquire audio signals collected by different microphones in a microphone array;
a first filtering module, configured to filter the audio signal through a first filter to obtain a first target beam, the first filter being used to suppress interfering speech in the audio signal and enhance target speech in the audio signal;
a second filtering module, configured to filter the audio signal through a second filter to obtain a first interference beam, the second filter being used to suppress the target speech and enhance the interfering speech;
a third filtering module, configured to obtain a second interference beam of the first interference beam through a third filter, the third filter being used to apply weighted adjustment to the first interference beam;
a first determining module, configured to determine the difference between the first target beam and the second interference beam as a first audio processing output;
a first update module, configured to adaptively update at least one of the second filter and the third filter, and after the update is completed, update the first filter according to the second filter and the third filter.
According to another aspect of the embodiments of this application, an audio signal processing apparatus is provided, deployed on an audio signal processing device, the apparatus including:
a second acquisition module, configured to acquire audio signals collected by different microphones in a microphone array, the microphone array including n target directions, each target direction corresponding to a filter bank, the filter bank processing the audio signal using the first audio signal processing method described above;
a filter bank module, configured to, for the audio signals corresponding to the n target directions, filter the audio signals using the corresponding filter banks respectively, to obtain n first audio processing outputs corresponding to the n target directions;
a fourth filtering module, configured to filter the i-th first audio processing output according to the n-1 first audio processing outputs other than the i-th first audio processing output, to obtain the i-th second audio processing output corresponding to the i-th target direction, i being a positive integer greater than 0 and less than n; this step is repeated to obtain second audio processing outputs respectively corresponding to the n target directions.
According to another aspect of the embodiments of this application, a computer device is provided, the computer device including a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, the at least one instruction, the at least one program, the code set, or the instruction set being loaded and executed by the processor to implement the audio signal processing method described in any of the above optional solutions.
According to another aspect of the embodiments of this application, a computer-readable storage medium is provided, the storage medium storing at least one instruction, at least one program, a code set, or an instruction set, the at least one instruction, the at least one program, the code set, or the instruction set being loaded and executed by a processor to implement the audio signal processing method described in any of the above optional solutions.
According to another aspect of the embodiments of this application, a computer program product or computer program is provided, the computer program product or computer program including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the audio signal processing method provided in the above optional implementations.
The technical solution provided by this application may include the following beneficial effects:
By updating the first filter according to the second filter and the third filter, the first, second, and third filters can track changes in the steering vector of the target sound source in real time and be updated in time; the filters updated in real time are used to process the audio signals collected by the microphones next time, so that the filters produce the audio processing output according to scene changes, ensuring the tracking performance of the filters when the interference moves and reducing the problem of interference leakage.
Brief Description of the Drawings
The accompanying drawings, which are incorporated into and constitute a part of this specification, illustrate embodiments consistent with this application and, together with the specification, serve to explain the principles of this application.
FIG. 1 is a schematic diagram of an audio signal processing system according to an exemplary embodiment;
FIG. 2 is a schematic diagram of a microphone distribution provided by another exemplary embodiment of this application;
FIG. 3 is a schematic diagram of a microphone distribution provided by another exemplary embodiment of this application;
FIG. 4 is a flowchart of an audio signal processing method provided by another exemplary embodiment of this application;
FIG. 5 is a schematic diagram of a filter composition provided by another exemplary embodiment of this application;
FIG. 6 is a schematic diagram of a filter composition provided by another exemplary embodiment of this application;
FIG. 7 is a flowchart of an audio signal processing method provided by another exemplary embodiment of this application;
FIG. 8 is a schematic diagram of a filter composition provided by another exemplary embodiment of this application;
FIG. 9 is a schematic diagram of a filter composition provided by another exemplary embodiment of this application;
FIG. 10 is a schematic diagram of a filter composition provided by another exemplary embodiment of this application;
FIG. 11 is a schematic diagram of a filter composition provided by another exemplary embodiment of this application;
FIG. 12 is a dual-channel spectrogram provided by another exemplary embodiment of this application;
FIG. 13 is a dual-channel spectrogram provided by another exemplary embodiment of this application;
FIG. 14 is a block diagram of an audio signal processing apparatus provided by another exemplary embodiment of this application;
FIG. 15 is a block diagram of an audio signal processing apparatus provided by another exemplary embodiment of this application;
FIG. 16 is a structural block diagram of a computer device according to an exemplary embodiment.
具体实施方式
这里将详细地对示例性实施例进行说明,其示例表示在附图中。下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本申请相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本申请的一些方面相一致的装置和方法的例子。
应当理解的是,在本文中提及的“若干个”是指一个或者多个,“多个”是指两个或两个以上。“和/或”,描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。字符“/”一般表示前后关联对象是一种“或”的关系。
随着人工智能技术的研究和进步,人工智能技术在多个领域展开研究和应用,例如常见的智能家居、智能穿戴设备、虚拟助理、智能音箱、智能营销、无人驾驶、自动驾驶、无人机、机器人、智能医疗、智能客服等,相信随着技术的发展,人工智能技术将在更多的领域得到应用,并发挥越来越重要的价值。
本申请涉及智能家居技术领域,特别涉及一种音频信号处理方法。
首先,对本申请涉及的一些名词进行解释。
1)人工智能(Artificial Intelligence,AI)
人工智能是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个综合技术,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式做出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。
人工智能技术是一门综合学科,涉及领域广泛,既有硬件层面的技术也有软件层面的技术。人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括计算机视觉技术、语音处理技术、自然语言处理技术以及机器学习/深度学习等几大方向。
2)语音技术(Speech Technology)
语音技术的关键技术有自动语音识别技术(Automatic Speech Recognition,ASR)和语音合成技术(Text To Speech,TTS)以及声纹识别技术。让计算机能听、能看、能说、能感觉,是未来人机交互的发展方向,其中语音成为未来最被看好的人机交互方式之一。
3)传声器
传声器俗称话筒、麦克风,是电声设备中的第一个环节。传声器是把声能转变为机械能,然后再把机械能转变为电能的换能器。目前,人们利用各种换能原理制成了各种各样的传声器,录音中常用的有电容、动圈、铝带传声器等。
图1是根据一示例性实施例示出的音频信号处理系统的示意图。如图1所示,音频信号处理系统100包括麦克风阵列101和音频信号处理设备102。
其中,麦克风阵列101包括设置在至少两个不同位置的至少两个麦克风。麦克风阵列101用于对声场的空间特性进行采样并处理,从而利用麦克风阵列101接收到的音频信号来计算目标说话人的角度和距离,从而实现对目标说话人的跟踪以及后续的语音定向拾取。示例性的,麦克风阵列101是设置在车载场景中的。当麦克风阵列包括两个麦克风时,两个麦克风分别设置在主驾驶位置附近和副驾驶位置附近,根据麦克风在空间中的位置分布可以将麦克风阵列分为紧凑型和分布型,例如,如图2中的(1)所示,给出了一种紧凑型的麦克风阵列,两个麦克风分别设置在主驾驶位201和副驾驶位202的内侧;再如,如图2中的(2)所示,给出了一种分布型的麦克风阵列,两个麦克风分别设置在主驾驶位201和副驾驶位202的外侧。当麦克风阵列包括四个麦克风时,四个麦克风分别设置在主驾驶位附近、副驾驶位附近以及两个乘客位附近,例如,如图3中的(1)所示,给出了一种紧凑型的麦克风阵列,四个麦克风分别设置在主驾驶位201、副驾驶位202以及两个乘客位203的内侧,再如,如图3中的(2)所示,给出了一种分布型的麦克风阵列,四个麦克风分别设置在主驾驶位201、副驾驶位202以及两个乘客位203的外侧,再如,如图3中的(3)所示,给出了另一种分布型的麦克风阵列,四个麦克风分别设置在主驾驶位201、副驾驶位202以及两个乘客位203的上方。
音频信号处理设备102与麦克风阵列101相连,用于处理麦克风阵列采集到的音频信号。在一个示意性的例子中,音频信号处理设备包含处理器103和存储器104,存储器104中存储有至少一条指令、至少一段程序、代码集或指令集,至少一条指令、至少一段程序、代码集或指令集由处理器103加载并执行音频信号处理方法。示例性的,音频信号处理设备可以实现为车载语音识别系统中的一部分。在一个示意性的例子中,音频信号处理设备还用于在对麦克风采集的音频信号进行音频信号处理得到音频处理输出后,对音频处理输出进行语音识别,得到语音识别结果,或对语音识别结果做出响应。示例性的,音频信号处理设备还可以包括主板、外部输出/输入设备、存储器、外部接口、触控系统以及电源。
其中,主板中集成有处理器和控制器等处理元件,该处理器可以是音频处理芯片。
外部输出/输入设备可以包括显示组件(比如显示屏)、声音播放组件(比如扬声器)、声音采集组件(比如麦克风)以及各类按键等,该声音采集组件可以是麦克风阵列。
存储器中存储有程序代码和数据。
外部接口可以包括耳机接口、充电接口以及数据接口等。
触控系统可以集成在外部输出/输入设备的显示组件或者按键中,触控系统用于检测用户在显示组件或者按键上执行的触控操作。
电源用于对终端中的其它各个部件进行供电。
在本申请实施例中,主板中的处理器可以通过执行或者调用存储器中存储的程序代码和数据来得到音频处理输出,对音频处理输出进行语音识别得到语音识别结果,将生成的语音识别结果通过外部输出/输入设备进行播放,或,根据语音识别结果响应语音识别结果中的用户指令。在音频内容播放的过程中,可以通过触控系统检测用户与触控系统交互时执行的按键或者其它操作等等。
由于在现实中声源的位置是不断变化的,会对麦克风收音造成影响,因此,在本申请实施例中,为提高语音交互设备的收音效果,该语音交互设备的声音采集组件可以是由一定数目的声学传感器(一般是麦克风)组成的麦克风阵列,用于对声场的空间特性进行采样并处理,从而利用麦克风阵列接收到的音频信号来计算目标说话人的角度和距离,从而实现对目标说话人的跟踪以及后续的语音定向拾取。
本实施例提供了一种对采集到的音频信号进行处理,以抑制音频信号中的干扰信号、得到更准确的目标信号的方法,下面以将该方法应用于处理车载麦克风阵列采集到的音频信号为例进行说明。
请参考图4,其示出了本申请一个示例性实施例提供的一种音频信号处理方法的流程图,该方法可以应用于图1所示的音频信号处理系统中,该方法由音频信号处理设备执行。如图4所示,该方法可以包括以下步骤:
步骤301,获取麦克风阵列中不同麦克风采集的音频信号。
示例性的,该音频信号是多通道的声源信号,其中,通道的数量可以对应于麦克风阵列中所包含的麦克风的个数,比如,若该麦克风阵列包含4个麦克风,那么麦克风阵列会采集到四个音频信号。示例性的,该音频信号包括发布语音命令的对象所发出的目标语音和环境噪声的干扰语音。
示例性的,每个音频信号所记录的声源内容都是一致的,比如,对于某一采样点的音频信号,在该麦克风阵列包含四个麦克风的情况下,有4个与之对应的音频信号,每个音频信号都记录了该采样点声源信号的内容,只是由于麦克风阵列中,每个麦克风与声源之间的方位和/或距离不同,从而导致了各个麦克风所接收到的声源信号的频率、强度等存在差异,从而使得音频信号存在差异。
步骤302,通过第一滤波器对音频信号进行滤波得到第一目标波束,第一滤波器用于抑制音频信号中的干扰语音且增强音频信号中的目标语音。
示例性的,第一滤波器用于对音频信号进行滤波,增强音频信号中的目标语音、抑制音频信号中的干扰语音。示例性的,第一滤波器对应有第一权重矩阵,第一权重矩阵的初始值可以由技术人员根据经验设置,或,任意设置。示例性的,第一滤波器是实时更新的滤波器,第一滤波器会随着第二滤波器和第三滤波器的自适应更新而更新,根据第二滤波器、第三滤波器的权重矩阵对干扰语音的增强和对目标语音的抑制,来确定第一滤波器对干扰语音的抑制和对目标语音的增强。
示例性的,目标语音是在目标方向上接收的音频信号,干扰语音是在除目标方向外的其他方向上接收到的音频信号。示例性的,目标语音是发布语音命令的对象所发出的语音信号。
例如,如图5所示,音频信号组成音频信号矩阵X_W,第一滤波器401对应的第一权重矩阵为W_2,则音频信号经过第一滤波器401滤波得到的第一目标波束为X_W W_2。
示例性的,在第一滤波器前还可以设置预滤波器,则步骤302还包括步骤3021至步骤3022。
步骤3021,通过预滤波器对音频信号进行第一滤波得到预目标波束,预滤波器是使用训练数据计算得到的滤波器,预滤波器用于抑制干扰语音且增强目标语音。
步骤3022,通过第一滤波器对预目标波束进行第二滤波,得到第一目标波束。
示例性的,预滤波器是利用训练数据计算得到的滤波器。预滤波器也用于增强音频信号中的目标语音并抑制干扰语音。示例性的,预滤波器是根据线性约束最小方差(Linearly Constrained Minimum-Variance,LCMV)准则计算得到的滤波器,预滤波器在计算得到后即为固定值,不会迭代更新。
例如,如图6所示,音频信号组成音频信号矩阵X_W,预滤波器402对应的预权重矩阵为W,第一滤波器401对应的第一权重矩阵为W_2,则音频信号经过预滤波器402得到的预目标波束为X_W W,预目标波束经过第一滤波器401滤波得到的第一目标波束为X_W W W_2。
示例性的,给出一种计算预滤波器的方法。获取麦克风阵列在应用环境中采集的训练数据,应用环境是麦克风阵列被放置使用的空间范围,训练数据包括麦克风阵列中不同麦克风采集的样本音频信号;根据线性约束最小方差LCMV准则计算训练数据得到预滤波器。
本申请提供的音频信号处理方法,通过在第一滤波器之前设置预先计算好的预滤波器,使预滤波器先对音频信号进行处理,提高目标语音分离的准确度,提高初始阶段滤波器对音频信号的处理能力。
示例性的,预滤波器是根据在实际的音频信号采集场景中采集到的实际数据计算得到的。本申请提供的音频信号处理方法,通过使用在应用环境中采集到的实际音频数据,来训练得到预滤波器,使预滤波器可以贴近实际应用场景,提高预滤波器与应用场景的贴合度,提高预滤波器对干扰的抑制效果。
示例性的,训练数据对应有目标方向,使用某个目标方向上的训练数据来训练该目标方向对应的预滤波器,使训练得到的预滤波器能够增强该目标方向上的目标语音,抑制其他方向上的干扰语音。
本申请提供的音频信号处理方法,通过使用根据目标方向上采集到的训练数据来训练得到预滤波器,使预滤波器可以更好地识别目标方向上的音频信号,提高预滤波器对其他方向的音频信号的抑制能力。

示例性的,以麦克风阵列包括四个麦克风为例,麦克风采集到的时域信号分别为mic_1、mic_2、mic_3、mic_4,将麦克风信号变换到频域得到频域信号X_W1、X_W2、X_W3、X_W4,将任意一个麦克风作为参考麦克风,可以得到其他麦克风的相对传递函数StrV_j,j为整数,若麦克风数量为k,则0<j≤k-1。以参考麦克风是第一麦克风为例,其他麦克风的相对传递函数StrV_j为:

StrV_j=X_Wj/X_W1
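上述相对传递函数的计算可以用如下Python(NumPy)代码示意。示例中的频域信号为随机生成的数据,变量名X、StrV仅为说明而设,并非本申请实现的一部分:

```python
import numpy as np

rng = np.random.default_rng(0)

# 4个麦克风在某一频点上的频域信号(随机生成的示例数据)
# X = [X_W1, X_W2, X_W3, X_W4]
X = rng.standard_normal(4) + 1j * rng.standard_normal(4)

# 以第1个麦克风为参考,计算其余麦克风的相对传递函数 StrV_j = X_Wj / X_W1
StrV = X[1:] / X[0]

print(StrV.shape)  # (3,)
```

实际系统中,相对传递函数通常按频点逐一计算,这里仅演示单个频点的情形。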
然后根据LCMV准则计算得到当前真实应用环境下的最优滤波器(预滤波器)。其中,LCMV准则的计算式为:

minimize J(W)=1/2(W^H R_xx W)

subject to C^H W=f

其闭式解为:

W=R_xx^{-1}C(C^H R_xx^{-1}C)^{-1}f
其中,W为预滤波器的权重矩阵;R_xx=E[XX^H],X=[X_W1,X_W2,X_W3,X_W4]^T;C为导向矢量;f=[1,ξ_1,ξ_2,ξ_3]为限定条件,在期望方向上ξ为1,在其他干扰零点方向上ξ设置为ξ_n(ξ_n=0或ξ_n≪1)。干扰零点的设置可以根据需要进行设置,保证对干扰的抑制能力即可。

步骤303,通过第二滤波器对音频信号进行滤波得到第一干扰波束,第二滤波器用于抑制目标语音且增强干扰语音。
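上文预滤波器的LCMV求解可以用如下Python(NumPy)代码做一个数值示意,按闭式解 W=R_xx^{-1}C(C^H R_xx^{-1}C)^{-1}f 计算。示例中的协方差矩阵R_xx、约束矩阵C和限定条件f均为随机构造的假设数据,仅作示意,并非本申请的实现:

```python
import numpy as np

def lcmv_weights(R_xx, C, f):
    """按LCMV闭式解计算滤波器权重:
    W = R_xx^{-1} C (C^H R_xx^{-1} C)^{-1} f
    用 solve 代替显式求逆,数值上更稳定。"""
    Rinv_C = np.linalg.solve(R_xx, C)
    return Rinv_C @ np.linalg.solve(C.conj().T @ Rinv_C, f)

# 4个麦克风、2个约束方向(期望方向增益1,干扰零点方向增益0)的示例
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
R_xx = A @ A.conj().T + 1e-3 * np.eye(4)   # 构造正定的协方差矩阵(示例数据)
C = rng.standard_normal((4, 2)) + 1j * rng.standard_normal((4, 2))  # 假设的导向矢量
f = np.array([1.0, 0.0])                   # 期望方向为1,干扰零点为0

W = lcmv_weights(R_xx, C, f)
# 验证约束 C^H W = f 成立
print(np.allclose(C.conj().T @ W, f))  # True
```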
第二滤波器用于抑制音频信号中的目标语音并增强干扰语音,尽量清晰地得到干扰语音的波束。示例性的,第二滤波器对应有第二权重矩阵,第二权重矩阵的初始值可以根据技术人员的经验设置。
例如,如图5所示,至少两个音频信号组成音频信号矩阵X_W,第二滤波器403对应的第二权重矩阵为W_b,则至少两个音频信号经过第二滤波器403滤波得到的第一干扰波束为X_W W_b。
步骤304,通过第三滤波器获取第一干扰波束的第二干扰波束,第三滤波器用于加权调整第一干扰波束。
第三滤波器用于对第二滤波器的输出进行二次滤波。示例性的,第三滤波器用于调整第一干扰波束中目标语音和干扰语音的权重,以便在步骤305中用目标波束减去干扰波束,从而去除目标波束中的干扰波束,得到准确的音频输出结果。
例如,如图5所示,音频信号组成音频信号矩阵X_W,第二滤波器403对应的第二权重矩阵为W_b,第三滤波器404对应的第三权重矩阵为W_anc,则至少两个音频信号经过第二滤波器403滤波得到的第一干扰波束为X_W W_b,第一干扰波束经过第三滤波器404滤波得到的第二干扰波束为X_W W_b W_anc。
步骤305,将第一目标波束与第二干扰波束之差确定为第一音频处理输出。
音频处理输出是经过滤波后得到的目标语音的波束。
例如,如图5所示,音频信号组成音频信号矩阵X_W,用第一滤波器输出的第一目标波束X_W W_2减去第三滤波器输出的第二干扰波束X_W W_b W_anc,得到第一音频处理输出Y_1=X_W W_2-X_W W_b W_anc。
再如,如图6所示,至少两个音频信号组成音频信号矩阵X_W,用第一滤波器输出的第一目标波束X_W W W_2减去第三滤波器输出的第二干扰波束X_W W_b W_anc,得到第一音频处理输出Y_1=X_W W W_2-X_W W_b W_anc。
示例性的,由于图6所示的滤波器组合使用了预滤波器进行初次滤波,在初始阶段滤波准确度较高,因此,分布型或紧凑型的麦克风阵列都可以采用这种方式进行滤波。示例性的,图5所示的滤波器组合没有使用预滤波器,不需要预先使用实际运行环境中采集的训练数据获得预滤波器,从而降低了滤波器组合对实际运行环境的依赖。
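第一目标波束、第二干扰波束与第一音频处理输出之间的关系可以用如下Python(NumPy)代码示意。示例中的矩阵维度与数值均为假设数据,函数名gsc_output仅为说明而设:

```python
import numpy as np

def gsc_output(X_W, W2, Wb, Wanc):
    """广义旁瓣抵消结构中一次滤波的前向计算(示意):
    第一目标波束      = X_W @ W2
    第一干扰波束      = X_W @ Wb
    第二干扰波束      = (X_W @ Wb) @ Wanc
    第一音频处理输出  Y1 = 第一目标波束 - 第二干扰波束"""
    target = X_W @ W2
    interference = X_W @ Wb @ Wanc
    return target - interference

rng = np.random.default_rng(2)
X_W = rng.standard_normal((1, 4))   # 某一帧4通道信号(行向量,示例数据)
W2 = rng.standard_normal((4, 1))    # 第一滤波器权重
Wb = rng.standard_normal((4, 3))    # 第二滤波器:抑制目标、增强干扰
Wanc = rng.standard_normal((3, 1))  # 第三滤波器:加权调整干扰波束

Y1 = gsc_output(X_W, W2, Wb, Wanc)
print(Y1.shape)  # (1, 1)
```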
步骤306,自适应更新第二滤波器和第三滤波器中的至少一个,在更新完成后根据第二滤波器和第三滤波器更新第一滤波器。
示例性的,根据滤波后得到的波束对第二滤波器和第三滤波器进行调整。示例性的,根据第一目标波束更新第二滤波器,根据第一音频处理输出更新第三滤波器;或,根据第一音频处理输出更新第二滤波器和第三滤波器;或,根据第一目标波束更新第二滤波器;或,根据第一音频处理输出更新第二滤波器;或,根据第一音频处理输出更新第三滤波器。
本申请提供的音频信号处理方法,通过使用第一目标波束或第一音频处理输出来更新第二滤波器,使用第一音频处理输出来更新第三滤波器,使第二滤波器能够得到更准确地干扰波束,更准确地抑制目标波束,使第三滤波器能够更准确地对第一干扰波束进行加权,进而提高音频处理输出的准确度。
示例性的,采用最小均方自适应滤波器(Least Mean Square,LMS)或归一化最小均方自适应滤波器(Normalized Least Mean Square,NLMS)的方法来自适应更新第二滤波器或第三滤波器。
示例性的,采用LMS算法对滤波器进行自适应更新的过程为:
1)给定w(0);
2)计算输出值:y(k)=w(k)^T x(k);
3)计算估计误差:e(k)=d(k)-y(k);
4)权重更新:w(k+1)=w(k)+μe(k)x(k)。
其中,w(0)是滤波器的初始权重矩阵,μ为更新步长,y(k)为估计噪声,w(k)为滤波器更新前的权重矩阵,w(k+1)为滤波器更新后的权重矩阵,x(k)为输入值,e(k)为降噪后语音,d(k)为带噪语音,k为迭代的次数。
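上述LMS更新步骤可以用如下Python(NumPy)代码示意。示例以一个无噪声的系统辨识问题演示权重收敛,其中w_true等变量均为假设数据,并非本申请的实现:

```python
import numpy as np

def lms_step(w, x, d, mu):
    """单步LMS自适应更新,对应上文的四个步骤:
    y(k) = w(k)^T x(k);  e(k) = d(k) - y(k);  w(k+1) = w(k) + mu*e(k)*x(k)"""
    y = w @ x           # 计算输出值
    e = d - y           # 计算估计误差
    return w + mu * e * x, e  # 权重更新

rng = np.random.default_rng(3)
w_true = np.array([0.5, -0.3, 0.2])  # 假设的真实权重
w = np.zeros(3)                      # 给定初始权重 w(0)
for _ in range(2000):
    x = rng.standard_normal(3)       # 输入值 x(k)
    d = w_true @ x                   # 期望信号 d(k)(无噪示例)
    w, _ = lms_step(w, x, d, mu=0.05)

print(np.allclose(w, w_true, atol=1e-3))  # True,权重收敛到真实值
```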
以由音频信号组成的音频信号矩阵为X_W,第一滤波器的第一权重矩阵为W_2,第二滤波器的第二权重矩阵为W_b,第三滤波器的第三权重矩阵为W_anc为例,使用第一音频处理输出Y_1=X_W W_2-X_W W_b W_anc,采用LMS算法对第二滤波器进行自适应更新得到更新后的第二权重矩阵为(W_b+μY_1X_W)。
示例性的,在第二滤波器和第三滤波器的更新完成后,根据更新后的第二滤波器和第三滤波器更新第一滤波器。示例性的,根据第一滤波器、第二滤波器和第三滤波器间的相对关系,来计算得到第一滤波器。
示例性的,若第一滤波器对应有第一权重矩阵,第二滤波器对应有第二权重矩阵,第三滤波器对应有第三权重矩阵,则在更新完成后根据第二滤波器和第三滤波器更新第一滤波器的实现方式可以是在更新完成后,根据第二权重矩阵和第三权重矩阵,计算得到第一权重矩阵,然后根据第一权重矩阵更新第一滤波器。示例性的,滤波器用权重矩阵来处理输入的音频信号。滤波器将输入的音频信号乘以滤波器对应的权重矩阵,得到滤波后输出的音频信号。
示例性的,在一些情况下,在更新完成后根据第二权重矩阵和第三权重矩阵,计算得到第一权重矩阵的方式可以是在更新完成后,将第二权重矩阵与第三权重矩阵之积确定为目标矩阵,然后将单位矩阵与目标矩阵之差确定为第一权重矩阵。
例如,第一权重矩阵为W_2,第二权重矩阵为W_b,第三权重矩阵为W_anc,则W_2=(I-W_b W_anc),其中I为单位矩阵。
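根据第二、第三权重矩阵计算第一权重矩阵的过程可以用如下Python(NumPy)代码示意,其中W_b、W_anc为假设的数值示例:

```python
import numpy as np

def update_first_filter(Wb, Wanc):
    """根据更新后的第二、第三滤波器计算第一滤波器:
    目标矩阵 = Wb @ Wanc;第一权重矩阵 W2 = I - 目标矩阵(I为单位矩阵)"""
    target = Wb @ Wanc
    return np.eye(target.shape[0]) - target

# 假设的第二、第三权重矩阵(示例数据)
Wb = np.array([[0.2, 0.1],
               [0.0, 0.3]])
Wanc = np.array([[0.5, 0.0],
                 [0.1, 0.4]])

W2 = update_first_filter(Wb, Wanc)
print(W2)
```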
例如,如图5所示,使用第一滤波器401输出的第一目标波束自适应更新第二滤波器403,使用第一音频处理输出自适应更新第三滤波器404。然后使用更新后的第二滤波器403和第三滤波器404更新第一滤波器401。
综上所述,本申请提供的音频信号处理方法,通过根据第二滤波器和第三滤波器,更新第一滤波器,使第一滤波器、第二滤波器和第三滤波器可以实时跟踪目标声源的导向矢量变化,及时更新滤波器,使用实时更新的滤波器来处理下一次麦克风采集到的音频信号,使滤波器根据场景的变化来输出音频处理输出,保证在干扰移动情形下的滤波器的跟踪性能,减小干扰泄露问题。
本申请提供的音频信号处理方法,通过使用每一次处理后的数据对第一滤波器、第二滤波器、第三滤波器进行实时更新,使滤波器能够实时根据目标声源的导向矢量变化而变化,使滤波器可以适用于干扰噪声不断变化的场景,保证在干扰移动情形下的滤波器的跟踪性能,减小干扰泄露问题。
请参考图7,其示出了本申请一个示例性实施例提供的一种音频信号处理方法的流程图,该方法可以应用于图1所示的音频信号处理系统中,该方法由音频信号处理设备执行。如图7所示,该方法可以包括以下步骤:
步骤501,获取麦克风阵列中不同麦克风采集的音频信号,麦克风阵列包括n个目标方向,每个目标方向分别对应一个滤波器组,滤波器组采用上述任一的方法处理音频信号,n是大于1的正整数。
示例性的,麦克风矩阵可以设置多个目标方向,目标方向的个数可以是任意的。示例性的,根据每个目标方向分别训练得到一个滤波器组,该滤波器采用图4所示的方法处理音频信号。示例性的,该滤波器组可以是图5或图6中所示出的滤波器组中的任意一种。示例性的,不同目标方向对应的滤波器组不同。示例性的,将目标方向上的音频信号作为目标语音训练得到该目标方向对应的滤波器组。
例如,如图8所示,该麦克风阵列设置了四个目标方向,四个目标方向对应有四个滤波器组:GSC_1、GSC_2、GSC_3、GSC_4。每个目标方向对应一个滤波器组。
示例性的,滤波器组包括:第一滤波器、第二滤波器、第三滤波器;或,预滤波器、第一滤波器、第二滤波器、第三滤波器。当第i个滤波器组包括预滤波器时,预滤波器是使用麦克风阵列采集到的第i个目标方向上的训练数据训练得到的。
步骤502,针对n个目标方向对应的音频信号,分别使用对应的滤波器组对音频信号进行滤波,得到n个目标方向对应的n个第一音频处理输出。
例如,如图8所示,以四个目标方向为例,将音频信号组成的音频信号矩阵X_W分别输入四个滤波器组得到四个目标方向分别对应的第一音频处理输出Y_1、Y_2、Y_3、Y_4。示例性的,每个滤波器组在得到滤波结果后,会根据滤波结果对滤波器组中的第一滤波器、第二滤波器、第三滤波器进行实时更新。
步骤503,根据除第i个第一音频处理输出之外的n-1个第一音频处理输出对第i个第一音频处理输出进行滤波,得到第i个目标方向对应的第i个第二音频处理输出,i为大于0且小于n的正整数;重复该步骤得到n个目标方向分别对应的第二音频处理输出。
示例性的,对于第i个目标方向,第i个第一音频处理输出是目标语音,其他目标方向上的第一音频处理输出是干扰语音。示例性的,当第i个目标方向上的音频信号是目标语音时,其他目标方向的音频信号即为干扰信号,将第i个目标方向对应的第i个第一音频处理输出作为目标波束,将其他目标方向对应的n-1个第一音频处理输出作为干扰波束,将n-1个第一音频处理输出经过第i个第四滤波器进行滤波得到第三干扰波束,用第三干扰波束对第i个第一音频处理输出进行滤波,来提高输出的第i个目标方向的音频处理结果的准确度。
示例性的,将除第i个第一音频处理输出之外的n-1个第一音频处理输出确定为第i个干扰组,i为大于0且小于n的正整数;通过第i个目标方向对应的第i个第四滤波器对干扰组进行滤波得到第i个第三干扰波束,第四滤波器用于加权调整干扰组;将第i个第一音频处理输出与第i个第三干扰波束之差确定为第i个第二音频处理输出;根据第i个第二音频处理输出自适应更新第i个第四滤波器。
示例性的,第i个第四滤波器与第i个目标方向相对应。
例如,如图8所示,以四个目标方向为例,将第1目标方向作为目标语音的方向,则将第2目标方向、第3目标方向、第4目标方向的第一语音处理输出Y_2、Y_3、Y_4作为第1个干扰组,输入第1个第四滤波器601得到第1个第三干扰波束,用第1个第一音频处理输出Y_1减去第1个第三干扰波束得到第1个第二音频处理输出Z_1。利用第1个第二音频处理输出Z_1自适应更新第1个第四滤波器601。
例如,如图9所示,以四个目标方向为例,将第2目标方向作为目标语音的方向,则将第1目标方向、第3目标方向、第4目标方向的第一语音处理输出Y_1、Y_3、Y_4作为第2个干扰组,输入第2个第四滤波器602得到第2个第三干扰波束,用第2个第一音频处理输出Y_2减去第2个第三干扰波束得到第2个第二音频处理输出Z_2。利用第2个第二音频处理输出Z_2自适应更新第2个第四滤波器602。
例如,如图10所示,以四个目标方向为例,将第3目标方向作为目标语音的方向,则将第1目标方向、第2目标方向、第4目标方向的第一语音处理输出Y_1、Y_2、Y_4作为第3个干扰组,输入第3个第四滤波器603得到第3个第三干扰波束,用第3个第一音频处理输出Y_3减去第3个第三干扰波束得到第3个第二音频处理输出Z_3。利用第3个第二音频处理输出Z_3自适应更新第3个第四滤波器603。
例如,如图11所示,以四个目标方向为例,将第4目标方向作为目标语音的方向,则将第1目标方向、第2目标方向、第3目标方向的第一语音处理输出Y_1、Y_2、Y_3作为第4个干扰组,输入第4个第四滤波器604得到第4个第三干扰波束,用第4个第一音频处理输出Y_4减去第4个第三干扰波束得到第4个第二音频处理输出Z_4。利用第4个第二音频处理输出Z_4自适应更新第4个第四滤波器604。
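上述多方向第二级滤波的流程可以用如下Python(NumPy)代码示意,其中各方向的第一音频处理输出Y与第四滤波器权重W4均为假设的示例数据:

```python
import numpy as np

def second_stage(Y, W4):
    """n个方向的第二级滤波(示意):
    对第i个方向,用其余n-1个第一音频处理输出(第i个干扰组)经
    第i个第四滤波器加权得到第三干扰波束,再从Y[i]中减去,
    得到第二音频处理输出Z[i]。
    Y:  形如(n,)的各方向第一音频处理输出
    W4: 形如(n, n-1)的各方向第四滤波器权重"""
    n = len(Y)
    Z = np.empty(n)
    for i in range(n):
        others = np.delete(Y, i)       # 第i个干扰组
        Z[i] = Y[i] - W4[i] @ others   # 减去第i个第三干扰波束
    return Z

Y = np.array([1.0, 0.2, -0.1, 0.3])    # 四个方向的第一音频处理输出(示例)
W4 = np.full((4, 3), 0.1)              # 假设的第四滤波器权重
Z = second_stage(Y, W4)
print(Z.shape)  # (4,)
```

实际系统中,W4的每一行会依据对应的第二音频处理输出自适应更新,这里只演示前向计算。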
综上所述,本申请提供的音频信号处理方法,通过对采集到的音频信号在多个目标方向上进行音频处理得到多个目标方向分别对应的多个音频处理输出,使用其他方向的音频处理输出来去除本方向的音频处理输出中的干扰,提高本方向音频处理输出的精准度。
示例性的,给出一种将上述音频信号处理方法应用在车载语音识别场景中的示例性实施例。
在车载语音识别场景中,在车辆的主驾驶位、副驾驶位、两个乘客位分别设置有麦克风,组成麦克风阵列,用于采集驾驶员或乘客发出的语音交互指令。当麦克风阵列采集到音频信号后,采用图4或图7的方法对音频信号进行滤波,得到第一音频处理输出或第二音频处理输出,并使用语音识别算法对第一音频处理输出或第二音频处理输出进行语音识别或语义识别,从而识别驾驶员或乘客发出的语音交互指令,从而使车载计算机系统根据语音交互指令进行响应。
示例性的,根据主驾驶位、副驾驶位、两个乘客位在车辆内的位置分布,确定四个目标方向,四个目标方向分别用于接收主驾驶位上的驾驶员的语音交互指令,以及分别坐在副驾驶位、乘客位的乘客的语音交互指令。在麦克风阵列采集到音频信号后,采用图4或图7的方法对音频信号进行滤波,分别以不同目标方向作为目标语音进行滤波得到四个目标方向分别对应的音频处理输出,音频处理输出强化了选定的目标方向上的音频信号,抑制了其他目标方向的干扰,从而提高音频处理输出的准确度,便于语音识别算法识别信号中的语音指令。
示例性的,如图12中的(1)所示,是将麦克风分别设置在主驾驶位和副驾驶位采集到的双通道语谱,其中上方为主驾驶位的语谱,下方为副驾驶位的语谱。如图12中的(2)所示,是使用本申请提供的预滤波器对采集到的音频信号进行滤波得到的双通道语谱,对比(1)和(2)可以清晰地看出,经过数据训练的预滤波器实现了对语音的空间滤波作用,两个通道的干扰都有很大程度的降低。图12中的(3)是对音频信号采用数据预滤波器结合传统GSC处理得到的双通道语谱图,与(2)相比,(3)的干扰泄露更少。如图13中的(1)所示,是采用图7所示的音频信号处理方法(全盲GSC结构)处理音频信号得到的双通道语谱图,相比图12中的(3)进一步减小了语音泄露。这是因为实验中分离声源中的左声道是移动声源,图12中的(3)显示传统GSC结构不能很好地跟踪移动声源的变化;图13中的(1)虽然没有采用数据相关的预滤波器,但能够很好地跟踪移动声源的变化,因此具有更好的对干扰语音的抑制能力。图13中的(2)是采用图4所示的音频信号处理方法处理音频信号得到的双通道语谱图,采用预滤波器结合全盲的GSC结构对音频信号进行滤波,同时结合了数据相关的预滤波器和移动干扰声源的跟踪能力,具有最佳效果。
请参考图14,其示出了本申请一个示例性实施例提供的一种音频信号处理装置的方框图,该装置用以执行上述图4所示实施例的方法的全部或部分步骤,如图14所示,该装置可以包括:
第一获取模块701,用于获取麦克风阵列中不同麦克风采集的音频信号;
第一滤波模块702,用于通过第一滤波器对所述音频信号进行滤波得到第一目标波束,所述第一滤波器用于抑制所述音频信号中的干扰语音且增强所述音频信号中的目标语音;
第二滤波模块703,用于通过第二滤波器对所述音频信号进行滤波得到第一干扰波束,所述第二滤波器用于抑制所述目标语音且增强所述干扰语音;
第三滤波模块704,用于通过第三滤波器获取所述第一干扰波束的第二干扰波束,所述第三滤波器用于加权调整所述第一干扰波束;
第一确定模块705,用于将所述第一目标波束与所述第二干扰波束之差确定为第一音频处理输出;
第一更新模块706,用于自适应更新所述第二滤波器和所述第三滤波器中的至少一个,在更新完成后根据所述第二滤波器和所述第三滤波器更新所述第一滤波器。
在一种可能的实现方式中,所述第一滤波器对应有第一权重矩阵,所述第二滤波器对应有第二权重矩阵,所述第三滤波器对应有第三权重矩阵;
所述第一更新模块706,还用于在更新完成后,根据所述第二权重矩阵和所述第三权重矩阵,计算得到所述第一权重矩阵;
所述第一更新模块706,还用于根据所述第一权重矩阵更新所述第一滤波器。
在一种可能的实现方式中,所述第一更新模块706,还用于在更新完成后,将所述第二权重矩阵与所述第三权重矩阵之积确定为目标矩阵;将单位矩阵与所述目标矩阵之差确定为所述第一权重矩阵。
在一种可能的实现方式中,所述第一更新模块706,还用于:
根据所述第一目标波束更新所述第二滤波器,根据所述第一音频处理输出更新所述第三滤波器;或,根据所述第一音频处理输出更新所述第二滤波器和所述第三滤波器;或,根据所述第一目标波束更新所述第二滤波器;或,根据所述第一音频处理输出更新所述第二滤波器;或,根据所述第一音频处理输出更新所述第三滤波器。
在一种可能的实现方式中,所述装置还包括:
预滤波模块707,用于通过预滤波器对所述音频信号进行第一滤波得到预目标波束,所述预滤波器是使用训练数据计算得到的滤波器,所述预滤波器用于抑制所述干扰语音且增强所述目标语音;
所述第一滤波模块702,还用于通过所述第一滤波器对所述预目标波束进行第二滤波,得到所述第一目标波束。
在一种可能的实现方式中,所述装置还包括:
所述第一获取模块701,还用于获取所述麦克风阵列在应用环境中采集的训练数据,所述应用环境是所述麦克风阵列被放置使用的空间范围,所述训练数据包括所述麦克风阵列中不同麦克风采集的样本音频信号;
计算模块708,用于根据线性约束最小方差LCMV准则计算所述训练数据得到所述预滤波器。
请参考图15,其示出了本申请一个示例性实施例提供的一种音频信号处理装置的方框图,该装置用以执行上述图7所示实施例的方法的全部或部分步骤,如图15所示,该装置可以包括:
第二获取模块801,用于获取麦克风阵列中不同麦克风采集的音频信号,所述麦克风阵列包括n个目标方向,每个所述目标方向分别对应一个滤波器组,所述滤波器组采用图4所示实施例中任一所述的方法处理所述音频信号,所述n是大于1的正整数;
滤波器组模块802,用于针对n个所述目标方向对应的音频信号,分别使用对应的所述滤波器组对所述音频信号进行滤波,得到n个所述目标方向对应的n个第一音频处理输出;
第四滤波模块803,用于根据除第i个所述第一音频处理输出之外的n-1个所述第一音频处理输出对第i个所述第一音频处理输出进行滤波,得到第i个所述目标方向对应的第i个第二音频处理输出,所述i为大于0且小于所述n的正整数;重复该步骤得到n个所述目标方向分别对应的第二音频处理输出。
在一种可能的实现方式中,所述装置还包括:
所述第四滤波模块803,还用于将除第i个所述第一音频处理输出之外的n-1个所述第一音频处理输出确定为第i个干扰组;
所述第四滤波模块803,还用于通过第i个所述目标方向对应的第i个第四滤波器对第i个所述干扰组进行滤波得到第i个第三干扰波束,所述第四滤波器用于加权调整所述干扰组;
第二确定模块804,用于将第i个所述第一音频处理输出与第i个所述第三干扰波束之差确定为第i个所述第二音频处理输出;
第二更新模块805,用于根据第i个所述第二音频处理输出自适应更新第i个所述第四滤波器。
在一种可能的实现方式中,所述第i个滤波器组包括预滤波器,所述预滤波器是使用所述麦克风阵列采集到的第i个所述目标方向上的训练数据训练得到的。
图16是根据一示例性实施例示出的计算机设备的结构框图。该计算机设备可以实现为本申请上述方案中的音频信号处理设备。所述计算机设备900包括中央处理单元(Central Processing Unit,CPU)901、包括随机存取存储器(Random Access Memory,RAM)902和只读存储器(Read-Only Memory,ROM)903的系统存储器904,以及连接系统存储器904和中央处理单元901的系统总线905。所述计算机设备900还包括帮助计算机内的各个器件之间传输信息的基本输入/输出系统(Input/Output系统,I/O系统)906,和用于存储操作系统913、应用程序914和其他程序模块915的大容量存储设备907。
所述基本输入/输出系统906包括有用于显示信息的显示器908和用于用户输入信息的诸如鼠标、键盘之类的输入设备909。其中所述显示器908和输入设备909都通过连接到系统总线905的输入输出控制器910连接到中央处理单元901。所述基本输入/输出系统906还可以包括输入输出控制器910以用于接收和处理来自键盘、鼠标、或电子触控笔等多个其他设备的输入。类似地,输入输出控制器910还提供输出到显示屏、打印机或其他类型的输出设备。
根据本申请的各种实施例,所述计算机设备900还可以通过诸如因特网等网络连接到网络上的远程计算机运行。也即计算机设备900可以通过连接在所述系统总线905上的网络接口单元911连接到网络912,或者说,也可以使用网络接口单元911来连接到其他类型的网络或远程计算机系统(未示出)。
所述存储器还包括一个或者一个以上的程序,所述一个或者一个以上程序存储于存储器中,中央处理器901通过执行该一个或一个以上程序来实现图4或图7所示的方法中的全部或者部分步骤。
本申请实施例还提供了一种计算机可读存储介质,用于储存为上述计算机设备所用的计算机软件指令,其包含用于执行上述音频信号处理方法所设计的程序。例如,该计算机可读存储介质可以是ROM、RAM、CD-ROM、磁带、软盘和光数据存储设备等。
本申请实施例还提供了一种计算机可读存储介质,该存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现如上文介绍的音频信号处理方法的全部或者部分步骤。
本申请实施例还提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行上述可选实现方式中提供的音频信号处理方法。
本领域技术人员在考虑说明书及实践这里公开的发明后,将容易想到本申请的其它实施方案。本申请旨在涵盖本申请的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本申请的一般性原理并包括本申请未公开的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的,本申请的真正范围和精神由下面的权利要求指出。
应当理解的是,本申请并不局限于上面已经描述并在附图中示出的精确结构,并且可以在不脱离其范围进行各种修改和改变。本申请的范围仅由所附的权利要求来限制。

Claims (14)

  1. 一种音频信号处理方法,所述方法由音频信号处理设备执行,所述方法包括:
    获取麦克风阵列中不同麦克风采集的音频信号;
    通过第一滤波器对所述音频信号进行滤波得到第一目标波束,所述第一滤波器用于抑制所述音频信号中的干扰语音且增强所述音频信号中的目标语音;
    通过第二滤波器对所述音频信号进行滤波得到第一干扰波束,所述第二滤波器用于抑制所述目标语音且增强所述干扰语音;
    通过第三滤波器获取所述第一干扰波束的第二干扰波束,所述第三滤波器用于加权调整所述第一干扰波束;
    将所述第一目标波束与所述第二干扰波束之差确定为第一音频处理输出;
    自适应更新所述第二滤波器和所述第三滤波器中的至少一个,在更新完成后根据所述第二滤波器和所述第三滤波器更新所述第一滤波器。
  2. 根据权利要求1所述的方法,所述第一滤波器对应有第一权重矩阵,所述第二滤波器对应有第二权重矩阵,所述第三滤波器对应有第三权重矩阵;
    所述在更新完成后根据所述第二滤波器和所述第三滤波器更新所述第一滤波器,包括:
    在更新完成后,根据所述第二权重矩阵和所述第三权重矩阵,计算得到所述第一权重矩阵;
    根据所述第一权重矩阵更新所述第一滤波器。
  3. 根据权利要求2所述的方法,所述在更新完成后,根据所述第二权重矩阵和所述第三权重矩阵,计算得到所述第一权重矩阵,包括:
    在更新完成后,将所述第二权重矩阵与所述第三权重矩阵之积确定为目标矩阵;
    将单位矩阵与所述目标矩阵之差确定为所述第一权重矩阵。
  4. 根据权利要求1至3任一所述的方法,所述自适应更新所述第二滤波器和所述第三滤波器中的至少一个,包括:
    根据所述第一目标波束更新所述第二滤波器,根据所述第一音频处理输出更新所述第三滤波器;
    或,
    根据所述第一音频处理输出更新所述第二滤波器和所述第三滤波器;
    或,
    根据所述第一目标波束更新所述第二滤波器;
    或,
    根据所述第一音频处理输出更新所述第二滤波器;
    或,
    根据所述第一音频处理输出更新所述第三滤波器。
  5. 根据权利要求1至3任一所述的方法,所述通过第一滤波器对所述音频信号进行滤波得到第一目标波束,包括:
    通过预滤波器对所述音频信号进行第一滤波得到预目标波束,所述预滤波器是使用训练数据计算得到的滤波器,所述预滤波器用于抑制所述干扰语音且增强所述目标语音;
    通过所述第一滤波器对所述预目标波束进行第二滤波,得到所述第一目标波束。
  6. 根据权利要求5所述的方法,所述方法还包括:
    获取所述麦克风阵列在应用环境中采集的训练数据,所述应用环境是所述麦克风阵列被放置使用的空间范围,所述训练数据包括所述麦克风阵列中不同麦克风采集的样本音频信号;
    根据线性约束最小方差LCMV准则计算所述训练数据得到所述预滤波器。
  7. 一种音频信号处理方法,所述方法由音频信号处理设备执行,所述方法包括:
    获取麦克风阵列中不同麦克风采集的音频信号,所述麦克风阵列包括n个目标方向,每个所述目标方向分别对应一个滤波器组,所述滤波器组采用如权利要求1至6任一所述的方法处理所述音频信号,所述n是大于1的正整数;
    针对n个所述目标方向对应的音频信号,分别使用对应的所述滤波器组对所述音频信号进行滤波,得到n个所述目标方向对应的n个第一音频处理输出;
    根据除第i个所述第一音频处理输出之外的n-1个所述第一音频处理输出对第i个所述第一音频处理输出进行滤波,得到第i个所述目标方向对应的第i个第二音频处理输出,所述i为大于0且小于所述n的正整数;重复该步骤得到n个所述目标方向分别对应的第二音频处理输出。
  8. 根据权利要求7所述的方法,所述根据除第i个所述第一音频处理输出之外的n-1个所述第一音频处理输出对第i个所述第一音频处理输出进行滤波,得到第i个所述目标方向对应的第i个第二音频处理输出,包括:
    将除第i个所述第一音频处理输出之外的n-1个所述第一音频处理输出确定为第i个干扰组;
    通过第i个所述目标方向对应的第i个第四滤波器对所述第i个干扰组进行滤波得到第i个第三干扰波束,所述第四滤波器用于加权调整所述干扰组;
    将第i个所述第一音频处理输出与第i个所述第三干扰波束之差确定为第i个所述第二音频处理输出;
    根据第i个所述第二音频处理输出自适应更新第i个所述第四滤波器。
  9. 根据权利要求7或8所述的方法,所述第i个滤波器组包括预滤波器,所述预滤波器是使用所述麦克风阵列采集到的第i个所述目标方向上的训练数据训练得到的。
  10. 一种音频信号处理装置,所述装置部署在音频信号处理设备上,所述装置包括:
    第一获取模块,用于获取麦克风阵列中不同麦克风采集的音频信号;
    第一滤波模块,用于通过第一滤波器对所述音频信号进行滤波得到第一目标波束,所述第一滤波器用于抑制所述音频信号中的干扰语音且增强所述音频信号中的目标语音;
    第二滤波模块,用于通过第二滤波器对所述音频信号进行滤波得到第一干扰波束,所述第二滤波器用于抑制所述目标语音且增强所述干扰语音;
    第三滤波模块,用于通过第三滤波器获取所述第一干扰波束的第二干扰波束,所述第三滤波器用于加权调整所述第一干扰波束;
    第一确定模块,用于将所述第一目标波束与所述第二干扰波束之差确定为第一音频处理输出;
    第一更新模块,用于自适应更新所述第二滤波器和所述第三滤波器中的至少一个,在更新完成后根据所述第二滤波器和所述第三滤波器更新所述第一滤波器。
  11. 一种音频信号处理装置,所述装置部署在音频信号处理设备上,所述装置包括:
    第二获取模块,用于获取麦克风阵列中不同麦克风采集的音频信号,所述麦克风阵列包括n个目标方向,每个所述目标方向分别对应一个滤波器组,所述滤波器组采用如权利要求1至6任一所述的方法处理所述音频信号,所述n是大于1的正整数;
    滤波器组模块,用于针对n个所述目标方向对应的音频信号,分别使用对应的所述滤波器组对所述音频信号进行滤波,得到n个所述目标方向对应的n个第一音频处理输出;
    第四滤波模块,用于根据除第i个所述第一音频处理输出之外的n-1个所述第一音频处理输出对第i个所述第一音频处理输出进行滤波,得到第i个所述目标方向对应的第i个第二音频处理输出,所述i为大于0且小于所述n的正整数;重复该步骤得到n个所述目标方向分别对应的第二音频处理输出。
  12. 一种用于音频信号处理的计算机设备,所述计算机设备包含处理器和存储器,所述存储器中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现如权利要求1至9任一所述的音频信号处理方法。
  13. 一种计算机可读存储介质,所述存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由处理器加载并执行以实现如权利要求1至9任一所述的音频信号处理方法。
  14. 一种计算机程序产品,当所述计算机程序产品被执行时,用于执行权利要求1-9任一所述的音频信号处理方法。
PCT/CN2021/098085 2020-07-17 2021-06-03 音频信号处理方法、装置、设备及存储介质 WO2022012206A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2022538830A JP7326627B2 (ja) 2020-07-17 2021-06-03 オーディオ信号処理方法、装置、機器及びコンピュータプログラム
EP21842054.5A EP4092672A4 (en) 2020-07-17 2021-06-03 AUDIO SIGNAL PROCESSING METHOD, DEVICE, EQUIPMENT AND STORAGE MEDIUM
US17/741,285 US12009006B2 (en) 2020-07-17 2022-05-10 Audio signal processing method, apparatus and device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010693891.9 2020-07-17
CN202010693891.9A CN111798860B (zh) 2020-07-17 2020-07-17 音频信号处理方法、装置、设备及存储介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/741,285 Continuation US12009006B2 (en) 2020-07-17 2022-05-10 Audio signal processing method, apparatus and device, and storage medium

Publications (1)

Publication Number Publication Date
WO2022012206A1 true WO2022012206A1 (zh) 2022-01-20

Family

ID=72807727

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/098085 WO2022012206A1 (zh) 2020-07-17 2021-06-03 音频信号处理方法、装置、设备及存储介质

Country Status (5)

Country Link
US (1) US12009006B2 (zh)
EP (1) EP4092672A4 (zh)
JP (1) JP7326627B2 (zh)
CN (1) CN111798860B (zh)
WO (1) WO2022012206A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111798860B (zh) 2020-07-17 2022-08-23 腾讯科技(深圳)有限公司 音频信号处理方法、装置、设备及存储介质
CN112118511A (zh) * 2020-11-19 2020-12-22 北京声智科技有限公司 耳机降噪方法、装置、耳机及计算机可读存储介质
CN112634931B (zh) * 2020-12-22 2024-05-14 北京声智科技有限公司 语音增强方法及装置
CN112785998B (zh) * 2020-12-29 2022-11-15 展讯通信(上海)有限公司 信号处理方法、设备及装置
CN113113036B (zh) * 2021-03-12 2023-06-06 北京小米移动软件有限公司 音频信号处理方法及装置、终端及存储介质

Citations (5)

Publication number Priority date Publication date Assignee Title
US5353376A (en) * 1992-03-20 1994-10-04 Texas Instruments Incorporated System and method for improved speech acquisition for hands-free voice telecommunication in a noisy environment
CN1753084A (zh) * 2004-09-23 2006-03-29 哈曼贝克自动系统股份有限公司 使用噪声降低的多通道自适应语音信号处理
CN102664023A (zh) * 2012-04-26 2012-09-12 南京邮电大学 一种麦克风阵列语音增强的优化方法
CN102831898A (zh) * 2012-08-31 2012-12-19 厦门大学 带声源方向跟踪功能的麦克风阵列语音增强装置及其方法
CN111798860A (zh) * 2020-07-17 2020-10-20 腾讯科技(深圳)有限公司 音频信号处理方法、装置、设备及存储介质

Family Cites Families (17)

Publication number Priority date Publication date Assignee Title
US6034378A (en) * 1995-02-01 2000-03-07 Nikon Corporation Method of detecting position of mark on substrate, position detection apparatus using this method, and exposure apparatus using this position detection apparatus
EP1425738A2 (en) * 2001-09-12 2004-06-09 Bitwave Private Limited System and apparatus for speech communication and speech recognition
US7613310B2 (en) * 2003-08-27 2009-11-03 Sony Computer Entertainment Inc. Audio input system
US7426464B2 (en) * 2004-07-15 2008-09-16 Bitwave Pte Ltd. Signal processing apparatus and method for reducing noise and interference in speech communication and speech recognition
KR20070087533A (ko) * 2007-07-12 2007-08-28 조정권 적응 마이크로폰 어레이를 이용한 간섭 신호 제거 시스템의개발
CN101192411B (zh) * 2007-12-27 2010-06-02 北京中星微电子有限公司 大距离麦克风阵列噪声消除的方法和噪声消除系统
CN102509552B (zh) * 2011-10-21 2013-09-11 浙江大学 一种基于联合抑制的麦克风阵列语音增强方法
DE112012006780T5 (de) * 2012-08-06 2015-06-03 Mitsubishi Electric Corporation Strahlformungsvorrichtung
CN105489224B (zh) * 2014-09-15 2019-10-18 讯飞智元信息科技有限公司 一种基于麦克风阵列的语音降噪方法及系统
CN106910500B (zh) * 2016-12-23 2020-04-17 北京小鸟听听科技有限公司 对带麦克风阵列的设备进行语音控制的方法及设备
US10573301B2 (en) 2018-05-18 2020-02-25 Intel Corporation Neural network based time-frequency mask estimation and beamforming for speech pre-processing
CN110120217B (zh) * 2019-05-10 2023-11-24 腾讯科技(深圳)有限公司 一种音频数据处理方法及装置
CN110265054B (zh) * 2019-06-14 2024-01-30 深圳市腾讯网域计算机网络有限公司 语音信号处理方法、装置、计算机可读存储介质和计算机设备
CN110517702B (zh) * 2019-09-06 2022-10-04 腾讯科技(深圳)有限公司 信号生成的方法、基于人工智能的语音识别方法及装置
CN110706719B (zh) * 2019-11-14 2022-02-25 北京远鉴信息技术有限公司 一种语音提取方法、装置、电子设备及存储介质
CN110827847B (zh) * 2019-11-27 2022-10-18 添津人工智能通用应用系统(天津)有限公司 低信噪比见长的麦克风阵列语音去噪增强方法
CN111770379B (zh) 2020-07-10 2021-08-24 腾讯科技(深圳)有限公司 一种视频投放方法、装置及设备

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
US5353376A (en) * 1992-03-20 1994-10-04 Texas Instruments Incorporated System and method for improved speech acquisition for hands-free voice telecommunication in a noisy environment
CN1753084A (zh) * 2004-09-23 2006-03-29 哈曼贝克自动系统股份有限公司 使用噪声降低的多通道自适应语音信号处理
CN102664023A (zh) * 2012-04-26 2012-09-12 南京邮电大学 一种麦克风阵列语音增强的优化方法
CN102831898A (zh) * 2012-08-31 2012-12-19 厦门大学 带声源方向跟踪功能的麦克风阵列语音增强装置及其方法
CN111798860A (zh) * 2020-07-17 2020-10-20 腾讯科技(深圳)有限公司 音频信号处理方法、装置、设备及存储介质

Non-Patent Citations (1)

Title
See also references of EP4092672A4 *

Also Published As

Publication number Publication date
EP4092672A1 (en) 2022-11-23
US20220270631A1 (en) 2022-08-25
CN111798860B (zh) 2022-08-23
US12009006B2 (en) 2024-06-11
CN111798860A (zh) 2020-10-20
EP4092672A4 (en) 2023-09-13
JP2023508063A (ja) 2023-02-28
JP7326627B2 (ja) 2023-08-15

Similar Documents

Publication Publication Date Title
WO2022012206A1 (zh) 音频信号处理方法、装置、设备及存储介质
JP7434137B2 (ja) 音声認識方法、装置、機器及びコンピュータ読み取り可能な記憶媒体
CN110600017B (zh) 语音处理模型的训练方法、语音识别方法、系统及装置
Hoshen et al. Speech acoustic modeling from raw multichannel waveforms
US10123113B2 (en) Selective audio source enhancement
WO2020103703A1 (zh) 一种音频数据处理方法、装置、设备及存储介质
CN109887489B (zh) 基于生成对抗网络的深度特征的语音去混响方法
CN113436643B (zh) 语音增强模型的训练及应用方法、装置、设备及存储介质
JP2010519602A (ja) 信号分離のためのシステム、方法、および装置
CN109949821B (zh) 一种利用cnn的u-net结构进行远场语音去混响的方法
CN110473568B (zh) 场景识别方法、装置、存储介质及电子设备
CN116030823B (zh) 一种语音信号处理方法、装置、计算机设备及存储介质
CN113823273B (zh) 音频信号处理方法、装置、电子设备及存储介质
Janský et al. Auxiliary function-based algorithm for blind extraction of a moving speaker
WO2022256577A1 (en) A method of speech enhancement and a mobile computing device implementing the method
Sainath et al. Raw multichannel processing using deep neural networks
CN113707136B (zh) 服务型机器人语音交互的音视频混合语音前端处理方法
CN112466327B (zh) 语音处理方法、装置和电子设备
Yang et al. Guided speech enhancement network
CN112731291B (zh) 协同双通道时频掩码估计任务学习的双耳声源定位方法及系统
CN113035176B (zh) 语音数据处理方法、装置、计算机设备及存储介质
CN115620739A (zh) 指定方向的语音增强方法及电子设备和存储介质
CN114495909A (zh) 一种端到端的骨气导语音联合识别方法
Nakagome et al. Efficient and Stable Adversarial Learning Using Unpaired Data for Unsupervised Multichannel Speech Separation.
KR101022457B1 (ko) Casa 및 소프트 마스크 알고리즘을 이용한 단일채널 음성 분리방법

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21842054

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022538830

Country of ref document: JP

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 2021842054

Country of ref document: EP

Effective date: 20220818

NENP Non-entry into the national phase

Ref country code: DE