WO2021129196A1 - Voice signal processing method and device - Google Patents

Voice signal processing method and device

Info

Publication number
WO2021129196A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
signal
voice signal
external
collector
Prior art date
Application number
PCT/CN2020/127546
Other languages
English (en)
French (fr)
Inventor
张献春
钟金云
Original Assignee
荣耀终端有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 荣耀终端有限公司
Priority to US17/788,758 (published as US20230024984A1)
Priority to EP20907146.3A (published as EP4021008B1)
Publication of WO2021129196A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00 Details of transducers, loudspeakers or microphones
    • H04R1/10 Earpieces; Attachments therefor; Earphones; Monophonic headphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00 Details of transducers, loudspeakers or microphones
    • H04R1/10 Earpieces; Attachments therefor; Earphones; Monophonic headphones
    • H04R1/1016 Earpieces of the intra-aural type
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00 Details of transducers, loudspeakers or microphones
    • H04R1/10 Earpieces; Attachments therefor; Earphones; Monophonic headphones
    • H04R1/1083 Reduction of ambient noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02165 Two microphones, one receiving mainly the noise signal and the other one mainly the speech signal
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L21/0324 Details of processing therefor
    • G10L21/034 Automatic adjustment
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2201/00 Details of transducers, loudspeakers or microphones covered by H04R1/00 but not provided for in any of its subgroups
    • H04R2201/10 Details of earpieces, attachments therefor, earphones or monophonic headphones covered by H04R1/10 but not provided for in any of its subgroups
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2420/00 Details of connection covered by H04R, not provided for in its groups
    • H04R2420/07 Applications of wireless loudspeakers or wireless microphones

Definitions

  • This application relates to the field of signal processing technology and earphones, and in particular to a voice signal processing method and device.
  • Fig. 1 is a schematic diagram of an earphone in the prior art.
  • the earphone is provided with a noise microphone (MIC), which is represented as MIC1 in Fig. 1.
  • MIC1 noise microphone
  • The voice signal collected by MIC1 is filtered by a high-pass filter and a low-pass filter to retain the voice signal in a certain frequency band; the retained voice signal is then optimized by an equalizer (EQ) and output through the speaker.
  • ANC active noise cancellation
  • EQ equalizer
  • the technical solution of the present application provides a voice signal processing method and device, which are used to monitor environmental sound signals and improve the monitoring effect and user experience.
  • the technical solution of the present application provides a voice signal processing method, which is applied to a headset.
  • The headset includes at least one external voice collector, and the method includes: preprocessing the voice signal collected by the at least one external voice collector to obtain an external voice signal, where the preprocessing may include processing that improves the signal-to-noise ratio of the external voice signal, such as noise reduction or adjustment of amplitude or gain; extracting the environmental sound signal from the external voice signal, for example, extracting a siren, a broadcast, or a baby's cry from the external voice signal; and mixing the first voice signal and the environmental sound signal according to the amplitude and phase of the first voice signal and the environmental sound signal and the position of the at least one external voice collector, to obtain the target voice signal. The first voice signal may be a voice signal to be played that is transmitted to the headset by an electronic device connected to the headset, such as a song or a broadcast; or the first voice signal is a voice signal collected by a microphone of the headset, such as the user's call voice.
  • The external voice collector is located outside the user's ear canal when the user wears the headset, so preprocessing the voice signal collected by the at least one external voice collector yields the external voice signal. Extracting the environmental sound signal from the external voice signal yields the required environmental sound signal, and mixing the first voice signal and the environmental sound signal yields the target voice signal. When the target voice signal is played, the user can hear both a clear and natural first voice signal and the important environmental sound signals in the external environment. This realizes monitoring of the environmental sound and improves the monitoring effect and user experience.
  • Mixing the first voice signal and the environmental sound signal includes: adjusting at least one of the amplitude, phase, or output delay of the first voice signal; and/or adjusting at least one of the amplitude, phase, or output delay of the environmental sound signal; and fusing the adjusted first voice signal and the adjusted environmental sound signal into one voice signal.
  • In this way, the first voice signal heard by the user is clear and natural, and the environmental sound signal heard by the user does not cause problems such as harshness or inaudibility, thereby improving the quality of the voice signal and the user experience.
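The adjust-then-fuse step described above can be illustrated with a minimal Python sketch. The source does not specify how the adjustment or fusion is computed, so everything here is an assumption for illustration: the function names `adjust` and `mix` are hypothetical, the phase adjustment is reduced to a simple polarity inversion, the output delay is a whole number of samples, and fusion is plain sample-wise addition with clipping to the range [-1, 1].

```python
from typing import List


def adjust(signal: List[float], gain: float = 1.0,
           invert_phase: bool = False, delay: int = 0) -> List[float]:
    """Scale the amplitude, optionally flip polarity (a 180-degree phase
    shift), and delay the signal by a whole number of samples."""
    sign = -1.0 if invert_phase else 1.0
    return [0.0] * delay + [sign * gain * x for x in signal]


def mix(first: List[float], ambient: List[float]) -> List[float]:
    """Fuse two adjusted signals into one by sample-wise addition,
    zero-padding the shorter one and clipping the sum to [-1, 1]."""
    n = max(len(first), len(ambient))
    a = first + [0.0] * (n - len(first))
    b = ambient + [0.0] * (n - len(ambient))
    return [max(-1.0, min(1.0, x + y)) for x, y in zip(a, b)]


# Hypothetical sample values: attenuate the first voice signal slightly,
# boost the environmental sound and delay it by one sample, then fuse.
first_voice = [0.5, 0.5, 0.5, 0.5]
ambient_sound = [0.2, 0.4]
target = mix(adjust(first_voice, gain=0.8),
             adjust(ambient_sound, gain=1.5, delay=1))
```

A real implementation would likely apply frequency-dependent phase correction and fractional delays; the sketch only shows the order of operations (adjust each signal, then fuse).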
  • extracting the environmental sound signal in the external voice signal includes: performing coherence processing on the external voice signal and the sample voice signal to obtain the environmental sound signal.
  • The coherence processing of the external voice signal and the sample voice signal may include: determining the power spectral density of the external voice signal, determining the power spectral density of the sample voice signal, and determining the cross-spectral density of the external voice signal and the sample voice signal; determining the coherence coefficient of the external voice signal and the sample voice signal according to the power spectral densities and the cross-spectral density; and then determining the environmental sound signal according to the coherence coefficient.
  • The voice signal in the external voice signal whose coherence coefficient is equal to or close to 1 can be determined to be the environmental sound signal.
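The source does not give a formula for the coherence coefficient. One common definition that matches the quantities named above is the magnitude-squared coherence, shown here as an assumption rather than as the patent's own definition. The symbols are introduced for illustration: $x$ is the external voice signal, $y$ is the sample voice signal, $P_{xx}(f)$ and $P_{yy}(f)$ are their power spectral densities, and $P_{xy}(f)$ is their cross-spectral density.

```latex
C_{xy}(f) = \frac{\left| P_{xy}(f) \right|^{2}}{P_{xx}(f)\, P_{yy}(f)},
\qquad 0 \le C_{xy}(f) \le 1
```

Under this definition, frequency components where $C_{xy}(f)$ is equal to or close to 1 are those in which the external voice signal closely tracks the sample voice signal, which is consistent with treating them as the environmental sound signal.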
  • the provided method for extracting the environmental sound signal has high accuracy, and the obtained environmental sound signal has a high signal-to-noise ratio.
  • The at least one external voice collector includes at least two external voice collectors, and extracting the environmental sound signal from the external voice signal then includes: performing coherence processing on the external voice signals corresponding to the at least two external voice collectors to obtain the environmental sound signal.
  • the external voice signal corresponding to each external voice collector refers to the external voice signal obtained after preprocessing the voice signal collected by the external voice collector.
  • the headset further includes an ear canal voice collector
  • the method further includes: preprocessing the voice signal collected by the ear canal voice collector to obtain the first voice signal.
  • The first voice signal may include only the user's voice signal (for example, the user's self-voice signal), or may include both the user's voice signal and the environmental sound signal.
  • Mixing the first voice signal and the environmental sound signal includes: mixing the first voice signal and the environmental sound signal according to the amplitude and phase of the first voice signal and the environmental sound signal, and the positions of the at least one external voice collector and the ear canal voice collector.
  • the amplitude of the environmental sound signal is increased to the preset amplitude threshold, and the output delay of the environmental sound signal is adjusted; for another example, when the position of the at least one external voice collector is position 2, and the time difference corresponding to adjacent amplitudes of the first voice signal and the environmental sound signal is less than a certain time difference threshold, the environmental sound signal is widened and the output delay is set.
  • The first voice signal is obtained by preprocessing the voice signal collected by the ear canal voice collector, so that the user can hear a clear and natural self-voice signal, such as a call voice signal, when the target voice signal is played, thereby improving the call quality.
  • Preprocessing the voice signal collected by the ear canal voice collector includes performing at least one of the following on that voice signal: amplitude adjustment, gain enhancement, echo cancellation, or noise suppression.
  • the first voice signal collected by the ear canal voice collector may have small amplitude and low gain, and there may also be various noises such as echo signals or environmental noise in the voice signal.
  • the noise signal in the voice signal can be effectively reduced, and the signal-to-noise ratio can be improved.
  • the ear canal voice collector includes at least one of an ear canal microphone or an ear bone pattern sensor. In the foregoing possible implementation manners, the use diversity and flexibility of the ear canal voice collector are improved.
  • Preprocessing the voice signal collected by the at least one external voice collector includes performing at least one of the following on that voice signal: amplitude adjustment, gain enhancement, echo cancellation, or noise suppression.
  • the voice signal collected by at least one external voice collector may have small amplitude and low gain, and various noise signals such as echo signals and environmental noise may also exist in the voice signal.
  • The method further includes: performing at least one of the following on the target voice signal and then outputting it: noise suppression, equalization processing, data packet loss compensation, automatic gain control, or dynamic range adjustment.
  • new noise signals may be generated during the processing of the voice signal, and data packet loss may occur during the transmission process.
  • the at least one external voice collector includes: a call microphone or a noise reduction microphone.
  • Mixing the first voice signal and the environmental sound signal includes: according to the positions of the ear canal microphone and the call microphone, and the amplitude difference and/or phase difference of the same environmental sound signal collected by the ear canal microphone and the call microphone, The distance between the sound source corresponding to the environmental sound signal and the user is determined, and at least one of the amplitude, phase, or output delay of the environmental sound signal and/or the first voice signal is adjusted based on the distance.
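The source states that the distance is determined from the microphone positions and the amplitude and/or phase difference, but does not say how. The following Python sketch shows one simplified possibility based on the amplitude difference only. It rests on strong assumptions that are not in the source: a point source in free field whose amplitude falls off as 1/r, a source that is collinear with the two microphones, and a known microphone spacing. The function names, the gain policy, and all numeric defaults are hypothetical.

```python
def estimate_source_distance(amp_near: float, amp_far: float,
                             mic_spacing_m: float) -> float:
    """Estimate the distance (in metres) from the source to the nearer
    microphone, assuming a 1/r amplitude law and a source in line with
    both microphones (illustrative assumptions only)."""
    if amp_far <= 0.0 or amp_near <= amp_far:
        raise ValueError("expected amp_near > amp_far > 0")
    ratio = amp_near / amp_far  # equals (r + d) / r under the 1/r law
    return mic_spacing_m / (ratio - 1.0)


def ambient_gain_for_distance(distance_m: float, near_m: float = 1.0,
                              far_m: float = 10.0,
                              max_gain: float = 2.0) -> float:
    """Hypothetical policy: boost environmental sounds from nearby
    sources, leave distant ones unchanged, interpolate in between."""
    if distance_m <= near_m:
        return max_gain
    if distance_m >= far_m:
        return 1.0
    t = (distance_m - near_m) / (far_m - near_m)
    return max_gain + t * (1.0 - max_gain)


# Hypothetical measurement: a 1 % amplitude difference across 2 cm.
distance = estimate_source_distance(1.01, 1.0, 0.02)
gain = ambient_gain_for_distance(distance)
```

In a real headset the ear canal microphone sits behind the earpiece, so the free-field assumption would not hold without calibration; the sketch only illustrates the idea of mapping an inter-microphone difference to a distance and then to an adjustment.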
  • The technical solution of the present application provides a voice signal processing device, which includes at least one external voice collector and further includes a processing unit configured to preprocess the voice signal collected by the at least one external voice collector to obtain an external voice signal.
  • The preprocessing may include processing that improves the signal-to-noise ratio of the external voice signal, such as noise reduction, amplitude adjustment, or gain processing.
  • The processing unit is further configured to extract the environmental sound signal from the external voice signal, for example, to extract a siren, a broadcast, or a baby's cry from the external voice signal.
  • The processing unit is further configured to mix the first voice signal and the environmental sound signal according to the amplitude and phase of the first voice signal and the environmental sound signal and the position of the at least one external voice collector, to obtain the target voice signal.
  • The first voice signal may be a voice signal to be played that is transmitted to the earphone by an electronic device connected to the earphone, such as a song or a broadcast; or the first voice signal is a voice signal collected by the microphone of the earphone.
  • The processing unit is specifically configured to: adjust at least one of the amplitude, phase, or output delay of the first voice signal; and/or adjust at least one of the amplitude, phase, or output delay of the environmental sound signal; and fuse the adjusted first voice signal and the adjusted environmental sound signal into one voice signal.
  • the processing unit is further specifically configured to perform coherence processing on the external voice signal and the sample voice signal to obtain the environmental sound signal.
  • The at least one external voice collector includes at least two external voice collectors; the processing unit is further specifically configured to perform coherence processing on the external voice signals corresponding to the at least two external voice collectors to obtain the environmental sound signal.
  • the external voice signal corresponding to each external voice collector refers to the external voice signal obtained after preprocessing the voice signal collected by the external voice collector.
  • The processing unit is specifically configured to: determine the power spectral density of the external voice signal, determine the power spectral density of the sample voice signal, and determine the cross-spectral density of the external voice signal and the sample voice signal; determine the coherence coefficient of the external voice signal and the sample voice signal according to the power spectral densities and the cross-spectral density; and then determine the environmental sound signal according to the coherence coefficient.
  • The voice signal in the external voice signal whose coherence coefficient is equal to or close to 1 can be determined to be the environmental sound signal.
  • the headset further includes an ear canal voice collector
  • The processing unit is further configured to preprocess the voice signal collected by the ear canal voice collector to obtain the first voice signal.
  • Correspondingly, the processing unit is also specifically configured to mix the first voice signal and the environmental sound signal according to the amplitude and phase of the first voice signal and the environmental sound signal, and the positions of the at least one external voice collector and the ear canal voice collector.
  • the amplitude of the environmental sound signal is increased to the preset amplitude threshold, and the output delay of the environmental sound signal is adjusted; for another example, when the position of the at least one external voice collector is position 2, and the time difference corresponding to adjacent amplitudes of the first voice signal and the environmental sound signal is less than a certain time difference threshold, the environmental sound signal is widened and the output delay is set.
  • the processing unit is further configured to: perform at least one of the following processing on the voice signal collected by the ear canal voice collector: amplitude adjustment, gain enhancement, echo cancellation or noise suppression.
  • the ear canal voice collector includes at least one of an ear canal microphone or an ear bone pattern sensor.
  • The processing unit is further configured to perform at least one of the following on the voice signal collected by the at least one external voice collector: amplitude adjustment, gain enhancement, echo cancellation, or noise suppression.
  • The processing unit is further configured to perform at least one of the following on the target voice signal and then output it: noise suppression, equalization processing, data packet loss compensation, automatic gain control, or dynamic range adjustment.
  • the at least one external voice collector includes: a call microphone or a noise reduction microphone.
  • The processing unit is specifically configured to: determine the distance between the sound source corresponding to the environmental sound signal and the user according to the positions of the ear canal microphone and the call microphone, and the amplitude difference and/or phase difference of the same environmental sound signal collected by the ear canal microphone and the call microphone; and then adjust, based on the distance, at least one of the amplitude, phase, or output delay of the environmental sound signal and/or the first voice signal.
  • the voice signal processing device is an earphone.
  • the earphone may be a wireless earphone or a wired earphone
  • the wireless earphone may be a Bluetooth earphone, a WiFi earphone, or an infrared earphone.
  • A computer-readable storage medium is provided, which stores instructions. When the instructions run on a device, the device executes the voice signal processing method provided by the first aspect or any possible implementation of the first aspect.
  • A computer program product is provided. When the computer program product runs on a device, the device executes the voice signal processing method provided by the first aspect or any possible implementation of the first aspect.
  • Any device, computer storage medium, or computer program product provided above is used to execute the corresponding voice signal processing method provided above. For the beneficial effects that can be achieved, refer to the beneficial effects of the corresponding method; they are not repeated here.
  • Figure 1 is a schematic diagram of the layout of a microphone in a headset
  • FIG. 2 is a schematic diagram of the layout of a voice collector in a headset provided by an embodiment of the application;
  • FIG. 3 is a schematic flowchart of a signal processing method provided by an embodiment of the application.
  • FIG. 5 is a schematic structural diagram of a voice signal processing device provided by an embodiment of this application.
  • FIG. 6 is a schematic structural diagram of another voice signal processing apparatus provided by an embodiment of the application.
  • “At least one” refers to one or more, and “multiple” refers to two or more.
  • “And/or” describes the association relationship of the associated objects and indicates that three relationships are possible. For example, A and/or B can mean: A exists alone, A and B exist at the same time, or B exists alone, where A and B can each be singular or plural.
  • The character “/” generally indicates that the associated objects before and after it are in an “or” relationship.
  • “At least one of the following items” or similar expressions refer to any combination of these items, including a single item or any combination of multiple items.
  • For example, at least one of a, b, or c can mean: a, b, c, a and b, a and c, b and c, or a, b and c, where each of a, b, and c can be single or multiple.
  • words such as “first” and “second” do not limit the number and execution order.
  • FIG. 2 is a schematic diagram of the layout of a voice collector in a headset provided by an embodiment of the application.
  • At least two voice collectors can be provided on the headset, and each voice collector can be used to collect voice signals; for example, each voice collector can be a microphone or a sound sensor.
  • the at least two voice collectors may include an ear canal voice collector and an external voice collector.
  • The ear canal voice collector may refer to a voice collector located in the user's ear canal when the user wears the headset, and the external voice collector may refer to a voice collector located outside the user's ear canal when the user wears the headset.
  • The following description takes the case where the at least two voice collectors include three voice collectors as an example.
  • MIC1 and MIC2 are external voice collectors.
  • When the user wears the headset, MIC1 is close to the wearer's ear and MIC2 is close to the wearer's mouth; MIC3 is the ear canal voice collector.
  • When the user wears the headset, MIC3 is in the wearer's ear canal.
  • MIC1 can be a noise reduction microphone or a feedforward microphone
  • MIC2 can be a call microphone
  • MIC3 can be an ear canal microphone or an ear bone pattern sensor.
  • the headset can be used in conjunction with various electronic devices such as mobile phones, notebook computers, computers, watches, etc. through wired or wireless connections to process audio services such as media and calls of the electronic devices.
  • The audio service may include call services, in which the headset plays the peer's voice data to the user or collects the user's voice data and sends it to the peer, in scenarios such as phone calls, WeChat voice messages, audio calls, video calls, games, and voice assistants. It can also include media services, such as playing music, recordings, sound in video files, background music in games, and incoming call notification sounds for the user.
  • the headset may be a wireless headset, and the wireless headset may be a Bluetooth headset, a WiFi headset, an infrared headset, or the like.
  • the earphone may be a neck-worn earphone, a headphone, or an ear-worn earphone.
  • the earphone may also include a processing circuit and a speaker, and at least two voice collectors and speakers are connected to the processing circuit.
  • the processing circuit can be used to receive and process the voice signals collected by at least two voice collectors, for example, perform noise reduction processing on the voice signals collected by the voice collectors.
  • the speaker can be used to receive audio data transmitted by the processing circuit and play audio data for the user. For example, the voice data of the other party is played to the user during the user's call through the mobile phone, or the audio data on the mobile phone is played to the user.
  • the processing circuit and speaker are not shown in FIG. 2.
  • the processing circuit may include a central processing unit, a general-purpose processor, a digital signal processor (digital signal processor, DSP), a microcontroller or a microprocessor, etc.
  • the processing circuit may further include other hardware circuits or accelerators, such as application specific integrated circuits, field programmable gate arrays or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof. It can implement or execute various exemplary logical blocks, modules, and circuits described in conjunction with the disclosure of this application.
  • the processing circuit may also be a combination of computing functions, for example, a combination of one or more microprocessors, a combination of a digital signal processor and a microprocessor, and so on.
  • FIG. 3 is a schematic flowchart of a voice signal processing method provided by an embodiment of the application. The method may be applied to the headset shown in FIG. 2 and may be specifically executed by a processing circuit in the headset. Referring to Figure 3, the method includes the following steps.
  • S301 Preprocess the voice signal collected by at least one external voice collector to obtain an external voice signal.
  • the at least one external voice collector may include one or more external voice collectors.
  • When the user wears the headset, the external voice collector is located outside the user's ear canal, and the voice signal outside the ear canal contains considerable interference and spans a wide frequency band.
  • at least one external voice collector may include a call microphone. When the user wears the headset, the call microphone is close to the user's mouth, so that it can be used to collect voice signals in the external environment.
  • At least one external voice collector can collect voice signals in the external environment, and the collected voice signals have the characteristics of large noise and wide frequency bands.
  • The frequency band can be a mid-to-high frequency band, for example, 100 Hz to 10 kHz.
  • At least one external voice collector can collect sirens, alarm bells, broadcast sounds, or the voices of surrounding people in the external environment; when the user uses the headset in an indoor environment, at least one external voice collector can collect doorbells, a baby's crying, or the voices of surrounding people in the indoor environment.
  • At least one external voice collector may transmit the collected voice signal to the processing circuit, and the processing circuit preprocesses the voice signal to remove part of the noise signal and obtain the external voice signal.
  • the call microphone can transmit the collected voice signal to the processing circuit, and the processing circuit removes part of the noise signal in the voice signal.
  • Preprocessing the voice signal collected by at least one external voice collector may include any one of the following four separate processing methods, or a combination of any two or more of them.
  • the four independent processing methods are introduced and explained below.
  • the first is to perform amplitude adjustment processing on the voice signal collected by at least one external voice collector.
  • Performing amplitude adjustment processing on the voice signal collected by the at least one external voice collector may include: increasing the amplitude of the voice signal or reducing the amplitude of the voice signal. By performing amplitude adjustment processing on the voice signal, the signal-to-noise ratio of the voice signal can be improved.
  • For example, the amplitude of the voice signal collected by at least one external voice collector may be relatively small. In that case, increasing the amplitude of the voice signal can increase its signal-to-noise ratio, which facilitates effective recognition of the amplitude of the voice signal in subsequent processing.
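As a rough illustration of amplitude adjustment, the following Python sketch rescales a block of samples so that its peak reaches a chosen level. The source does not say how the amplitude is adjusted, so the peak-normalization approach, the function name `adjust_amplitude`, and the `target_peak` parameter are assumptions made for this example.

```python
from typing import List


def adjust_amplitude(samples: List[float],
                     target_peak: float = 0.5) -> List[float]:
    """Uniformly scale a block of samples so its largest absolute value
    equals target_peak; a silent block is returned unchanged."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    scale = target_peak / peak
    return [s * scale for s in samples]


# Hypothetical weak input block raised to a peak of 0.5.
weak_block = [0.01, -0.02, 0.005]
adjusted = adjust_amplitude(weak_block, target_peak=0.5)
```

The same function also reduces the amplitude when the input peak exceeds `target_peak`, which covers both directions of adjustment mentioned above.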
  • the second method is to perform gain enhancement processing on the voice signal collected by at least one external voice collector.
  • Performing gain enhancement processing on the voice signal collected by at least one external voice collector may refer to amplifying the voice signal collected by at least one external voice collector.
  • the voice signal may include multiple voice signals in the external environment.
  • For example, the voice signal includes a voice signal corresponding to a whistle sound and wind noise.
  • Amplifying the voice signal then means amplifying both the voice signal corresponding to the whistle sound and the wind noise at the same time.
  • the gain of the voice signal collected by at least one external voice collector is relatively small, which may cause large errors in the subsequent processing.
  • the voice signal is subjected to gain enhancement processing, which can increase the gain of the voice signal, so as to effectively reduce the processing error of the voice signal in the subsequent processing.
  • the third is to perform echo cancellation processing on the voice signal collected by at least one external voice collector.
  • While the user plays audio data through the headset, the voice signal collected by the at least one external voice collector may include an echo signal in addition to the external environmental sound signal. The echo signal is the sound emitted by the headset's speaker and picked up by the external voice collector.
  • For example, while audio is playing, the external voice collector of the headset collects not only the voice signal in the external environment but also the audio data played by the speaker (that is, the echo signal), so the collected voice signal includes the echo signal.
  • Performing echo cancellation processing on the voice signal collected by the at least one external voice collector may refer to removing the echo signal from that voice signal, for example by filtering the voice signal collected by the external voice collector with an adaptive echo filter.
  • The echo signal is a kind of noise signal; eliminating it improves the signal-to-noise ratio of the voice signal and thereby the quality of the audio data played by the headset.
  • For the specific implementation of echo cancellation, refer to the descriptions in the related art; this is not specifically limited in the embodiments of the present application.
  • the fourth type is to perform noise suppression on the voice signal collected by at least one external voice collector.
  • While the user plays audio data through the headset, if several environmental sounds are present (for example, a siren, wind noise, or other people talking nearby), the voice signal collected by the at least one external voice collector will include several environmental sound signals. If the required environmental sound signal is the voice signal corresponding to the siren, noise suppression on the collected voice signal can mean reducing or eliminating every environmental sound signal other than the required one (these other signals may also be called noise signals or background noise). Eliminating the noise signals improves the signal-to-noise ratio of the collected voice signal. For example, the noise signals can be eliminated by filtering the voice signal collected by the at least one external voice collector.
  • the external voice signal may include one or more kinds of environmental sound signals, and extracting the environmental sound signal in the external voice signal may refer to extracting the required environmental sound signal from the external voice signal.
  • the external voice signal includes various environmental sound signals such as siren sound and wind sound. If the required environmental sound signal is a siren sound, the environmental sound signal corresponding to the siren sound in the external voice signal can be extracted.
  • the extraction of the environmental sound signal from the external voice signal in this application may include the following two different implementation manners, as described below.
  • the first type is to perform coherence processing on the external voice signal and the sample voice signal to obtain the environmental sound signal.
  • The sample voice signal may be a voice signal stored inside the processing circuit, and the earphone may obtain it by collecting it in advance through an external voice collector.
  • For example, a siren is played in advance in a low-noise environment and collected through the earphone; the collected voice signal undergoes noise reduction and other processing and is then stored as the sample voice signal in the processing circuit of the earphone.
  • In addition, signal correlation can refer to the synchronous similarity between two signals. For example, two signals are correlated if some characteristic of both (such as amplitude, frequency, or phase) changes synchronously over a certain period of time and follows a similar pattern of change.
  • Correlation processing of two signals can be achieved by determining the coherence coefficient of the two signals.
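  • For any two signals x and y, the coherence coefficient is defined as a function of the power spectral density (PSD) and the cross spectral density (CSD), and can be determined by formula (1):
  • $\mathrm{Coh}_{xy}^{2} = \dfrac{|P_{xy}(f)|^{2}}{P_{xx}(f) \times P_{yy}(f)}$    (1)
  • Here $P_{xx}(f)$ and $P_{yy}(f)$ denote the PSD of signal x and signal y, respectively, and $P_{xy}(f)$ denotes the CSD between signal x and signal y. $\mathrm{Coh}_{xy}$ is the coherence coefficient of x and y at frequency f, with $0 \le \mathrm{Coh}_{xy} \le 1$: x and y are incoherent if $\mathrm{Coh}_{xy} = 0$ and fully coherent if $\mathrm{Coh}_{xy} = 1$.
  • When the signal x and the signal y in formula (1) are the external voice signal and the sample voice signal, respectively, formula (1) gives the coherence processing of the external voice signal and the sample voice signal.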
  • When the processing circuit obtains the external voice signal, it can use the sample voice signal to perform coherence processing on the external voice signal and extract from it the highly coherent part (for example, where the coherence coefficient is equal to or close to 1), that is, extract the environmental sound signal from the external voice signal.
  • Because the sample voice signal is a pre-collected voice signal with a high signal-to-noise ratio that corresponds to a particular environmental sound, and the extracted environmental sound signal is highly coherent with it, the extracted environmental sound signal corresponds to the same environmental sound as the sample voice signal and also has a high signal-to-noise ratio.
  • For example, the processing circuit can apply a Fourier transform to the external voice signal x and to the sample voice signal y to obtain F(x) and F(y); multiply F(x) and F(y) to obtain the cross spectral density $P_{xy}(f)$ of x and y; multiply F(x) by its conjugate to obtain the power spectral density $P_{xx}(f)$ of x; and multiply F(y) by its conjugate to obtain the power spectral density $P_{yy}(f)$ of y.
  • Substituting $P_{xy}(f)$, $P_{xx}(f)$ and $P_{yy}(f)$ into formula (1) gives the coherence coefficient of the external voice signal x and the sample voice signal y, from which the highly similar environmental sound signal is then obtained.
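The computation described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the application: the function names `dft` and `coherence`, the frame length, and the test signals are all hypothetical. Two details are assumptions beyond the text: the cross spectrum is formed with the complex conjugate of F(y), as in the usual definition, and the spectra are averaged over several frames, because a coherence estimate from a single frame equals 1 at every frequency and carries no information.

```python
import cmath
import math
import random

def dft(frame):
    """Naive discrete Fourier transform (O(N^2)); adequate for short illustrative frames."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]

def coherence(x, y, frame_len=64):
    """Squared coherence per frequency bin, averaged over non-overlapping frames."""
    bins = frame_len // 2 + 1
    pxx = [0.0] * bins
    pyy = [0.0] * bins
    pxy = [0j] * bins
    usable = min(len(x), len(y))
    for start in range(0, usable - frame_len + 1, frame_len):
        fx = dft(x[start:start + frame_len])
        fy = dft(y[start:start + frame_len])
        for k in range(bins):
            pxx[k] += abs(fx[k]) ** 2             # F(x) times its conjugate
            pyy[k] += abs(fy[k]) ** 2             # F(y) times its conjugate
            pxy[k] += fx[k] * fy[k].conjugate()   # cross spectrum (conjugate is an assumption)
    # Formula (1); the small constant only guards against division by zero.
    return [abs(pxy[k]) ** 2 / (pxx[k] * pyy[k] + 1e-12) for k in range(bins)]

# Hypothetical data: the same tone (bin 8 of a 64-sample frame) buried in independent noise.
rng = random.Random(1)
n_samples = 64 * 16
tone = [math.sin(2 * math.pi * 8 * t / 64) for t in range(n_samples)]
sample = [s + 0.3 * rng.gauss(0, 1) for s in tone]           # stored sample signal y
external = [0.5 * s + 0.3 * rng.gauss(0, 1) for s in tone]   # external signal x
coh = coherence(external, sample)
peak_bin = max(range(len(coh)), key=lambda k: coh[k])        # bin with the highest coherence
```

In this sketch the bin that carries the shared tone has a coherence close to 1, while bins that contain only independent noise stay low, which is the property the extraction step relies on.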
  • The second applies when the at least one external voice collector includes at least two external voice collectors: coherence processing is performed on the external voice signals corresponding to the at least two external voice collectors to obtain the environmental sound signal.
  • The at least two external voice collectors may include two or more external voice collectors. The voice signal collected by each external voice collector is preprocessed to obtain an external voice signal, so the at least two external voice collectors yield at least two external voice signals.
  • Since the at least two external voice collectors can collect the same environmental sound, each of the at least two external voice signals includes the environmental sound signal corresponding to that sound, and coherence processing of these signals yields the environmental sound signal.
  • For example, with two external voice collectors, the processing circuit can perform coherence processing on the first external voice signal and the second external voice signal to obtain the environmental sound signal.
  • S303 Perform sound mixing processing on the first voice signal and the environmental sound signal according to the amplitude and phase of the first voice signal and the environmental sound signal, and the position of at least one external voice collector, to obtain a target voice signal.
  • the first voice signal may be a voice signal to be played.
  • For example, the first voice signal may be a song to be played, the voice of the other party in a call, the user's own voice, or another voice signal to be played.
  • the first voice signal may be transmitted to the processing circuit of the earphone by an electronic device connected to the earphone, or may be collected by the earphone through an ear canal voice collector or other voice collectors.
  • Mixing the first voice signal and the environmental sound signal may include: adjusting at least one of the amplitude, phase, or output delay of the first voice signal; and/or adjusting at least one of the amplitude, phase, or output delay of the environmental sound signal; and fusing the adjusted first voice signal and the adjusted environmental sound signal into one voice signal to obtain the target voice signal.
  • the processing circuit may perform mixing processing on the first voice signal and the ambient sound signal according to a preset mixing rule.
  • The mixing rule may be set by a person skilled in the art according to the actual situation, or obtained by training on voice data; the embodiments of this application do not limit the specific mixing rule.
  • For example, the amplitude of the environmental sound signal can be increased to a preset amplitude threshold, and the output delay of the environmental sound signal can also be adjusted, so that the environmental sound signal stands out in the fused target voice signal.
  • When the environmental sound signal is a siren, the user can then clearly hear the siren when the target voice signal is played, which improves the user's safety outdoors.
  • As another example, the environmental sound signal can be widened before it is combined with the first voice signal.
  • When the environmental sound signal is an indoor baby's cry or a person talking, it is then rendered in stereo form, so that the user clearly hears the cry or the speech right away and does not need to take off the earphone to listen for the baby or to talk with family members.
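The mixing steps above (adjust amplitude and output delay, then fuse) can be sketched as follows. This is only an illustration under assumptions: the function names `mix` and `gain_to_reach`, the whole-sample delay, and the hard clipping to the range [-1, 1] are hypothetical choices, and the application leaves the actual mixing rule open.

```python
def gain_to_reach(samples, target_peak):
    """Gain that raises the peak of `samples` to `target_peak` (never attenuates)."""
    peak = max(abs(s) for s in samples)
    return max(1.0, target_peak / peak) if peak > 0 else 1.0

def mix(first, ambient, ambient_gain=1.0, ambient_delay=0, first_gain=1.0):
    """Scale both signals, delay the ambient one by whole samples, then add them."""
    delayed = [0.0] * ambient_delay + [a * ambient_gain for a in ambient]
    length = max(len(first), len(delayed))
    out = []
    for n in range(length):
        f = first[n] * first_gain if n < len(first) else 0.0
        a = delayed[n] if n < len(delayed) else 0.0
        out.append(max(-1.0, min(1.0, f + a)))   # hard clip to the [-1, 1] sample range
    return out

# Hypothetical usage: raise a faint ambient sound to a preset peak of 0.4,
# delay it by one sample, and fuse it with the signal being played.
first_signal = [0.5, 0.5, 0.5, 0.5]
ambient_signal = [0.1, -0.1]
g = gain_to_reach(ambient_signal, 0.4)
target = mix(first_signal, ambient_signal, ambient_gain=g, ambient_delay=1)
```

A real mixer would normally use fractional delays and a smoother limiter than hard clipping; those details are outside what the text specifies.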
  • the earphone further includes an ear canal voice collector.
  • the method further includes: S300.
  • S300 and S301-S302 may be performed in any order.
  • parallel execution of S300 and S301-S302 is taken as an example for illustration.
  • S300 Preprocess the voice signal collected by the ear canal voice collector to obtain the first voice signal.
  • the ear canal voice collector can be an ear canal microphone or an ear bone pattern sensor.
  • When the user wears the headset, the ear canal voice collector is located in the user's ear canal, where the voice signal has little interference and a narrow frequency band.
  • The ear canal voice collector can collect the voice signal in the ear canal, and the collected voice signal has low noise and a narrow frequency band.
  • The frequency band may be a low-to-mid frequency band, for example 100 Hz to 4 kHz, or 200 Hz to 5 kHz.
  • The ear canal voice collector can transmit the voice signal to the processing circuit, which preprocesses it. For example, the processing circuit performs single-channel denoising on the voice signal collected by the ear canal voice collector to obtain the first voice signal.
  • the first voice signal is the voice signal after removing the noise in the voice signal collected by the ear canal voice collector.
  • The first voice signal obtained may include the user's call voice signal or self-voice signal.
  • the first voice signal may also include an environmental sound signal, and the environmental sound signal and the environmental sound signal in S303 come from the same sound source.
  • Preprocessing the voice signal collected by the ear canal voice collector may include performing at least one of the following on it: amplitude adjustment, gain enhancement, echo cancellation, or noise suppression. The method is similar to the preprocessing of the voice signal collected by the at least one external voice collector described in S301, so the methods described in S301 can be used.
  • Correspondingly, S303 may specifically be: according to the amplitude and phase of the first voice signal and the environmental sound signal, and the positions of the at least one external voice collector and the ear canal voice collector, perform mixing processing on the first voice signal and the environmental sound signal to obtain the target voice signal.
  • For example, according to the positions of the ear canal voice collector and the external voice collector, and the amplitude difference and/or phase difference of the same environmental sound signal as collected by the two, determine the distance between the sound source of the environmental sound signal and the user; then, based on that distance, adjust at least one of the amplitude, phase, or output delay of the environmental sound signal, and/or at least one of the amplitude, phase, or output delay of the first voice signal; and fuse the adjusted first voice signal and the adjusted environmental sound signal into one voice signal to obtain the target voice signal.
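The text does not say how the distance is derived from the amplitude or phase difference. The sketch below shows two textbook relations that could serve as a starting point, under idealized assumptions that are not taken from the application: a point source in free field whose amplitude falls off as 1/r, a source lying on the line through both microphones, and a phase difference smaller than half a cycle. The function names and the constant `SPEED_OF_SOUND` are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at about 20 degrees Celsius

def path_difference(phase_diff_rad, freq_hz):
    """Extra path length implied by a phase lag at one frequency (valid for |phase| < pi)."""
    delay = phase_diff_rad / (2.0 * math.pi * freq_hz)   # phase lag converted to seconds
    return SPEED_OF_SOUND * delay

def distance_from_amplitudes(amp_near, amp_far, mic_spacing):
    """Distance from the nearer microphone to the source, assuming amplitude ~ 1/r
    and a source on the line through both microphones."""
    ratio = amp_near / amp_far
    if ratio <= 1.0:
        return math.inf   # no measurable level drop: the source is effectively far away
    # amp_near / amp_far = (r + d) / r, hence r = d / (ratio - 1)
    return mic_spacing / (ratio - 1.0)

# Hypothetical numbers: a quarter-cycle lag at 1 kHz, and a 2:1 level ratio
# measured across microphones 2 cm apart.
extra_path = path_difference(math.pi / 2, 1000.0)
source_distance = distance_from_amplitudes(1.0, 0.5, 0.02)
```

In an actual earphone the ear canal microphone is acoustically shielded by the ear tip, so the measured level difference would need calibration before a relation like this could be applied.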
  • the processing circuit may output the target voice signal.
  • The processing circuit may output the target voice signal to the speaker of the earphone to play it. Since the target voice signal is obtained by fusing the adjusted first voice signal and the adjusted environmental sound signal, a user wearing the headset can hear both a clear and natural first voice signal and the environmental sound signal from the external environment.
  • the ambient sound signal in the target voice signal is an adjusted signal, so that the ambient sound signal heard by the user will not cause discomfort such as harshness or inaudibility, thereby improving the quality of the voice signal and the user experience.
  • the processing circuit may further perform other processing on the target voice signal to further improve the signal-to-noise ratio of the target voice signal.
  • the processing circuit may perform at least one of the following processing on the target voice signal: noise suppression, equalization processing, data packet loss compensation, automatic gain control, or dynamic range adjustment.
  • the speech signal may generate new noise signals in the process of processing.
  • the speech signal generates new noise in the process of noise reduction and/or coherence processing. That is, the target speech signal will include noise signals.
  • Noise suppression processing can reduce or eliminate the noise signal in the target speech signal, thereby improving the signal-to-noise ratio of the target speech signal.
  • the voice signal may cause data packet loss during the transmission process.
  • For example, if data packets are lost while the voice signal is transmitted from the voice collector to the processing circuit, the data packets corresponding to the target voice signal may be incomplete, which affects call quality when the target voice signal is output.
  • Performing packet loss compensation processing on the target voice signal addresses the packet loss and thereby improves call quality when the target voice signal is output.
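The application does not specify a compensation algorithm. One simple and widely used family of techniques repeats the last correctly received frame at reduced level; the sketch below illustrates that idea only. The function name `conceal_lost_frames`, the use of `None` to mark a lost frame, and the fade factor are assumptions made for this example.

```python
def conceal_lost_frames(frames, frame_len, fade=0.5):
    """Replace each lost frame (None) with an attenuated copy of the previous output frame."""
    out = []
    last = [0.0] * frame_len          # silence if the very first frame is lost
    for frame in frames:
        if frame is None:
            # Repeat the previous frame, fading it so long gaps decay toward silence.
            frame = [s * fade for s in last]
        out.append(frame)
        last = frame
    return out

# Hypothetical usage: two consecutive frames are lost in the middle of a stream.
received = [[1.0, -1.0], None, None, [0.2, 0.2]]
repaired = conceal_lost_frames(received, frame_len=2)
```

Because each repeated frame is derived from the previous output frame, consecutive losses fade progressively instead of looping the same audio at full level.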
  • the gain of the target voice signal obtained by the processing circuit may be larger or smaller, which will affect the quality of the call when the target voice signal is output.
  • the gain of the target voice signal is adjusted to an appropriate range, thereby improving the quality of target voice playback and user experience.
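A frame-based automatic gain control is one way to bring the gain into an appropriate range. The sketch below is illustrative only: the function name, the target level, the frame length, the gain ceiling, and the smoothing factor are hypothetical values chosen for the example, not parameters from the application.

```python
import math

def automatic_gain_control(samples, target_rms=0.1, frame_len=160,
                           max_gain=10.0, smooth=0.9):
    """Scale each frame toward `target_rms`, smoothing the gain between frames."""
    out = []
    gain = 1.0
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        level = math.sqrt(sum(s * s for s in frame) / len(frame))
        # Gain that would hit the target exactly, capped so silence is not blown up.
        wanted = min(max_gain, target_rms / level) if level > 0 else gain
        gain = smooth * gain + (1.0 - smooth) * wanted   # avoid audible gain jumps
        out.extend(s * gain for s in frame)
    return out

# Hypothetical usage: a loud 1 kHz tone sampled at 16 kHz is pulled down to the target level.
fs = 16000
loud = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(160 * 100)]
levelled = automatic_gain_control(loud)
```

The cap `max_gain` means a very quiet input is raised only up to a limit, which keeps background noise from being amplified without bound.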
  • the headset includes hardware structures and/or software modules corresponding to each function.
  • the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a certain function is executed by hardware or computer software-driven hardware depends on the specific application and design constraint conditions of the technical solution. Professionals and technicians can use different methods for each specific application to implement the described functions, but such implementation should not be considered beyond the scope of this application.
  • the embodiment of the present application may divide the functional modules of the headset according to the foregoing method examples.
  • each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or software functional modules. It should be noted that the division of modules in the embodiments of the present application is illustrative, and is only a logical function division, and there may be other division methods in actual implementation.
  • FIG. 5 shows a possible structural schematic diagram of a voice signal processing apparatus involved in the foregoing embodiment.
  • the device includes: at least one external voice collector 502, and the device further includes a processing unit 503 and an output unit 504.
  • the processing unit 503 may be a DSP, a micro-processing circuit, an application specific integrated circuit, a field programmable gate array or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof.
  • the output unit 504 may be an output interface, a communication interface, a speaker, or the like. Further, the device may also include an ear canal voice collector 501.
  • the processing unit 503 is configured to preprocess the voice signal collected by at least one external voice collector 502 to obtain an external voice signal; the processing unit 503 is also configured to extract environmental sound signals from the external voice signal; the processing unit 503 is further configured to perform mixing processing on the first voice signal and the environmental sound signal according to the amplitude and phase of the first voice signal and the environmental sound signal, and the position of at least one external voice collector, to obtain the target voice signal.
  • the output unit 504 is configured to output the target voice signal.
  • The processing unit 503 is specifically configured to: adjust at least one of the amplitude, phase, or output delay of the first voice signal; and/or adjust at least one of the amplitude, phase, or output delay of the environmental sound signal; and fuse the adjusted first voice signal and the adjusted environmental sound signal into one voice signal.
  • the processing unit 503 is further specifically configured to: perform coherent processing on the external voice signal and the sample voice signal to obtain an environmental sound signal; or, at least one external voice collector includes at least two external voice collectors, Coherent processing is performed on the external voice signals corresponding to at least two external voice collectors to obtain the environmental sound signal.
  • the processing unit 503 is further configured to: preprocess the voice signal collected by the ear canal voice collector to obtain the first voice signal.
  • the processing unit 503 performs at least one of the following processing on the voice signal collected by the ear canal voice collector: amplitude adjustment, gain enhancement, echo cancellation or noise suppression.
  • the processing unit 503 is further specifically configured to perform at least one of the following processing on the voice signal collected by at least one external voice collector: amplitude adjustment, gain enhancement, echo cancellation or noise suppression.
  • processing unit 503 is further configured to: perform at least one of the following processing on the output target voice signal: noise suppression, equalization processing, data packet loss compensation, automatic gain control, or dynamic range adjustment.
  • the ear canal voice collector 501 includes: an ear canal microphone or an ear bone pattern sensor; the at least one external voice collector 502 includes: a call microphone and a noise reduction microphone.
  • FIG. 6 is a schematic structural diagram of a voice signal processing device provided by an embodiment of the application.
  • In this example, the ear canal voice collector 501 is an ear canal microphone, and the at least one external voice collector 502 includes a call microphone and a noise reduction microphone.
  • The processing unit 503 is a DSP, and the output unit 504 is a loudspeaker.
  • The external voice collector 502 is located outside the user's ear canal when the user wears the headset, so preprocessing the voice signal collected by the at least one external voice collector yields the external voice signal. Extracting the environmental sound signal from the external voice signal yields the required environmental sound signal, and mixing the first voice signal with the environmental sound signal yields the target voice signal. When the target voice signal is played, the user hears a clear and natural first voice signal together with the important environmental sound signals of the external environment, so environmental sound is monitored and the monitoring effect and user experience are improved.
  • In another embodiment of this application, a computer-readable storage medium is provided. The storage medium stores instructions, and when a device (which may be a single-chip microcomputer, a chip, a processing circuit, or the like) runs the instructions, the device executes the voice signal processing method provided above.
  • the aforementioned computer-readable storage media may include: U disk, mobile hard disk, read-only memory, random access memory, magnetic disk or optical disk, and other media that can store program codes.
  • In another embodiment of this application, a computer program product is provided. The computer program product includes instructions stored in a computer-readable storage medium; when a device (which may be a single-chip microcomputer, a chip, a processing circuit, or the like) executes the instructions, the device executes the voice signal processing method provided above.
  • the aforementioned computer-readable storage media may include: U disk, mobile hard disk, read-only memory, random access memory, magnetic disk or optical disk, and other media that can store program codes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Headphones And Earphones (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Telephone Function (AREA)

Abstract

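This application provides a voice signal processing method and apparatus, relating to signal processing technology and the field of earphones, for monitoring environmental sound signals and improving the monitoring effect and user experience. The method is applied to an earphone that includes at least one external voice collector, and includes: preprocessing the voice signal collected by the at least one external voice collector to obtain an external voice signal; extracting an environmental sound signal from the external voice signal; and mixing a first voice signal and the environmental sound signal according to the amplitudes and phases of the first voice signal and the environmental sound signal and the position of the at least one external voice collector, to obtain a target voice signal.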

Description

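Voice signal processing method and apparatus
This application claims priority to Chinese patent application No. 201911359322.4, entitled "Voice signal processing method and apparatus", filed with the China National Intellectual Property Administration on December 25, 2019, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to signal processing technology and the field of earphones, and in particular to a voice signal processing method and apparatus.
Background
To create a better listening environment and achieve better sound, existing earphones use various noise reduction techniques to isolate or intelligently cancel other sounds in the surroundings. However, once environmental sound is isolated, the user can hardly hear the surroundings, which causes problems of its own. For example, when the user needs to talk with someone nearby, the user has to take off the earphone to hear the other person. As another example, when the user is walking outdoors, it is difficult to hear vehicle horns, so passing vehicles can create dangerous situations. An earphone with an environmental sound monitoring function is therefore needed.
FIG. 1 is a schematic diagram of an earphone in the prior art. The earphone is provided with a noise microphone (MIC), denoted MIC1 in FIG. 1, which is close to the user's ear when the earphone is worn. For an earphone provided with MIC1, the prior art usually monitors environmental sound as follows: in an active noise cancellation (ANC) chip, the voice signal collected by MIC1 is filtered by a high-pass filter and a low-pass filter to retain the voice signal in a certain frequency band, and the retained voice signal is optimized by an equalizer (EQ) and then output through the speaker. However, the environmental sound signal monitored in this way sounds very unnatural, so the monitoring effect is poor.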
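Summary
The technical solution of this application provides a voice signal processing method and apparatus for monitoring environmental sound signals and improving the monitoring effect and user experience.
In a first aspect, the technical solution of this application provides a voice signal processing method applied to an earphone that includes at least one external voice collector. The method includes: preprocessing the voice signal collected by the at least one external voice collector to obtain an external voice signal, where the preprocessing may include processing that improves the signal-to-noise ratio of the external voice signal, such as noise reduction or adjustment of amplitude or gain; extracting an environmental sound signal from the external voice signal, for example a siren, a broadcast announcement, or a baby's cry; and mixing a first voice signal and the environmental sound signal according to the amplitudes and phases of the first voice signal and the environmental sound signal and the position of the at least one external voice collector, to obtain a target voice signal. The first voice signal may be a to-be-played voice signal transmitted to the earphone by an electronic device connected to it, such as a song or a broadcast; or the first voice signal may be a voice signal collected by a microphone of the earphone, such as the user's call voice.
In the above technical solution, the external voice collector is located outside the user's ear canal when the earphone is worn, so preprocessing the voice signal collected by the at least one external voice collector yields the external voice signal. Extracting the environmental sound signal from the external voice signal yields the required environmental sound signal, and mixing the first voice signal with the environmental sound signal yields the target voice signal. When the target voice signal is played, the user hears a clear, natural first voice signal together with the important environmental sound signals of the external environment, so environmental sound is monitored and the monitoring effect and user experience are improved.
In a possible implementation of the first aspect, mixing the first voice signal and the environmental sound signal includes: adjusting at least one of the amplitude, phase, or output delay of the first voice signal; and/or adjusting at least one of the amplitude, phase, or output delay of the environmental sound signal; and fusing the adjusted first voice signal and the adjusted environmental sound signal into one voice signal. In this implementation, adjusting the two signals makes the first voice signal sound clear and natural while keeping the environmental sound signal from being harsh or inaudible, which improves the quality of the voice signal and the user experience.
In a possible implementation of the first aspect, extracting the environmental sound signal from the external voice signal includes: performing coherence processing on the external voice signal and a sample voice signal to obtain the environmental sound signal. The coherence processing may include: determining the power spectral density of the external voice signal, the power spectral density of the sample voice signal, and the cross spectral density of the two; determining the coherence coefficient of the external voice signal and the sample voice signal from the power spectral densities and the cross spectral density; and determining the environmental sound signal from the coherence coefficient, for example by taking the part of the external voice signal whose coherence coefficient is equal to or close to 1 as the environmental sound signal. This way of extracting the environmental sound signal is accurate, and the resulting environmental sound signal has a high signal-to-noise ratio.
In a possible implementation of the first aspect, the at least one external voice collector includes at least two external voice collectors, and extracting the environmental sound signal from the external voice signal includes: performing coherence processing on the external voice signals corresponding to the at least two external voice collectors to obtain the environmental sound signal, where the external voice signal corresponding to each external voice collector is the external voice signal obtained by preprocessing the voice signal collected by that collector. In this implementation, extracting the environmental sound signal through coherence processing is accurate, and the resulting environmental sound signal has a high signal-to-noise ratio.
In a possible implementation of the first aspect, the earphone further includes an ear canal voice collector, and the method further includes: preprocessing the voice signal collected by the ear canal voice collector to obtain the first voice signal. The first voice signal may include only a user voice signal (for example, the user's self-voice signal), or both a user voice signal and an environmental sound signal. Correspondingly, mixing the first voice signal and the environmental sound signal according to their amplitudes and phases and the position of the at least one external voice collector includes: mixing the first voice signal and the environmental sound signal according to their amplitudes and phases and the positions of the at least one external voice collector and the ear canal voice collector. For example, when the at least one external voice collector is at position 1 and the amplitude difference between the first voice signal and the environmental sound signal is less than an amplitude threshold, the amplitude of the environmental sound signal is increased to a preset amplitude threshold and the output delay of the environmental sound signal is adjusted. As another example, when the at least one external voice collector is at position 2 and the time difference between adjacent amplitudes of the first voice signal and the environmental sound signal is less than a time difference threshold, the environmental sound signal is widened and an output delay is set. In this implementation, the first voice signal is obtained by preprocessing the voice signal collected by the ear canal voice collector, so the user hears a clear, natural self-voice signal (such as a call voice signal) when the target voice signal is played, which improves call quality.
In a possible implementation of the first aspect, preprocessing the voice signal collected by the ear canal voice collector includes performing at least one of the following on it: amplitude adjustment, gain enhancement, echo cancellation, or noise suppression. The voice signal collected by the ear canal voice collector may have a small amplitude or low gain, and may contain noise signals such as echo signals or environmental noise; performing at least one of amplitude adjustment, gain enhancement, echo cancellation, or noise suppression on it effectively reduces the noise signals and improves the signal-to-noise ratio.
In a possible implementation of the first aspect, the ear canal voice collector includes at least one of an ear canal microphone or an ear bone pattern sensor. This implementation increases the variety and flexibility of the ear canal voice collector.
In a possible implementation of the first aspect, preprocessing the voice signal collected by the at least one external voice collector includes performing at least one of the following on it: amplitude adjustment, gain enhancement, echo cancellation, or noise suppression. The voice signal collected by the at least one external voice collector may have a small amplitude or low gain, and may contain noise signals such as echo signals and environmental noise; performing at least one of the above on it effectively reduces the noise signals and improves the signal-to-noise ratio.
In a possible implementation of the first aspect, the method further includes: performing at least one of the following on the target voice signal and outputting it: noise suppression, equalization, packet loss compensation, automatic gain control, or dynamic range adjustment. The voice signal may pick up new noise signals during processing and may lose data packets during transmission; performing at least one of the above on the target voice signal to be output effectively improves its signal-to-noise ratio, the call quality, and the user experience.
In a possible implementation of the first aspect, the at least one external voice collector includes a call microphone or a noise reduction microphone.
In a possible implementation of the first aspect, when the earphone includes an ear canal microphone and a call microphone, mixing the first voice signal and the environmental sound signal according to their amplitudes and phases and the position of the at least one external voice collector includes: determining the distance between the user and the sound source of an environmental sound signal according to the positions of the ear canal microphone and the call microphone and the amplitude difference and/or phase difference of the same environmental sound signal as collected by the two microphones, and then adjusting at least one of the amplitude, phase, or output delay of the environmental sound signal and/or the first voice signal based on that distance.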
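In a second aspect, the technical solution of this application provides a voice signal processing apparatus that includes at least one external voice collector and further includes a processing unit. The processing unit is configured to preprocess the voice signal collected by the at least one external voice collector to obtain an external voice signal, where the preprocessing may include processing that improves the signal-to-noise ratio of the external voice signal, such as noise reduction or adjustment of amplitude or gain. The processing unit is further configured to extract an environmental sound signal from the external voice signal, for example a siren, a broadcast announcement, or a baby's cry. The processing unit is further configured to mix a first voice signal and the environmental sound signal according to the amplitudes and phases of the first voice signal and the environmental sound signal and the position of the at least one external voice collector, to obtain a target voice signal. The first voice signal may be a to-be-played voice signal transmitted to the earphone by an electronic device connected to it, such as a song or a broadcast; or the first voice signal may be a voice signal collected by a microphone of the earphone, such as the user's call voice.
In a possible implementation of the second aspect, the processing unit is specifically configured to: adjust at least one of the amplitude, phase, or output delay of the first voice signal; and/or adjust at least one of the amplitude, phase, or output delay of the environmental sound signal; and fuse the adjusted first voice signal and the adjusted environmental sound signal into one voice signal.
In a possible implementation of the second aspect, the processing unit is further specifically configured to perform coherence processing on the external voice signal and a sample voice signal to obtain the environmental sound signal.
In a possible implementation of the second aspect, the at least one external voice collector includes at least two external voice collectors, and the processing unit is further specifically configured to perform coherence processing on the external voice signals corresponding to the at least two external voice collectors to obtain the environmental sound signal, where the external voice signal corresponding to each external voice collector is the external voice signal obtained by preprocessing the voice signal collected by that collector. In a possible embodiment, the processing unit is specifically configured to: determine the power spectral density of the external voice signal, the power spectral density of the sample voice signal, and the cross spectral density of the two; determine the coherence coefficient of the external voice signal and the sample voice signal from the power spectral densities and the cross spectral density; and determine the environmental sound signal from the coherence coefficient, for example by taking the part of the external voice signal whose coherence coefficient is equal to or close to 1 as the environmental sound signal.
In a possible implementation of the second aspect, the earphone further includes an ear canal voice collector, and the processing unit is further configured to preprocess the voice signal collected by the ear canal voice collector to obtain the first voice signal. Correspondingly, the processing unit is further specifically configured to mix the first voice signal and the environmental sound signal according to their amplitudes and phases and the positions of the at least one external voice collector and the ear canal voice collector. For example, when the at least one external voice collector is at position 1 and the amplitude difference between the first voice signal and the environmental sound signal is less than an amplitude threshold, the amplitude of the environmental sound signal is increased to a preset amplitude threshold and the output delay of the environmental sound signal is adjusted. As another example, when the at least one external voice collector is at position 2 and the time difference between adjacent amplitudes of the first voice signal and the environmental sound signal is less than a time difference threshold, the environmental sound signal is widened and an output delay is set.
In a possible implementation of the second aspect, the processing unit is further configured to perform at least one of the following on the voice signal collected by the ear canal voice collector: amplitude adjustment, gain enhancement, echo cancellation, or noise suppression.
In a possible implementation of the second aspect, the ear canal voice collector includes at least one of an ear canal microphone or an ear bone pattern sensor.
In a possible implementation of the second aspect, the processing unit is further configured to perform at least one of the following on the voice signal collected by the at least one external voice collector: amplitude adjustment, gain enhancement, echo cancellation, or noise suppression.
In a possible implementation of the second aspect, the processing unit is further configured to perform at least one of the following on the target voice signal and output it: noise suppression, equalization, packet loss compensation, automatic gain control, or dynamic range adjustment.
In a possible implementation of the second aspect, the at least one external voice collector includes a call microphone or a noise reduction microphone.
In a possible implementation of the second aspect, when the apparatus includes an ear canal microphone and a call microphone, the processing unit is specifically configured to: determine the distance between the user and the sound source of an environmental sound signal according to the positions of the ear canal microphone and the call microphone and the amplitude difference and/or phase difference of the same environmental sound signal as collected by the two microphones, and then adjust at least one of the amplitude, phase, or output delay of the environmental sound signal and/or the first voice signal based on that distance.
In a possible implementation of the second aspect, the voice signal processing apparatus is an earphone. For example, the earphone may be a wireless earphone or a wired earphone, and the wireless earphone may be a Bluetooth earphone, a WiFi earphone, an infrared earphone, or the like.
In another aspect of the technical solution of this application, a computer-readable storage medium is provided. The computer-readable storage medium stores instructions that, when run on a device, cause the device to execute the voice signal processing method provided by the first aspect or any possible implementation of the first aspect.
In another aspect of the technical solution of this application, a computer program product is provided. When the computer program product runs on a device, it causes the device to execute the voice signal processing method provided by the first aspect or any possible implementation of the first aspect.
It can be understood that the apparatus, computer storage medium, and computer program product provided above are all used to execute the corresponding method provided above. For their beneficial effects, refer to the beneficial effects of the corresponding method; details are not repeated here.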
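Brief Description of the Drawings
FIG. 1 is a schematic diagram of the microphone layout in an earphone;
FIG. 2 is a schematic diagram of the layout of voice collectors in an earphone according to an embodiment of this application;
FIG. 3 is a schematic flowchart of a signal processing method according to an embodiment of this application;
FIG. 4 is a schematic flowchart of another signal processing method according to an embodiment of this application;
FIG. 5 is a schematic structural diagram of a voice signal processing apparatus according to an embodiment of this application;
FIG. 6 is a schematic structural diagram of another voice signal processing apparatus according to an embodiment of this application.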
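Detailed Description
In the embodiments of this application, "at least one" means one or more, and "multiple" means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" can mean A alone, both A and B, or B alone, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it. "At least one of the following" or a similar expression refers to any combination of the listed items, including any combination of single or plural items. For example, at least one of a, b, or c can mean: a; b; c; a and b; a and c; b and c; or a, b and c, where each of a, b, and c may be single or multiple. In addition, words such as "first" and "second" in the embodiments of this application do not limit quantity or order of execution.
It should be noted that, in the embodiments of this application, words such as "exemplary" or "for example" are used to present an example, illustration, or explanation. Any embodiment or design described as "exemplary" or "for example" should not be construed as preferred over, or more advantageous than, other embodiments or designs. Rather, such words are intended to present the related concepts in a concrete manner.
FIG. 2 is a schematic diagram of the layout of voice collectors in an earphone according to an embodiment of this application. The earphone may be provided with at least two voice collectors, each of which can be used to collect voice signals; for example, each voice collector may be a microphone or a sound sensor. The at least two voice collectors may include an ear canal voice collector and an external voice collector. The ear canal voice collector is a voice collector located inside the user's ear canal when the earphone is worn, and the external voice collector is a voice collector located outside the user's ear canal when the earphone is worn.
FIG. 2 takes as an example the case in which the at least two voice collectors are three voice collectors, denoted MIC1, MIC2, and MIC3. MIC1 and MIC2 are external voice collectors: when the earphone is worn, MIC1 is close to the wearer's ear and MIC2 is close to the wearer's mouth. MIC3 is an ear canal voice collector: when the earphone is worn, MIC3 is located in the wearer's ear canal. In practice, MIC1 may be a noise reduction microphone or a feedforward microphone, MIC2 may be a call microphone, and MIC3 may be an ear canal microphone or an ear bone pattern sensor.
The earphone can be used with various electronic devices, such as mobile phones, laptops, computers, and watches, through a wired or wireless connection, to handle audio services such as media and calls for the electronic device. For example, the audio services may include call services, such as telephone calls, WeChat voice messages, audio calls, video calls, games, and voice assistants, in which the earphone plays the peer's voice data to the user or collects the user's voice data and sends it to the peer. They may also include media services, such as playing music, recordings, the sound of video files, background music in games, and incoming call alert tones. In one possible embodiment, the earphone may be a wireless earphone, such as a Bluetooth earphone, a WiFi earphone, or an infrared earphone. In another possible embodiment, the earphone may be a neck-worn, head-worn, or ear-worn earphone.
Further, the earphone may also include a processing circuit and a speaker, and the at least two voice collectors and the speaker are all connected to the processing circuit. The processing circuit can receive and process the voice signals collected by the at least two voice collectors, for example by performing noise reduction on them. The speaker can receive audio data transmitted by the processing circuit and play it to the user, for example playing the other party's voice data during a phone call, or playing audio data from the mobile phone. The processing circuit and the speaker are not shown in FIG. 2.
In some feasible embodiments, the processing circuit may include a central processing unit, a general-purpose processor, a digital signal processor (DSP), a microcontroller, a microprocessor, or the like. In addition, the processing circuit may further include other hardware circuits or accelerators, such as an application-specific integrated circuit, a field programmable gate array or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It can implement or execute the various exemplary logical blocks, modules, and circuits described in connection with the disclosure of this application. The processing circuit may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a digital signal processor and a microprocessor.
FIG. 3 is a schematic flowchart of a voice signal processing method according to an embodiment of this application. The method can be applied to the earphone shown in FIG. 2 and may specifically be executed by the processing circuit in the earphone. Referring to FIG. 3, the method includes the following steps.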
S301:预处理至少一个外部语音采集器采集到的语音信号,得到外部语音信号。
其中,至少一个外部语音采集器可以包括一个或者多个外部语音采集器。当用户佩戴该耳机时,外部语音采集器位于用户的耳道外,耳道外的语音信号具有干扰多、频段宽的特性。比如,至少一个外部语音采集器可以包括通话麦克风,当用户佩戴该耳机时,通话麦克风靠近用户的嘴巴,从而可以用于采集外部环境中的语音信号。
当用户通过该耳机连接手机等电子设备播放音乐、广播或者通话语音等音频数据时,至少一个外部语音采集器可以采集外部环境中的语音信号,采集到的语音信号具有噪声大、频段宽的特性,该频段可以是中高频段,比如,该频段可以为100Hz至10KHz。示例性的,当用户处于室外环境使用该耳机时,至少一个外部语音采集器可以采集外部环境中的汽笛声、警铃声、广播声或者周围人说话声等;当用户处于室内环境使用该耳机时,至少一个外部语音采集器可以采集室内环境中的门铃声、婴儿哭 声或者周围人说话声等。
具体的,当至少一个外部语音采集器采集到语音信号时,至少一个外部语音采集器可以将采集到的语音信号传输给处理电路,由处理电路预处理该语音信号,以去除一部分噪音信号,得到外部语音信号。比如,当至少一个外部语音采集器包括通话麦克风时,通话麦克风可以将采集到的语音信号传输至处理电路,由处理电路去除该语音信号中的一部分噪音信号。
在一种实现方式中,预处理至少一个外部语音采集器采集到的语音信号可以包括下述四种单独的处理方式,也可以包括下述四种单独的处理方式中的任意两种或者多种处理方式的结合。下面分别对这四种独立的处理方法进行介绍说明。
第一种、对至少一个外部语音采集器采集到的语音信号做幅度调整处理。
对至少一个外部语音采集器采集到的语音信号做幅度调整处理可以包括:增加该语音信号的幅度,或者减小该语音信号幅度。通过对该语音信号做幅度调整处理,可以提高该语音信号的信噪比。
示例性的,当外部环境中的语音信号的幅度较小时,至少一个外部语音采集器采集到的语音信号的幅度比较小,此时,通过增加该语音信号的幅度,可以提高该语音信号的信噪比,从而便于在后续处理时有效识别该语音信号的幅度。
第二种、对至少一个外部语音采集器采集到的语音信号做增益增强处理。
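Performing gain enhancement on the voice signal collected by the at least one external voice collector may mean amplifying that voice signal: the larger the amplification factor (that is, the larger the gain), the larger the signal value of the voice signal. The voice signal may include multiple kinds of voice signals from the external environment. For example, if the voice signal includes a voice signal corresponding to a horn sound and wind noise, amplifying the voice signal amplifies both the horn signal and the wind noise.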
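For example, when a voice signal in the external environment is weak, the gain of the voice signal collected by the at least one external voice collector is small, which may lead to large errors in subsequent processing. In this case, gain enhancement increases the gain of the voice signal and thereby reduces the processing error in subsequent processing.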
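The amplitude adjustment and gain enhancement described above can be sketched as follows. This is a minimal illustration and not the implementation of this application: the function names `apply_gain_db` and `normalize_peak`, the decibel convention, and the target peak of 0.5 are assumptions chosen for the example. As the text notes, a plain gain scales the wanted sound and the noise by the same factor.

```python
import math

def apply_gain_db(samples, gain_db):
    """Scale every sample by a gain expressed in decibels (illustrative helper)."""
    factor = 10.0 ** (gain_db / 20.0)
    return [s * factor for s in samples]

def normalize_peak(samples, target_peak=0.5):
    """Scale the signal so that its largest absolute sample equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silent input: nothing to scale
    return [s * (target_peak / peak) for s in samples]

# A weak 440 Hz tone sampled at 16 kHz stands in for a faint external sound.
quiet = [0.01 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(160)]
louder = apply_gain_db(quiet, 20.0)    # +20 dB corresponds to a factor of 10
levelled = normalize_peak(quiet, 0.5)  # amplitude adjustment to a fixed peak
```

A fixed gain is the simplest choice; a practical device would more likely adapt the gain over time, which the source leaves open.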
Manner 3: Perform echo cancellation on the voice signal collected by the at least one external voice collector.
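While the user plays audio data through the headset, the voice signal collected by the at least one external voice collector may include an echo signal in addition to the external ambient sound signal. The echo signal is the sound emitted by the headset's speaker and picked up by the external voice collector. For example, while the user plays audio data through the headset, the external voice collector of the headset collects not only the voice signals in the external environment but also the audio data played by the speaker (that is, the echo signal), so the voice signal collected by the external voice collector includes the echo signal.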
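Performing echo cancellation on the voice signal collected by the at least one external voice collector may mean removing the echo signal from that voice signal, for example, by filtering the voice signal collected by the external voice collector with an adaptive echo filter. The echo signal is a noise signal, and removing it improves the signal-to-noise ratio of the voice signal and therefore the quality of the audio data played by the headset. For a specific implementation of echo cancellation, refer to descriptions in the related art of echo cancellation. This is not specifically limited in the embodiments of this application.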
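One common form of adaptive echo filter is the normalized least-mean-squares (NLMS) filter. The source does not name a specific algorithm, so the sketch below is an assumption: it uses the signal sent to the speaker as a reference, estimates the echo that reaches the external microphone, and subtracts the estimate. The echo path (attenuation 0.6, delay of 2 samples), the filter length, and the step size are hypothetical values for the demonstration.

```python
import math
import random

def nlms_echo_cancel(mic, ref, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptive estimate of the speaker echo from the mic signal.

    mic: samples from the external microphone (ambient sound plus echo)
    ref: samples sent to the speaker (the echo reference)
    Returns the error signal, i.e. the mic signal with the estimated echo removed.
    """
    w = [0.0] * taps      # adaptive filter coefficients
    buf = [0.0] * taps    # most recent reference samples, newest first
    out = []
    for d, x in zip(mic, ref):
        buf = [x] + buf[:-1]
        y = sum(wi * bi for wi, bi in zip(w, buf))  # estimated echo
        e = d - y                                   # echo-cancelled sample
        norm = sum(b * b for b in buf) + eps
        w = [wi + mu * e * bi / norm for wi, bi in zip(w, buf)]
        out.append(e)
    return out

def power(x):
    return sum(v * v for v in x) / len(x)

random.seed(0)
ref = [random.uniform(-1.0, 1.0) for _ in range(4000)]
# Hypothetical echo path: the speaker signal reaches the mic attenuated and delayed.
echo = [0.6 * ref[n - 2] if n >= 2 else 0.0 for n in range(len(ref))]
ambient = [0.05 * math.sin(2 * math.pi * 0.01 * n) for n in range(len(ref))]
mic = [a + e for a, e in zip(ambient, echo)]
cleaned = nlms_echo_cancel(mic, ref)

# Compare the echo power left in the last 1000 samples before and after cancellation.
residual_before = power([m - a for m, a in zip(mic[-1000:], ambient[-1000:])])
residual_after = power([c - a for c, a in zip(cleaned[-1000:], ambient[-1000:])])
```

After the filter has converged, `residual_after` is a small fraction of `residual_before`, while the ambient tone passes through largely unchanged.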
Manner 4: Perform noise suppression on the voice signal collected by the at least one external voice collector.
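While the user plays audio data through the headset, if the user's environment contains multiple ambient sounds, such as a horn sound, wind noise, or the speech of other people nearby, the voice signal collected by the at least one external voice collector includes multiple ambient sound signals. If the required ambient sound signal is the voice signal corresponding to the horn sound, performing noise suppression on the voice signal collected by the at least one external voice collector may mean reducing or removing the ambient sound signals other than the required one (these may also be referred to as noise signals or background noise). Removing the noise signals improves the signal-to-noise ratio of the voice signal collected by the at least one external voice collector. For example, the noise signals can be removed by filtering the voice signal collected by the at least one external voice collector.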
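As one illustration of noise suppression by filtering, the sketch below applies a first-order high-pass filter. It rests on the assumption, not stated in the source, that the unwanted sound (for example, wind noise) is concentrated at low frequencies while the wanted sound (for example, a horn) sits higher. The 300 Hz cutoff and the test tones are hypothetical.

```python
import math

def high_pass(samples, cutoff_hz, sample_rate):
    """First-order RC high-pass filter; attenuates content below cutoff_hz."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out = []
    prev_in = 0.0
    prev_out = 0.0
    for s in samples:
        prev_out = alpha * (prev_out + s - prev_in)
        prev_in = s
        out.append(prev_out)
    return out

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

fs = 16000
rumble = [math.sin(2 * math.pi * 20 * k / fs) for k in range(fs)]   # stand-in for wind noise
horn = [math.sin(2 * math.pi * 1000 * k / fs) for k in range(fs)]   # stand-in for a horn tone
rumble_out = high_pass(rumble, 300.0, fs)
horn_out = high_pass(horn, 300.0, fs)
```

The 20 Hz component is strongly attenuated while the 1 kHz component keeps most of its level. Real noise that overlaps the wanted sound in frequency would need a more selective method than this.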
S302: Extract an ambient sound signal from the external voice signal.
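The external voice signal may include one or more ambient sound signals. Extracting the ambient sound signal from the external voice signal means extracting the required ambient sound signal from it. For example, if the external voice signal includes multiple ambient sound signals such as a horn sound and wind sound, and the required ambient sound signal is the horn sound, the ambient sound signal corresponding to the horn sound may be extracted from the external voice signal. Specifically, in this application the ambient sound signal may be extracted from the external voice signal in either of the following two manners.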
Manner I: Perform coherence processing on the external voice signal and a sample voice signal to obtain the ambient sound signal.
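The sample voice signal may be a voice signal stored in the processing circuit, and the headset may obtain it by collecting it in advance through the external voice collector. For example, a horn sound is played in advance in a low-noise environment and collected by the headset, and the collected voice signal, after a series of processing such as noise reduction, is stored as the sample voice signal in the processing circuit of the headset.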
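In addition, the correlation, or coherence, of signals may refer to the synchronous similarity between two signals. For example, if two signals are coherent, a characteristic of the two signals (such as amplitude, frequency, or phase) changes synchronously within a certain period, and the changes follow a similar pattern.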
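Coherence processing of two signals may be implemented by determining the coherence coefficient of the two signals. For any two signals x and y, the coherence coefficient is defined as a function of the power spectral density (PSD) and the cross-spectral density (CSD), and may be determined by formula (1) below. In the formula, $P_{xx}(f)$ and $P_{yy}(f)$ denote the PSDs of signal x and signal y respectively, $P_{xy}(f)$ denotes the CSD between signal x and signal y, and $\mathrm{Coh}_{xy}$ denotes the coherence coefficient of signal x and signal y at frequency f, where $0 \le \mathrm{Coh}_{xy} \le 1$. If $\mathrm{Coh}_{xy} = 0$, signal x and signal y are incoherent; if $\mathrm{Coh}_{xy} = 1$, signal x and signal y are fully coherent.
$$\mathrm{Coh}_{xy}^{2} = \frac{|P_{xy}(f)|^{2}}{P_{xx}(f) \times P_{yy}(f)} \qquad (1)$$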
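When signal x and signal y in formula (1) are the external voice signal and the sample voice signal respectively, coherence processing of the external voice signal and the sample voice signal is implemented.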
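When the processing circuit obtains the external voice signal, the processing circuit may perform coherence processing on the external voice signal by using the sample voice signal, to extract from the external voice signal the voice signal that is highly coherent with the sample (for example, with a coherence coefficient equal to or close to 1), that is, to extract the ambient sound signal. Because the sample voice signal is a pre-collected voice signal of a particular ambient sound with a high signal-to-noise ratio, and the extracted ambient sound signal is highly coherent with it, the extracted ambient sound signal and the sample voice signal are voice signals of the same ambient sound, and the extracted signal has a high signal-to-noise ratio.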
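Specifically, taking the external voice signal as signal x and the sample voice signal as signal y, the processing circuit may apply a Fourier transform to the external voice signal x and the sample voice signal y to obtain F(x) and F(y). Multiplying F(x) by the conjugate of F(y) gives the cross-spectral density $P_{xy}(f)$ of the external voice signal x and the sample voice signal y; multiplying F(x) by the conjugate of F(x) gives the power spectral density $P_{xx}(f)$ of the external voice signal x; and multiplying F(y) by the conjugate of F(y) gives the power spectral density $P_{yy}(f)$ of the sample voice signal y. Substituting $P_{xy}(f)$, $P_{xx}(f)$, and $P_{yy}(f)$ into formula (1) gives the coherence coefficient of the external voice signal x and the sample voice signal y, and the highly coherent ambient sound signal is then obtained according to the coherence coefficient.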
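The computation of formula (1) can be sketched as follows. Two points in this sketch are assumptions rather than statements from the source. First, the spectra are averaged over several frames, because an estimate computed from a single frame equals 1 at every frequency and carries no information. Second, the function names, the 64-sample frame length, and the test signals are hypothetical. The cross-spectrum is formed as F(x) multiplied by the conjugate of F(y).

```python
import cmath
import math
import random

def dft(frame):
    """Plain discrete Fourier transform of one frame (O(N^2), adequate for a sketch)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]

def coherence(x, y, frame_len=64):
    """Magnitude-squared coherence per frequency bin, averaged over frames."""
    bins = frame_len // 2 + 1
    pxx = [0.0] * bins
    pyy = [0.0] * bins
    pxy = [0j] * bins
    for start in range(0, min(len(x), len(y)) - frame_len + 1, frame_len):
        fx = dft(x[start:start + frame_len])
        fy = dft(y[start:start + frame_len])
        for k in range(bins):
            pxx[k] += abs(fx[k]) ** 2            # F(x) times its own conjugate
            pyy[k] += abs(fy[k]) ** 2            # F(y) times its own conjugate
            pxy[k] += fx[k] * fy[k].conjugate()  # cross-spectrum of x and y
    return [abs(pxy[k]) ** 2 / (pxx[k] * pyy[k]) if pxx[k] > 0 and pyy[k] > 0 else 0.0
            for k in range(bins)]

random.seed(1)
n = 64 * 32
tone_bin = 8  # the tone falls exactly on bin 8 of a 64-point frame
# Stored sample: a clean tone with a little residual noise.
sample = [math.sin(2 * math.pi * tone_bin * t / 64) + random.gauss(0.0, 0.05)
          for t in range(n)]
# External signal: the same tone, weaker and phase-shifted, buried in independent noise.
external = [0.5 * math.sin(2 * math.pi * tone_bin * t / 64 + 0.3) + random.gauss(0.0, 0.2)
            for t in range(n)]
coh = coherence(external, sample)
```

In this example the coherence is close to 1 at the tone's frequency bin and low at bins that contain only independent noise, so the bins with high coherence indicate which part of the external signal matches the stored sample.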
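Manner II: The at least one external voice collector includes at least two external voice collectors, and coherence processing is performed on the external voice signals corresponding to the at least two external voice collectors to obtain the ambient sound signal.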
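The at least two external voice collectors may include two or more external voice collectors. The voice signal collected by each external voice collector is preprocessed to obtain one external voice signal, so the at least two external voice collectors yield at least two external voice signals. Because the at least two external voice collectors can collect the same ambient sound, each of the resulting external voice signals includes the ambient sound signal corresponding to that ambient sound, and the ambient sound signal can be obtained by performing coherence processing on the at least two external voice signals.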
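For example, assume that the at least two external voice collectors include a call microphone and a noise reduction microphone. If the voice signal collected by the call microphone is preprocessed to obtain a first external voice signal, and the voice signal collected by the noise reduction microphone is preprocessed to obtain a second external voice signal, the processing circuit may perform coherence processing on the first external voice signal and the second external voice signal to obtain the ambient sound signal.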
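It should be noted that the specific process of performing coherence processing on the first external voice signal and the second external voice signal is similar to the process, in Manner I above, of performing coherence processing on the external voice signal and the sample voice signal. For details, refer to the description of Manner I. Details are not repeated here.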
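S303: Perform mixing processing on a first voice signal and the ambient sound signal based on the amplitudes and phases of the first voice signal and the ambient sound signal and on the position of the at least one external voice collector, to obtain a target voice signal.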
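The first voice signal may be a to-be-played voice signal. For example, the first voice signal may be the voice signal of a song to be played, of the other party in a call, of the user's own voice, or of other audio data to be played. In an implementation, the first voice signal may be transmitted to the processing circuit of the headset by the electronic device connected to the headset, or may be collected by the headset through another voice collector such as the ear canal voice collector.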
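Specifically, performing mixing processing on the first voice signal and the ambient sound signal may include: adjusting at least one of the amplitude, phase, or output delay of the first voice signal; and/or adjusting at least one of the amplitude, phase, or output delay of the ambient sound signal; and fusing the adjusted first voice signal and the adjusted ambient sound signal into one voice signal to obtain the target voice signal.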
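The adjust-then-fuse structure of the mixing step can be sketched as follows. This is an illustration under stated assumptions: the helper names `adjust` and `mix` are hypothetical, the phase adjustment is reduced to a polarity flip, the output delay is expressed in whole samples, and fusion is taken to be a sample-by-sample sum. The source does not fix any of these choices.

```python
def adjust(samples, gain=1.0, delay=0, invert=False):
    """Apply a gain, an optional polarity flip, and an output delay in samples."""
    sign = -1.0 if invert else 1.0
    return [0.0] * delay + [sign * gain * s for s in samples]

def mix(first, ambient):
    """Sum two signals sample by sample, padding the shorter one with silence."""
    n = max(len(first), len(ambient))
    a = list(first) + [0.0] * (n - len(first))
    b = list(ambient) + [0.0] * (n - len(ambient))
    return [u + v for u, v in zip(a, b)]

first_voice = [0.2, 0.4, -0.2, 0.0]  # stand-in for the to-be-played signal
ambient = [0.1, -0.1, 0.1]           # stand-in for the extracted ambient sound
# Halve the first voice signal, double the ambient sound and delay it by 2 samples.
target = mix(adjust(first_voice, gain=0.5), adjust(ambient, gain=2.0, delay=2))
```

Here `target` has five samples: the first two come from the first voice signal alone, and the ambient sound enters from the third sample onward.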
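In an implementation, the processing circuit may perform mixing processing on the first voice signal and the ambient sound signal according to a preset mixing rule. The mixing rule may be set by a person skilled in the art according to the actual situation, or may be obtained by training on voice data. The specific mixing rule is not limited in the embodiments of this application.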
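For example, when the position of the at least one external voice collector is position 1 and the amplitude difference between the first voice signal and the ambient sound signal is less than an amplitude threshold, the amplitude of the ambient sound signal may be increased to a preset amplitude threshold, and the output delay of the ambient sound signal may also be adjusted, so that the ambient sound signal stands out in the fused target voice signal. In this way, when the ambient sound signal is a horn sound, adjusting the amplitude and output delay of the ambient sound signal lets the user clearly hear the horn sound when the target voice signal is played, which improves the user's safety in an outdoor environment.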
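A rule of the kind described in this example could look like the sketch below. The function name `boost_ambient_if_masked`, the use of peak values as the amplitude measure, and the threshold values are all assumptions made for illustration. The source leaves the concrete mixing rule open, and the dependence on the collector position is omitted here.

```python
def peak(samples):
    """Largest absolute sample value, used here as the amplitude measure."""
    return max((abs(s) for s in samples), default=0.0)

def boost_ambient_if_masked(first, ambient, diff_threshold=0.1, preset_peak=0.8):
    """Raise the ambient peak to preset_peak when the two peaks differ by less than diff_threshold."""
    a_peak = peak(ambient)
    if a_peak == 0.0 or a_peak >= preset_peak:
        return list(ambient)  # nothing to boost
    if abs(peak(first) - a_peak) >= diff_threshold:
        return list(ambient)  # rule does not fire: leave the ambient sound unchanged
    scale = preset_peak / a_peak
    return [s * scale for s in ambient]

# Peaks of 0.5 and 0.45 differ by less than 0.1, so the ambient sound is boosted.
boosted = boost_ambient_if_masked([0.5, -0.5], [0.45, -0.2])
# Peaks of 0.9 and 0.1 differ by more than 0.1, so the ambient sound is left as it is.
unchanged = boost_ambient_if_masked([0.9], [0.1])
```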
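As another example, when the position of the at least one external voice collector is position 2 and the time difference between adjacent amplitudes of the first voice signal and the ambient sound signal is less than a time difference threshold, the ambient sound signal may be widened and given an output delay, so that the ambient sound signal is rendered in stereo form in the fused target voice signal. In this way, when the ambient sound signal is an indoor baby's cry or a person's speech, rendering it in stereo form lets the user hear the cry or the speech clearly and immediately, which avoids the inconvenience of taking off the headset to listen for the baby or to talk with family members.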
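Optionally, the headset further includes an ear canal voice collector. When the first voice signal is collected by another voice collector such as the ear canal voice collector, as shown in FIG. 4, the method further includes S300. S300 and S301-S302 may be performed in any order. FIG. 4 uses an example in which S300 and S301-S302 are performed in parallel.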
S300: Preprocess a voice signal collected by the ear canal voice collector to obtain the first voice signal.
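The ear canal voice collector may be an ear canal microphone or an ear bone pattern sensor. When the user wears the headset, the ear canal voice collector is located inside the user's ear canal, and a voice signal inside the ear canal is characterized by little interference and a narrow frequency band. When the user connects the headset to an electronic device such as a mobile phone to make a call or play audio data, the ear canal voice collector may collect the voice signal inside the ear canal. The collected voice signal has low noise and a narrow frequency range. The frequency band may be a low-to-medium frequency band, for example, 100 Hz to 4 kHz, or 200 Hz to 5 kHz.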
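When the ear canal voice collector collects a voice signal, it may transmit the voice signal to the processing circuit, and the processing circuit preprocesses the voice signal. For example, the processing circuit performs single-channel noise reduction on the voice signal collected by the ear canal voice collector to obtain the first voice signal. The first voice signal is the voice signal that remains after the noise is removed from the voice signal collected by the ear canal voice collector. For example, when the user makes a call through the headset connected to an electronic device such as a mobile phone, the first voice signal obtained after single-channel noise reduction may include the user's call voice signal or self-voice signal. In an implementation, the first voice signal may further include an ambient sound signal, and this ambient sound signal comes from the same sound source as the ambient sound signal in S303.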
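Specifically, preprocessing the voice signal collected by the ear canal voice collector may include performing at least one of the following on that voice signal: amplitude adjustment, gain enhancement, echo cancellation, or noise suppression. In other words, the method for preprocessing the voice signal collected by the ear canal voice collector is similar to the method described in S301 for preprocessing the voice signal collected by the at least one external voice collector: any one of the four processing manners described in S301 may be used, or a combination of any two or more of them. For the specific process, refer to the related description in S301. Details are not repeated here.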
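Correspondingly, when the first voice signal is collected by the ear canal voice collector, S303 may specifically be: performing mixing processing on the first voice signal and the ambient sound signal based on the amplitudes and phases of the first voice signal and the ambient sound signal, and on the positions of the at least one external voice collector and the ear canal voice collector, to obtain the target voice signal. In an implementation, the distance between the user and the sound source corresponding to the ambient sound signal is determined based on the position of the external voice collector, the position of the ear canal voice collector, and the amplitude difference and/or phase difference of the same ambient sound signal as collected by the ear canal voice collector and the external voice collector. Based on that distance, at least one of the amplitude, phase, or output delay of the ambient sound signal may be adjusted, and/or at least one of the amplitude, phase, or output delay of the first voice signal may be adjusted. The adjusted first voice signal and the adjusted ambient sound signal are then fused into one voice signal to obtain the target voice signal.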
S304: Output the target voice signal.
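When the target voice signal is obtained, the processing circuit may output it. For example, the processing circuit may output the target voice signal to the speaker of the headset for playback. Because the target voice signal is obtained by fusing the adjusted first voice signal and the adjusted ambient sound signal, a user wearing and using the headset can hear a clear, natural first voice signal together with the ambient sound signal from the external environment. In addition, the ambient sound signal in the target voice signal is an adjusted signal, so the ambient sound the user hears is neither harsh nor inaudible, which improves the quality of the voice signal and user experience.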
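In an implementation, before outputting the target voice signal, the processing circuit may further process the target voice signal to further improve its signal-to-noise ratio. Specifically, the processing circuit may perform at least one of the following on the target voice signal: noise suppression, equalization, packet loss compensation, automatic gain control, or dynamic range adjustment.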
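New noise signals may be generated while the voice signal is processed. For example, new noise may be generated during noise reduction and/or coherence processing, so the target voice signal may include noise signals. Noise suppression reduces or removes the noise signals in the target voice signal and thereby improves its signal-to-noise ratio.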
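Data packets may be lost while the voice signal is transmitted. For example, packets may be lost while the voice signal is transmitted from the voice collector to the processing circuit, so the data packets corresponding to the target voice signal may be incomplete, which affects call quality when the target voice signal is output. Performing packet loss compensation on the target voice signal addresses the packet loss and thereby improves call quality when the target voice signal is output.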
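A very simple form of packet loss compensation repeats the last correctly received frame with decreasing level. The sketch below assumes that lost frames are marked as `None` and that a decay factor of 0.5 per lost frame is acceptable. Both are illustrative choices, and the source does not specify a compensation method.

```python
def conceal_lost_frames(frames, decay=0.5):
    """Replace lost frames (None) with a decaying copy of the last good frame."""
    frame_len = next((len(f) for f in frames if f is not None), 0)
    last = [0.0] * frame_len
    out = []
    for f in frames:
        if f is None:
            last = [decay * s for s in last]  # fade out over consecutive losses
            out.append(list(last))
        else:
            last = list(f)
            out.append(list(f))
    return out

# Two consecutive frames are lost between two good frames.
frames = [[0.4, -0.4], None, None, [0.2, 0.2]]
repaired = conceal_lost_frames(frames)
```

Each lost frame is filled with the previous frame at half its level, so a gap fades toward silence instead of producing an abrupt dropout.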
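The gain of the target voice signal obtained by the processing circuit may be too large or too small, which affects call quality when the target voice signal is output. Performing automatic gain control and/or dynamic range adjustment on the target voice signal brings its gain into a suitable range and thereby improves playback quality and user experience.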
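The two operations can be sketched as a block-wise automatic gain control followed by a hard limiter. The target RMS level, the maximum gain, and the ceiling are hypothetical parameters, and a hard limiter is only the simplest way to bound the dynamic range. The source does not prescribe either algorithm.

```python
import math

def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0

def auto_gain(samples, target_rms=0.1, max_gain=10.0):
    """Scale the whole block toward target_rms, capping the gain that may be applied."""
    level = rms(samples)
    if level == 0.0:
        return list(samples)  # silent block: leave unchanged
    gain = min(target_rms / level, max_gain)
    return [s * gain for s in samples]

def limit(samples, ceiling=0.9):
    """Hard-limit peaks so that no sample exceeds the ceiling in magnitude."""
    return [max(-ceiling, min(ceiling, s)) for s in samples]

quiet_block = [0.01, -0.01, 0.01, -0.01]
levelled_block = auto_gain(quiet_block)  # RMS raised from 0.01 to 0.1
limited_block = limit([1.5, -2.0, 0.3])  # peaks clipped to +/-0.9
```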
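The foregoing mainly describes the solutions provided in the embodiments of this application from the perspective of the headset. It can be understood that, to implement the foregoing functions, the headset includes corresponding hardware structures and/or software modules for performing each function. A person skilled in the art should readily appreciate that, in combination with the steps of the examples described in the embodiments disclosed herein, this application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such an implementation should not be considered beyond the scope of this application.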
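In the embodiments of this application, the headset may be divided into functional modules according to the foregoing method examples. For example, each functional module may correspond to one function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. It should be noted that the division into modules in the embodiments of this application is illustrative and is merely a division by logical function. Other divisions are possible in actual implementation.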
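When each functional module corresponds to one function, FIG. 5 shows a possible schematic structural diagram of the voice signal processing apparatus in the foregoing embodiments. Referring to FIG. 5, the apparatus includes at least one external voice collector 502, a processing unit 503, and an output unit 504. In actual application, the processing unit 503 may be a DSP, a microprocessor circuit, an application-specific integrated circuit, a field programmable gate array or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The output unit 504 may be an output interface, a communication interface, a speaker, or the like. Further, the apparatus may include an ear canal voice collector 501.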
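In this embodiment of this application, the processing unit 503 is configured to preprocess the voice signal collected by the at least one external voice collector 502 to obtain an external voice signal. The processing unit 503 is further configured to extract an ambient sound signal from the external voice signal. The processing unit 503 is further configured to perform mixing processing on a first voice signal and the ambient sound signal based on the amplitudes and phases of the first voice signal and the ambient sound signal and on the position of the at least one external voice collector, to obtain a target voice signal. Optionally, the output unit 504 is configured to output the target voice signal.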
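In a possible implementation, the processing unit 503 is specifically configured to: adjust at least one of the amplitude, phase, and output delay of the first voice signal; adjust at least one of the amplitude, phase, and output delay of the ambient sound signal; and fuse the adjusted first voice signal and the adjusted ambient sound signal into one voice signal.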
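In an implementation, the processing unit 503 is further specifically configured to: perform coherence processing on the external voice signal and a sample voice signal to obtain the ambient sound signal; or, when the at least one external voice collector includes at least two external voice collectors, perform coherence processing on the external voice signals corresponding to the at least two external voice collectors to obtain the ambient sound signal.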
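In another possible implementation, the processing unit 503 is further configured to preprocess the voice signal collected by the ear canal voice collector to obtain the first voice signal. For example, the processing unit 503 performs at least one of the following on the voice signal collected by the ear canal voice collector: amplitude adjustment, gain enhancement, echo cancellation, or noise suppression.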
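In an implementation, the processing unit 503 is further specifically configured to perform at least one of the following on the voice signal collected by the at least one external voice collector: amplitude adjustment, gain enhancement, echo cancellation, or noise suppression.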
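Further, the processing unit 503 is configured to perform at least one of the following on the target voice signal before it is output: noise suppression, equalization, packet loss compensation, automatic gain control, or dynamic range adjustment.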
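In a possible implementation, the ear canal voice collector 501 includes an ear canal microphone or an ear bone pattern sensor, and the at least one external voice collector 502 includes a call microphone or a noise reduction microphone.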
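For example, FIG. 6 is a schematic structural diagram of a voice signal processing apparatus according to an embodiment of this application. FIG. 6 uses an example in which the ear canal voice collector 501 is an ear canal microphone, the at least one external voice collector 502 includes a call microphone and a noise reduction microphone, the processing unit 503 is a DSP, and the output unit 504 is a speaker.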
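In this embodiment of this application, the external voice collector 502 is located outside the user's ear canal when the user wears the headset, so preprocessing the voice signal collected by the at least one external voice collector yields the external voice signal. Extracting the ambient sound signal from the external voice signal yields the required ambient sound signal, and mixing the first voice signal and the ambient sound signal yields the target voice signal. When the target voice signal is played, the user can hear a clear, natural first voice signal together with the important ambient sound signals in the external environment. This achieves ambient sound monitoring and improves both the monitoring effect and user experience.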
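Another embodiment of this application further provides a computer-readable storage medium. The computer-readable storage medium stores instructions. When a device (which may be a single-chip microcomputer, a chip, a processing circuit, or the like) runs the instructions, the device performs the voice signal processing method provided above. The computer-readable storage medium may include any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, or an optical disc.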
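Another embodiment of this application further provides a computer program product. The computer program product includes instructions, and the instructions are stored in a computer-readable storage medium. When a device (which may be a single-chip microcomputer, a chip, a processing circuit, or the like) runs the instructions, the device performs the voice signal processing method provided above. The computer-readable storage medium may include any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, or an optical disc.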
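Finally, it should be noted that the foregoing descriptions are merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.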

Claims (20)

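  1. A voice signal processing method, applied to a headset, wherein the headset comprises at least one external voice collector, and the method comprises:
    preprocessing a voice signal collected by the at least one external voice collector to obtain an external voice signal;
    extracting an ambient sound signal from the external voice signal; and
    performing mixing processing on a first voice signal and the ambient sound signal based on the amplitudes and phases of the first voice signal and the ambient sound signal and on the position of the at least one external voice collector, to obtain a target voice signal.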
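  2. The method according to claim 1, wherein the performing mixing processing on the first voice signal and the ambient sound signal comprises:
    adjusting at least one of the amplitude, phase, or output delay of the first voice signal; and/or
    adjusting at least one of the amplitude, phase, or output delay of the ambient sound signal; and
    fusing the adjusted first voice signal and the adjusted ambient sound signal into one voice signal.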
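  3. The method according to claim 1 or 2, wherein the extracting an ambient sound signal from the external voice signal comprises:
    performing coherence processing on the external voice signal and a sample voice signal to obtain the ambient sound signal.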
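  4. The method according to claim 1 or 2, wherein the at least one external voice collector comprises at least two external voice collectors, and the extracting an ambient sound signal from the external voice signal comprises:
    performing coherence processing on the external voice signals corresponding to the at least two external voice collectors to obtain the ambient sound signal, wherein the external voice signal corresponding to each external voice collector is obtained by preprocessing the voice signal collected by that external voice collector.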
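  5. The method according to any one of claims 1 to 4, wherein the headset further comprises an ear canal voice collector, and the method further comprises:
    preprocessing a voice signal collected by the ear canal voice collector to obtain the first voice signal;
    correspondingly, the performing mixing processing on the first voice signal and the ambient sound signal based on the amplitudes and phases of the first voice signal and the ambient sound signal and on the position of the at least one external voice collector comprises:
    performing mixing processing on the first voice signal and the ambient sound signal based on the amplitudes and phases of the first voice signal and the ambient sound signal and on the positions of the at least one external voice collector and the ear canal voice collector.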
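  6. The method according to claim 5, wherein the preprocessing a voice signal collected by the ear canal voice collector comprises:
    performing at least one of the following on the voice signal collected by the ear canal voice collector: amplitude adjustment, gain enhancement, echo cancellation, or noise suppression.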
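  7. The method according to claim 5 or 6, wherein the ear canal voice collector comprises at least one of an ear canal microphone or an ear bone pattern sensor.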
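  8. The method according to any one of claims 1 to 7, wherein the preprocessing a voice signal collected by the at least one external voice collector comprises:
    performing at least one of the following on the voice signal collected by the at least one external voice collector: amplitude adjustment, gain enhancement, echo cancellation, or noise suppression.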
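  9. The method according to any one of claims 1 to 8, wherein the method further comprises:
    performing at least one type of processing on the target voice signal and outputting the processed target voice signal, wherein the at least one type of processing comprises: noise suppression, equalization, packet loss compensation, automatic gain control, or dynamic range adjustment.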
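  10. The method according to any one of claims 1 to 9, wherein the at least one external voice collector comprises a call microphone or a noise reduction microphone.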
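  11. A voice signal processing apparatus, wherein the apparatus comprises at least one external voice collector and further comprises:
    a processing unit, configured to preprocess a voice signal collected by the at least one external voice collector to obtain an external voice signal;
    wherein the processing unit is further configured to extract an ambient sound signal from the external voice signal; and
    the processing unit is further configured to perform mixing processing on a first voice signal and the ambient sound signal based on the amplitudes and phases of the first voice signal and the ambient sound signal and on the position of the at least one external voice collector, to obtain a target voice signal.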
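  12. The apparatus according to claim 11, wherein the processing unit is specifically configured to:
    adjust at least one of the amplitude, phase, or output delay of the first voice signal; and/or
    adjust at least one of the amplitude, phase, or output delay of the ambient sound signal; and
    fuse the adjusted first voice signal and the adjusted ambient sound signal into one voice signal.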
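  13. The apparatus according to claim 11 or 12, wherein the processing unit is further specifically configured to:
    perform coherence processing on the external voice signal and a sample voice signal to obtain the ambient sound signal.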
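  14. The apparatus according to claim 11 or 12, wherein the at least one external voice collector comprises at least two external voice collectors, and the processing unit is further specifically configured to:
    perform coherence processing on the external voice signals corresponding to the at least two external voice collectors to obtain the ambient sound signal, wherein the external voice signal corresponding to each external voice collector is obtained by preprocessing the voice signal collected by that external voice collector.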
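  15. The apparatus according to any one of claims 11 to 14, wherein the apparatus further comprises an ear canal voice collector, and the processing unit is further configured to:
    preprocess a voice signal collected by the ear canal voice collector to obtain the first voice signal;
    correspondingly, the processing unit is further specifically configured to perform mixing processing on the first voice signal and the ambient sound signal based on the amplitudes and phases of the first voice signal and the ambient sound signal and on the positions of the at least one external voice collector and the ear canal voice collector.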
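  16. The apparatus according to claim 15, wherein the processing unit is further configured to:
    perform at least one of the following on the voice signal collected by the ear canal voice collector: amplitude adjustment, gain enhancement, echo cancellation, or noise suppression.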
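  17. The apparatus according to claim 15 or 16, wherein the ear canal voice collector comprises at least one of an ear canal microphone or an ear bone pattern sensor.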
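  18. The apparatus according to any one of claims 11 to 17, wherein the processing unit is further configured to:
    perform at least one of the following on the voice signal collected by the at least one external voice collector: amplitude adjustment, gain enhancement, echo cancellation, or noise suppression.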
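  19. The apparatus according to any one of claims 11 to 18, wherein the processing unit is further configured to:
    perform at least one type of processing on the target voice signal and output the processed target voice signal, wherein the at least one type of processing comprises: noise suppression, equalization, packet loss compensation, automatic gain control, or dynamic range adjustment.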
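  20. The apparatus according to any one of claims 11 to 19, wherein the at least one external voice collector comprises a call microphone or a noise reduction microphone.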
PCT/CN2020/127546 2019-12-25 2020-11-09 Voice signal processing method and apparatus WO2021129196A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US17/788,758 US20230024984A1 (en) 2019-12-25 2020-11-09 Speech signal processing method and apparatus
EP20907146.3A EP4021008B1 (en) 2019-12-25 2020-11-09 Voice signal processing method and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911359322.4A CN113038315A (zh) 2019-12-25 2019-12-25 Voice signal processing method and apparatus
CN201911359322.4 2019-12-25

Publications (1)

Publication Number Publication Date
WO2021129196A1 true WO2021129196A1 (zh) 2021-07-01

Family

ID=76459085

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/127546 WO2021129196A1 (zh) 2019-12-25 2020-11-09 Voice signal processing method and apparatus

Country Status (4)

Country Link
US (1) US20230024984A1 (zh)
EP (1) EP4021008B1 (zh)
CN (1) CN113038315A (zh)
WO (1) WO2021129196A1 (zh)

Also Published As

Publication number Publication date
EP4021008A1 (en) 2022-06-29
EP4021008A4 (en) 2022-10-26
CN113038315A (zh) 2021-06-25
EP4021008B1 (en) 2023-10-18
US20230024984A1 (en) 2023-01-26

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20907146

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020907146

Country of ref document: EP

Effective date: 20220321

NENP Non-entry into the national phase

Ref country code: DE