CN116343756A - Human voice transmission method, device, earphone, storage medium and program product - Google Patents

Human voice transmission method, device, earphone, storage medium and program product

Info

Publication number: CN116343756A
Application number: CN202111582502.6A
Authority: CN (China)
Prior art keywords: signal, voice, external audio, audio signal, human voice
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 李芳庆, 黄景昌, 关智博, 李培硕
Current assignee: Guangdong Oppo Mobile Telecommunications Corp Ltd (the listed assignee may be inaccurate; Google has not performed a legal analysis)
Original assignee: Guangdong Oppo Mobile Telecommunications Corp Ltd
Application filed 2021-12-22 by Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date: 2021-12-22 (the priority date is an assumption and is not a legal conclusion)
Publication date: 2023-06-27

Classifications

    • G10L 15/02: Feature extraction for speech recognition; selection of recognition unit
    • G10L 15/063: Training of speech recognition systems (creation of reference templates, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L 15/08: Speech classification or search
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 21/0232: Noise filtering for speech enhancement; processing in the frequency domain
    • G10L 21/0272: Voice signal separating
    • G10L 21/0316: Speech enhancement, e.g. noise reduction or echo cancellation, by changing the amplitude
    • H04R 1/1083: Earpieces and earphones; reduction of ambient noise
    • Y02D 30/70: Reducing energy consumption in wireless communication networks

Abstract

The embodiment of the application discloses a human voice transmission method, a device, an earphone, a storage medium and a program product, belonging to the technical field of audio processing. The method is used for an earphone and comprises the following steps: performing voice recognition on a collected external audio signal; separating the human voice signal from the external audio signal if the external audio signal is identified as containing a human voice signal; mixing the separated human voice signal with a noise reduction signal to obtain a mixed signal, wherein the noise reduction signal is used for active noise cancellation; and driving a speaker to produce sound based on the mixed signal. This scheme improves the human voice transmission effect of the earphone while reducing the power consumption of the earphone's human voice transmission system.

Description

Human voice transmission method, device, earphone, storage medium and program product
Technical Field
The embodiment of the application relates to the technical field of audio processing, in particular to a human voice transmission method, a device, an earphone, a storage medium and a program product.
Background
With the improvement of living standards, earphones have become an indispensable everyday item. In noisy environments such as airports, subways and restaurants, the noise reduction function of the earphone can eliminate interference from external noise to the greatest extent. However, in situations where the user needs to hear external voices, ambient announcements and the like, the earphone also needs a transparent transmission function that passes external sound signals through to the user, so that the user can hear external sounds without taking off the earphone.
In the related art, the transmission function of the earphone passes both the target sound source signal the user wants to hear and all other sound source signals through to the user, so the sound heard by the user mixes the target sound source with other sources, which degrades the transmission effect.
Disclosure of Invention
The embodiment of the application provides a human voice transmission method, a device, an earphone, a storage medium and a program product, wherein the technical scheme is as follows:
in one aspect, an embodiment of the present application provides a method for transmitting human voice, where the method is used for headphones, and the method includes:
carrying out voice recognition on the collected external audio signals;
separating the human voice signal from the external audio signal under the condition that the external audio signal is identified to contain the human voice signal;
mixing the voice signals obtained through separation and the noise reduction signals to obtain mixed voice signals, wherein the noise reduction signals are used for actively reducing noise;
and driving a loudspeaker to sound based on the mixed sound signal.
In another aspect, an embodiment of the present application provides a human voice transmission device, where the device is used for an earphone, and the device includes:
the voice recognition module is used for recognizing the voice of the collected external audio signals;
the separation module is used for separating the voice signal from the external audio signal under the condition that the voice signal is contained in the external audio signal;
the sound mixing module is used for carrying out sound mixing processing on the voice signal obtained by separation and the noise reduction signal to obtain a sound mixing signal, and the noise reduction signal is used for carrying out active noise reduction;
and the driving module is used for driving the loudspeaker to sound based on the sound mixing signal.
In another aspect, an embodiment of the present application provides an earphone, where the earphone includes a processor and a memory, where at least one instruction, at least one program, a code set, or an instruction set is stored in the memory, where the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement a human voice transmission method as described in the foregoing aspect.
In another aspect, embodiments of the present application provide a computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by a processor to implement a human voice transmission method as described in the above aspects.
In another aspect, embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor reads the computer instructions from the computer-readable storage medium and executes the computer instructions to perform the human voice transmission method provided in the above aspect.
The technical scheme provided by this application can include the following beneficial effects:
In the embodiment of the application, voice recognition is first performed on the collected external audio signal; when the external audio signal is identified as containing a human voice signal, the human voice signal is separated from the external audio signal, and the separated human voice signal is mixed with the noise reduction signal to generate one signal that drives the speaker to produce sound, thereby realizing the transmission function of the earphone. In the embodiment of the application, the earphone transmits only the human voice signal, so that the user can conveniently hear surrounding voices while enjoying the earphone's noise reduction, which improves the transmission effect of the earphone. In addition, the earphone first performs voice recognition on the external audio signal and performs human voice separation only when the external audio signal is identified as containing a human voice signal, which reduces the power consumption of the earphone.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
Fig. 1 shows a schematic diagram of a headset transmission principle provided in an exemplary embodiment of the present application;
FIG. 2 is a schematic diagram of a DSP-based headset transmission principle provided in one exemplary embodiment of the present application;
FIG. 3 is a schematic diagram of an NPU-based earphone transmission principle according to an exemplary embodiment of the present application;
FIG. 4 illustrates a flow chart of a human voice transmission method provided in an exemplary embodiment of the present application;
FIG. 5 illustrates a flow chart of a human voice transmission method provided in another exemplary embodiment of the present application;
FIG. 6 illustrates a schematic diagram of a VAD classifier training process provided in one exemplary embodiment of the present application;
FIG. 7 illustrates a flow chart of a method of separating human voice provided in an exemplary embodiment of the present application;
FIG. 8 is a schematic diagram of a process for obtaining a voice probability matrix using U-net according to an exemplary embodiment of the present application;
FIG. 9 illustrates a process schematic of human voice separation provided in one exemplary embodiment of the present application;
FIG. 10 is a block diagram illustrating the construction of a human voice transmission device according to an exemplary embodiment of the present application;
fig. 11 shows a block diagram of a headset according to an exemplary embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
References herein to "a plurality" mean two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
In the related art, headphones have not only an active noise cancellation (Active Noise Cancellation, ANC) mode but also a transparent transmission (hear-through, HT) mode. After the earphone turns on active noise cancellation, the user can enjoy a comfortable noise reduction experience in various noisy environments such as airports, subways and restaurants. However, in some cases the user needs to hear external voices or announcements; for example, on a subway, the user needs to hear the station announcements so as not to miss a stop, or needs to hear what a nearby person is saying. At this time, the earphone needs to turn on the transmission mode, that is, the target sound signal the user wants to hear is passed through to the ear. Fig. 1 shows a schematic diagram of the earphone transmission principle provided in an exemplary embodiment of the present application, including a Playback path 110, an ANC path 120, a transparent transmission path 130 and a mixing circuit 140. The Playback path 110 is used for playing music or downlink call data. The ANC path 120 is used for active noise cancellation; its principle is to generate an inverse sound wave equal in magnitude to the external noise signal, which neutralizes the noise and thereby achieves the noise reduction effect. Optionally, the ANC path 120 may be a feed-forward structure, a feedback structure, or a hybrid structure having both a feed-forward and a feedback structure. The transparent transmission path 130 is used for processing the external audio signals that need to be passed through. The mixing circuit 140 is configured to mix the music signal or downlink speech signal of the Playback path 110, the noise reduction signal of the ANC path 120 and the target sound signal of the transparent transmission path 130 into one signal, which drives the speaker of the earphone to produce sound.
In the related art, as shown in fig. 2, the transparent transmission path 130 is implemented based on a DSP (Digital Signal Processor). The earphone is provided with a DSP that applies EQ (Equalizer) processing to the external audio signal, that is, it gains or attenuates one or more frequency bands of the external audio signal to adjust the tone so that the sound heard by the human ear is more comfortable. However, this method does not distinguish between the components of the external audio signal; it passes all the human voice signals and environmental noise signals through to the ear, so the user hears noisy voices and the transparent transmission effect is poor. Optionally, the EQ may be implemented as an FIR (Finite Impulse Response) filter, an IIR (Infinite Impulse Response) filter or the like, which is not limited by the embodiments of the present application.
Illustratively, the EQ processing of the external audio signal in fig. 2 is implemented by an FIR filter or an IIR filter, where x_k(n) denotes the input signal, i.e. the external audio signal, y_k(n) denotes the output signal, i.e. the processed external audio signal, b_k0, b_k1, b_k2 denote the filter coefficients, a_k1, a_k2 denote the feedback coefficients, and z^(-1) denotes a one-sample delay in the Z-domain.
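As a rough illustration of one such second-order IIR (biquad) section, the sketch below assumes the standard difference equation y_k(n) = b_k0*x(n) + b_k1*x(n-1) + b_k2*x(n-2) - a_k1*y(n-1) - a_k2*y(n-2); the coefficient values are illustrative placeholders, not taken from the patent:

```python
import numpy as np

def biquad(x, b, a):
    """One second-order IIR (biquad) EQ section in direct form I.
    b = (b0, b1, b2) are the filter coefficients,
    a = (a1, a2) are the feedback coefficients (a0 normalized to 1)."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = (b[0] * x[n]
                + (b[1] * x[n - 1] if n >= 1 else 0.0)
                + (b[2] * x[n - 2] if n >= 2 else 0.0)
                - (a[0] * y[n - 1] if n >= 1 else 0.0)
                - (a[1] * y[n - 2] if n >= 2 else 0.0))
    return y

# Placeholder coefficients for a stable resonant section; a real EQ would
# derive them from the target tone-adjustment curve.
mic_frame = np.random.randn(480)                 # 10 ms frame at 48 kHz
eq_out = biquad(mic_frame, b=(1.02, -1.90, 0.89), a=(-1.90, 0.91))
```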
In the embodiment of the present application, in order to improve the transparent transmission effect, as shown in fig. 3, the transparent transmission path 130 is implemented based on an NPU (Neural Processing Unit). The NPU is arranged in the earphone and performs human voice separation on the external audio signal, so as to obtain clean human voice and improve the transparent transmission effect of the earphone. However, the capacity of the earphone battery is generally small (about 50 mAh), and separating the human voice signal from the external audio signal requires complex computation whose power overhead is relatively large. Therefore, in order to reduce the overall power consumption of the earphone's human voice transmission system, the earphone in the embodiment of the present application first performs voice recognition on the external audio signal, and performs human voice separation only when the external audio signal is identified as containing a human voice signal; this reduces the amount of human voice separation computation and thereby reduces the power consumption of the earphone. The human voice transmission method in the embodiment of the present application is described below.
Referring to fig. 4, a flowchart of a human voice transmission method according to an exemplary embodiment of the present application is shown, where the method includes:
step 410, voice recognition is performed on the collected external audio signals.
The earphone collects external audio signals in real time through the microphone and performs voice recognition on the external audio signals.
Alternatively, the earphone may be a wireless earphone, a wired earphone, or the like, which is not limited in the embodiments of the present application.
Alternatively, the external audio signal may include a human voice signal, an ambient noise signal, a music sound signal, and the like, which is not limited in the embodiment of the present application.
In one possible implementation, voice recognition in the earphone may be performed by an NPU, a DSP, an MCU (Micro Controller Unit), dedicated VAD (Voice Activity Detection) hardware or the like, which is not limited by the embodiment of the present application. The VAD detects the starting position of a voice signal and separates voice segments from non-voice segments, thereby achieving the purpose of voice recognition.
In step 420, in case that the external audio signal is identified to contain a human voice signal, the human voice signal is separated from the external audio signal.
In one possible embodiment, when the earphone identifies that the external audio signal contains a human voice signal, the human voice signal is separated from the external audio signal.
In another possible implementation manner, when the earphone recognizes that the external audio signal does not contain the voice signal, the voice separation is stopped for the external audio signal, so as to reduce the power consumption overhead of the earphone.
Optionally, an NPU is provided in the earphone, and the NPU performs the separation of human voice.
Alternatively, the voice signal may be a speaking voice signal sent by a person, a speaking voice signal in a broadcast, or the like, which is not limited in the embodiment of the present application.
Step 430, performing mixing processing on the separated voice signal and noise reduction signal to obtain a mixed signal, wherein the noise reduction signal is used for active noise reduction.
In one possible implementation, after the earphone turns on the noise reduction mode, the noise reduction component inside the earphone generates a noise reduction signal, which is an inverse sound wave with phase opposite to and amplitude equal to the external noise signal. The noise reduction signal neutralizes the external noise signal collected by the earphone microphone, cancelling it and thereby realizing the active noise cancellation function of the earphone. However, after the user puts on the earphone, in some scenarios the broadcast announcements or the voices of surrounding people still need to be heard, so the earphone also needs to separate the human voice signal from the external audio signal and pass it through to the ear without the earphone being removed. At this time two signals exist in the earphone: one is the separated human voice signal, and the other is the noise reduction signal. In order to convert these multiple signals into one, a mixing circuit is arranged in the earphone; the earphone feeds the separated human voice signal and the noise reduction signal to the mixing circuit, which converts the two signals into one mixed signal.
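As a rough sketch of what the mixing circuit computes, assuming simple weighted additive mixing with output clipping (the gain values and clipping range are illustrative assumptions):

```python
import numpy as np

def mix_paths(voice, anc, playback=None, gains=(1.0, 1.0, 1.0)):
    """Sum the separated-voice path, the ANC path and (optionally) the
    Playback path into the single signal that drives the speaker."""
    out = gains[0] * voice + gains[1] * anc
    if playback is not None:
        out = out + gains[2] * playback
    return np.clip(out, -1.0, 1.0)   # keep the driver signal in DAC range
```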
Step 440, driving the speaker to sound based on the mixed signal.
The speaker in the earphone converts the mixed electrical signal into an acoustic signal, thereby emitting sound. Therefore, in noisy environments such as airports and subways, a user wearing the earphone can, while enjoying noise reduction, hear nearby people speaking or the station announcements in a broadcast without removing the earphone. Compared with the related art, which passes all external sound signals through to the ear indiscriminately, the transmission effect is significantly improved.
To sum up, in the embodiment of the application, the earphone first performs voice recognition on the collected external audio signal; when the external audio signal is identified as containing a human voice signal, it separates the human voice signal from the external audio signal, mixes the separated human voice signal with the noise reduction signal, and generates one signal to drive the speaker to produce sound, thereby realizing the transmission function of the earphone. In the embodiment of the application, the earphone transmits only the human voice signal, so that the user can conveniently hear surrounding voices while enjoying the earphone's noise reduction, which improves the transmission effect of the earphone. In addition, the earphone first performs voice recognition on the external audio signal and performs human voice separation only when the external audio signal is identified as containing a human voice signal, which reduces the power consumption of the earphone.
In one possible implementation, the earphone first recognizes the human voice signal in the external audio signal through a low-power VAD classifier, and only when the external audio signal is identified as containing a human voice signal does it separate the human voice signal from the external audio signal through the higher-power human voice separation network; this reduces how often the separation network runs and further reduces the power consumption of the earphone. Referring to fig. 5, a flowchart of a human voice transmission method according to another exemplary embodiment of the present application is shown, where the method includes:
and 510, extracting the characteristics of the collected external audio signals to obtain audio characteristics.
In one possible implementation, the earphone performs feature extraction on the external audio signal to obtain an audio feature of the external audio signal.
Optionally, the audio features may be energy features, frequency-domain features, cepstral features, harmonic features, or long-term features, which is not limited by the embodiments of the present application.
Different audio features are extracted in different ways. Optionally, the energy features of the external audio signal are obtained from its signal strength; the frequency-domain features are obtained by applying an STFT (Short-Time Fourier Transform) to the external audio signal; the cepstral features are obtained by cepstral analysis of the external audio signal, or the MFCCs (Mel-Frequency Cepstral Coefficients) of the external audio signal are used as its cepstral features. Because human voice contains a fundamental frequency and multiple harmonics, and this harmonic structure persists even in strong noise, the fundamental frequency can be found with an autocorrelation method and the harmonic features of the external audio signal determined from it. Finally, human voice is a non-stationary signal while most everyday noise is stationary, so long-term features, which capture this non-stationarity, can also be extracted from the external audio signal.
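A minimal sketch of how some of these features could be computed for one audio frame, assuming NumPy/SciPy; the sample rate, frame length and pitch search range are illustrative assumptions:

```python
import numpy as np
from scipy.signal import stft

def frame_features(x, sr=16000):
    """Energy, frequency-domain and harmonic features for one frame.
    Assumes the frame is longer than the largest pitch period searched."""
    # Energy feature: short-time signal strength.
    energy = float(np.mean(x ** 2))

    # Frequency-domain feature: STFT magnitude.
    _, _, Z = stft(x, fs=sr, nperseg=256)
    spectrum = np.abs(Z)

    # Harmonic feature: fundamental frequency via autocorrelation,
    # searching a 60-400 Hz pitch range.
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    lo, hi = sr // 400, sr // 60
    f0 = sr / (lo + int(np.argmax(ac[lo:hi])))

    # Cepstral features (e.g. MFCCs) could be appended here, for instance
    # with librosa.feature.mfcc(y=x, sr=sr, n_mfcc=13).
    return energy, spectrum, f0
```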
Step 520, classifying the audio features by the VAD classifier to obtain a classification result, where the classification result is used to characterize the signal type of the signal contained in the external audio signal.
In one possible implementation, the earphone is provided with VAD hardware; the earphone inputs the extracted audio features of the external audio signal into the VAD hardware, and the trained VAD classifier classifies the features to obtain the classification result, which may be one of a human voice signal, an environmental noise signal and a music sound signal. The VAD classifier can therefore distinguish the speaker voice signals and the music sound signals contained in the external audio signal, which improves the transparent transmission effect of the earphone.
In one possible implementation, the VAD classifier is trained based on a sample audio signal containing a sample signal type tag.
Optionally, the sample audio signal may be a mixture of at least two of a sample speaker voice signal, a sample environmental noise signal and a sample music sound signal. For example, the sample audio signal may be a mixture of a sample speaker voice signal and a sample environmental noise signal, a mixture of a sample speaker voice signal and a sample music sound signal, or a combination of all three.
For the training process of the VAD classifier, an example is shown in fig. 6, where the sample audio signal is a mixture of a sample speaker voice signal, a sample environmental noise signal and a sample music sound signal. First, the different signal types in the sample audio signal are labeled, automatically or manually; for example, the sample speaker voice signal is labeled 0 and the sample music sound signal is labeled 1. Audio features of the sample audio signal (which may be energy features, frequency-domain features, cepstral features, harmonic features and the like) are then extracted and input into a network model for training, and the trained network model can distinguish speaker voice signals from music sound signals.
The audio features of the sample audio signal are related to the network model; different network models correspond to different audio features. Optionally, the network model may be a GMM (Gaussian Mixture Model), an SVM (Support Vector Machine) model, a DNN (Deep Neural Network) model or the like, which is not limited in this application.
In the model test stage, the audio features of sample external audio signals are extracted and input into the trained network model to obtain a classification result, and the classification result is post-processed to judge whether it corresponds to a speaker voice signal. The purpose of the post-processing is to determine more reliably whether the classification result is a speaker voice signal; the post-processing may remove residual echo and background noise, or may be clarity control, which is not limited in this embodiment of the present application.
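A minimal sketch of such a training-plus-test pipeline with an SVM as the network model; the feature files, the label convention (0 for speaker voice, 1 for music) and the majority-vote smoothing, used here as a simple stand-in for the post-processing step, are all illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical per-frame feature matrix and labels (0 = speaker voice,
# 1 = music); the .npy file names are assumptions for illustration.
feats = np.load("sample_features.npy")    # shape (n_frames, n_features)
labels = np.load("sample_labels.npy")     # shape (n_frames,)

clf = SVC(kernel="rbf")                   # SVM variant of the classifier
clf.fit(feats, labels)

def smooth(pred, win=5):
    """Majority vote over a sliding window so one misclassified frame
    does not toggle the pass-through path on and off."""
    pad = np.pad(pred, (win // 2, win // 2), mode="edge")
    return np.array([np.bincount(pad[i:i + win]).argmax()
                     for i in range(len(pred))])

test_pred = smooth(clf.predict(feats))    # post-processed classification
```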
In step 530, in case the classification result indicates that the external audio signal contains a voice signal, the voice signal is separated from the external audio signal through the voice separation network.
In a possible embodiment, in case the classification result indicates that the external audio signal contains a speaker sound signal, the speaker sound signal is separated from the external audio signal by a voice separation network.
In the embodiment of the application, in order to reduce the power consumption of the earphone and prolong the standby time of the earphone, a lightweight voice separation network is selected.
Alternatively, the separation network may be U-net, etc., which is not limited by the embodiments of the present application.
Step 540, performing mixing processing on the separated voice signal and noise reduction signal to obtain a mixed signal, wherein the noise reduction signal is used for active noise reduction.
In this step, please refer to step 430, and the embodiment of the present application will not be repeated.
Step 550, driving the speaker to sound based on the mixed signal.
In this step, please refer to step 440, and the embodiment of the present application will not be repeated.
In the embodiment of the application, the earphone separates the human voice signal from the external audio signal through the high-power human voice separation network only when the low-power VAD classifier identifies that the external audio signal contains a human voice signal; this avoids the increased power consumption that would result from running the human voice separation network on the external audio signal directly.
In view of the small battery capacity of the earphone, in order to separate the human voice signal from the external audio signal quickly and with low power consumption, the embodiment of the application selects a lightweight human voice separation network, such as U-net, for human voice separation. Referring to fig. 7, a flowchart of a method for separating human voice according to an exemplary embodiment of the present application is shown, where the method includes:
step 710, performing time-frequency conversion on the external audio signal to obtain the amplitude spectrum and the phase spectrum of the external audio signal.
In one possible implementation, an NPU is provided in the headset, the NPU being configured to perform human voice separation. After the microphone of the earphone collects the external audio signal, the external audio signal is stored in a buffer (buffer), when the earphone needs to perform human voice separation on the external audio signal through the NPU, the NPU reads the external audio signal from the buffer, and performs time-frequency conversion on the external audio signal through the STFT.
The earphone collects an external audio signal x through the microphone and performs time-frequency conversion on it through the STFT to obtain the magnitude spectrum X and the phase spectrum Y of the external audio signal:

X = abs(STFT(x))

Y = angle(STFT(x))
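These two formulas map directly onto a standard STFT routine; a minimal sketch assuming SciPy, with illustrative frame parameters:

```python
import numpy as np
from scipy.signal import stft

x = np.random.randn(16000)        # stand-in for one second of microphone audio
_, _, Zxx = stft(x, fs=16000, nperseg=512, noverlap=384)
X = np.abs(Zxx)                   # magnitude spectrum: X = abs(STFT(x))
Y = np.angle(Zxx)                 # phase spectrum:     Y = angle(STFT(x))
```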
Step 720, performing voice probability prediction on the magnitude spectrum through the human voice separation network to obtain a human voice probability matrix.
Further, the earphone inputs the amplitude spectrum of the external audio signal into the voice separation network to obtain a voice probability matrix. The human voice separation network is a trained human voice separation network.
Illustratively, the training process of the separation network is as follows. First, the sample human voice signal s_i and the sample mixed sound signal x_i are time-frequency transformed through the STFT to obtain their respective magnitude spectra:

X_i = abs(STFT(x_i))

S_i = abs(STFT(s_i))

where X_i denotes the magnitude spectrum of the sample mixed sound signal and S_i denotes the magnitude spectrum of the sample human voice signal. The sample human voice signal is a clean human voice signal, and the sample mixed sound signal is a human voice signal mixed with environmental sound, which may be environmental noise, music sound and the like; this is not limited in the embodiment of the present application.

Further, the magnitude spectrum X_i of the sample mixed sound signal is input into the human voice separation network Net_i to obtain a feature map m_i of the voice probability:

m_i = Net_i(X_i)

Further, the voice probability feature map m_i is multiplied element-wise by the magnitude spectrum X_i of the sample mixed sound signal to extract the estimated human voice magnitude spectrum Ŝ_i:

Ŝ_i = m_i · X_i

The loss function L of the training process measures the deviation of the estimated magnitude spectrum Ŝ_i from the clean magnitude spectrum S_i, for example as the squared error:

L = ||Ŝ_i - S_i||^2
in one possible embodiment, the human voice separation network Net is obtained by performing a gradient back propagation algorithm on the loss function L i Weight update value of (2), and the human-voice separation network Net is subjected to the weight update value i The weight of the model is updated so that the loss function L gradually converges to the voice separation network Net i And then the trained voice separation network Net is obtained.
Considering that the earphone should accurately separate the human voice signal from the external audio signal to improve the human voice transmission effect, a separation accuracy index is used during training to guide the optimization of the network parameters, i.e. the weights, of the human voice separation network. In addition, according to the psychoacoustic model of human hearing, when the human voice transmission delay exceeds 20 ms the ear can perceive the delay, which affects the user experience. Therefore, during training a separation speed index is also used to guide the optimization of the network architecture of the human voice separation network, requiring the model to compute quickly.
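A rough way to check a candidate architecture against the 20 ms budget behind the separation speed index is to time per-chunk inference; the input sizes below are illustrative, and host-CPU timings will of course differ from timings on the earphone's NPU:

```python
import time
import torch

def chunk_latency_ms(net, freq_bins=256, frames=16, runs=100):
    """Average wall-clock inference time for one spectrogram chunk."""
    x = torch.randn(1, 1, freq_bins, frames)   # dims assumed divisible by 4
    with torch.no_grad():
        net(x)                                 # warm-up run
        t0 = time.perf_counter()
        for _ in range(runs):
            net(x)
    return (time.perf_counter() - t0) / runs * 1e3
```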
Alternatively, the human voice separation network may be a convolutional neural network, a cyclic neural network, a combination of the convolutional neural network and the cyclic neural network, or the like, which is not limited in the embodiments of the present application.
The trained voice separation network model is preset in an NPU of the earphone, and the earphone performs voice separation on external audio signals through the NPU.
The amplitude spectrum X of the external audio signal is input into a trained voice separation network Net to obtain a voice probability matrix m, which is:
m=Net(X)
wherein Net is the trained human voice separation network.
In one possible implementation, in order to reduce the power consumption of the earphone's human voice transmission system and increase the computation speed of the human voice separation network so as to avoid perceptible delay, the human voice separation network adopts U-net, and voice probability prediction is performed on the magnitude spectrum X through the U-net to obtain the human voice probability matrix.
First, feature extraction is performed on the magnitude spectrum through the n feature extraction layers of the U-net to obtain the downsampled feature map output by each feature extraction layer.

Here n is selected according to the separation speed index.

Second, feature fusion is performed on the upsampled feature maps and the downsampled feature maps through the n feature fusion layers of the U-net to obtain the target feature map, where an upsampled feature map is obtained by upsampling a downsampled feature map within a feature fusion layer.

Optionally, the fusion method may be concatenation or element-wise addition (stacking), which is not limited in the embodiment of the present application.

Finally, activation processing is performed on the target feature map to obtain the human voice probability matrix.

The target feature map is processed through an activation function to obtain the human voice probability matrix, whose values lie between 0 and 1.

Optionally, the activation function may be a sigmoid function, a tanh function or the like, which is not limited by the embodiments of the present application.
Illustratively, as shown in fig. 8, a schematic diagram of a process of obtaining a voice probability matrix using U-net according to an exemplary embodiment of the present application is shown.
The magnitude spectrum of the external audio signal is input into the U-net. After the first convolution layer, the feature map passes through three successive downsampling convolution layers, each using 2×2 max pooling to downsample the feature map in both the time and frequency dimensions, until the bottleneck layer is reached. The feature map is then upsampled through the same number of upsampling convolution layers, each using 3×3 convolutions with ReLU activations. In order to recover the detail information lost during downsampling, the feature map of each downsampling layer is skip-connected to the feature map of the corresponding upsampling layer; that is, the upsampled and downsampled feature maps are fused together by concatenation or addition. Finally, a 1×1 convolution with a ReLU function in the output layer produces the magnitude-spectrum mask of the output audio, i.e. the target feature map, and the target feature map is activated with a sigmoid function to obtain the human voice probability matrix.
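A heavily reduced sketch of a U-net of this shape, assuming PyTorch; the depth, channel counts and the requirement that the spectrogram dimensions be divisible by 4 are illustrative assumptions rather than the patent's configuration:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """3x3 conv + ReLU blocks, 2x2 max-pool downsampling, skip connections
    by concatenation, and a 1x1 output conv followed by a sigmoid."""
    def __init__(self, ch=16):
        super().__init__()
        def block(cin, cout):
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True))
        self.enc1 = block(1, ch)
        self.enc2 = block(ch, ch * 2)
        self.bottleneck = block(ch * 2, ch * 4)
        self.pool = nn.MaxPool2d(2)                   # 2x2 max pooling
        self.up2 = nn.ConvTranspose2d(ch * 4, ch * 2, 2, stride=2)
        self.dec2 = block(ch * 4, ch * 2)             # concat doubles channels
        self.up1 = nn.ConvTranspose2d(ch * 2, ch, 2, stride=2)
        self.dec1 = block(ch * 2, ch)
        self.out = nn.Conv2d(ch, 1, 1)                # 1x1 conv -> mask logits

    def forward(self, x):                             # x: (B, 1, freq, time)
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return torch.sigmoid(self.out(d1))            # probabilities in [0, 1]
```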
Step 730, generating a human voice amplitude spectrum of the human voice signal based on the human voice probability matrix and the amplitude spectrum.
Illustratively, the human voice probability matrix m is multiplied element-wise by the magnitude spectrum X of the external audio signal x to obtain the human voice magnitude spectrum Ŝ:

Ŝ = m · X
Step 740, performing inverse time-frequency transformation based on the human voice amplitude spectrum and the phase spectrum to obtain a human voice signal.
Illustratively, the human voice magnitude spectrum and the phase spectrum of the external audio signal are combined into the complex spectrum Ŝ_c of the output human voice signal:

Ŝ_c = Ŝ · e^(jY)

The complex spectrum is then subjected to the inverse time-frequency transform through the ISTFT (Inverse Short-Time Fourier Transform) to obtain the human voice signal s:

s = ISTFT(Ŝ_c)
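Continuing the STFT sketch above with the same assumed frame parameters (m is the voice probability matrix produced by the separation network), the masking and reconstruction steps become:

```python
import numpy as np
from scipy.signal import istft

S_hat = m * X                          # human voice magnitude spectrum
S_c = S_hat * np.exp(1j * Y)           # complex spectrum with the mixture phase
_, s = istft(S_c, fs=16000, nperseg=512, noverlap=384)   # time-domain voice
```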
Illustratively, fig. 9 shows a schematic diagram of the human voice separation process provided in an exemplary embodiment of the present application. The earphone is provided with VAD hardware 910, an NPU 920 and a buffer 930. The earphone stores the external audio signal collected through the microphone into the buffer 930 and feeds it to the VAD hardware 910, where the VAD classifier performs voice recognition on the external audio signal. When the external audio signal is identified as containing a human voice signal, the NPU 920 is triggered; the NPU 920 then reads the external audio signal from the buffer 930 and performs human voice separation to obtain the speaker's voice signal.
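Putting the pieces of fig. 9 together, the power-saving logic amounts to running the cheap classifier on every frame and the expensive separator only on voiced frames; a minimal sketch in which all four callables are hypothetical stand-ins:

```python
def passthrough_loop(mic_frames, vad, separator, mixer):
    """Two-stage gating: low-power VAD on every frame, high-power NPU
    separation only when a human voice is detected."""
    for frame in mic_frames:           # frames read back from the buffer
        if vad(frame):                 # low-power classification
            voice = separator(frame)   # NPU human voice separation
        else:
            voice = None               # skip separation to save power
        yield mixer(voice)             # mix with the ANC signal and play
```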
In the embodiment of the application, human voice separation is performed through a lightweight human voice separation network, such as U-net, which reduces the power consumption of the earphone while maintaining the human voice separation speed.
Referring to fig. 10, there is shown a block diagram of a human voice transmission device according to an exemplary embodiment of the present application, the device being used for headphones, the device comprising:
the voice recognition module 1001 is configured to perform voice recognition on the collected external audio signal;
a separation module 1002, configured to, when it is identified that the external audio signal includes a voice signal, separate the voice signal from the external audio signal;
the mixing module 1003 is configured to perform mixing processing on the separated human voice signal and noise reduction signal to obtain a mixed signal, where the noise reduction signal is used for performing active noise reduction;
and a driving module 1004, configured to drive the speaker to sound based on the audio mixing signal.
Optionally, the voice recognition module 1001 is configured to:
extracting the characteristics of the acquired external audio signals to obtain audio characteristics;
classifying the audio features through a VAD classifier to obtain a classification result, wherein the classification result is used for representing the signal type of the signals contained in the external audio signals;
the separation module 1002 includes:
a separation unit, configured to separate the voice signal from the external audio signal through a voice separation network when the classification result indicates that the external audio signal includes the voice signal;
the power consumption of classification by the VAD classifier is lower than that of voice separation by the voice separation network.
Optionally, the signal type includes at least one of a speaker sound signal, an ambient noise signal, and a musical speaker sound signal;
the separation unit is used for:
and in the case that the classification result indicates that the external audio signal contains the voice signal, separating the voice signal from the external audio signal through the voice separation network.
Optionally, the VAD classifier is trained based on a sample audio signal comprising a sample signal type tag, the sample audio signal being derived from a mixture of at least two of a sample speaker sound signal, a sample ambient noise signal, and a sample music speaker sound signal.
Optionally, the separation unit is configured to:
performing time-frequency conversion on the external audio signal to obtain an amplitude spectrum and a phase spectrum of the external audio signal;
carrying out voice probability prediction on the amplitude spectrum through the voice separation network to obtain a voice probability matrix;
generating a human voice amplitude spectrum of the human voice signal based on the human voice probability matrix and the amplitude spectrum;
and performing reverse time-frequency conversion based on the human voice amplitude spectrum and the phase spectrum to obtain the human voice signal.
Optionally, the voice separation network adopts U-net;
optionally, the separation unit is further configured to:
performing feature extraction on the amplitude spectrum through n layers of feature extraction layers of the voice separation network to obtain a downsampled feature map output by each feature extraction layer;
performing feature fusion on the up-sampling feature map and the down-sampling feature map through an n-layer feature fusion layer of the voice separation network to obtain a target feature map, wherein the up-sampling feature map is obtained by performing up-sampling on the down-sampling feature map through the feature fusion layer;
and activating the target feature map to obtain the voice probability matrix.
Optionally, the training indexes adopted in the human voice separation network training process include a separation accuracy index and a separation speed index, wherein the separation accuracy index is used for guiding and optimizing network parameters of the human voice separation network, and the separation speed index is used for guiding and optimizing a network architecture of the human voice separation network.
Optionally, the apparatus further comprises:
and the stopping module is used for stopping the voice separation of the external audio signals under the condition that the external audio signals are identified to not contain the voice signals.
Optionally, the earphone is provided with VAD hardware and an NPU, where voice recognition is performed by the VAD hardware and human voice separation by the NPU; or the earphone is provided with an NPU, where both voice recognition and human voice separation are performed by the NPU.
To sum up, in the embodiment of the application, the earphone first performs voice recognition on the collected external audio signal; when the external audio signal is identified as containing a human voice signal, it separates the human voice signal from the external audio signal, mixes the separated human voice signal with the noise reduction signal, and generates one signal to drive the speaker to produce sound, thereby realizing the transmission function of the earphone. In the embodiment of the application, the earphone transmits only the human voice signal, so that the user can conveniently hear surrounding voices while enjoying the earphone's noise reduction, which improves the transmission effect of the earphone. In addition, the earphone first performs voice recognition on the external audio signal and performs human voice separation only when the external audio signal is identified as containing a human voice signal, which reduces the power consumption of the earphone.
It should be noted that: the apparatus provided in the above embodiment is only exemplified by the division of the above functional modules, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the apparatus is divided into different functional modules, so as to perform all or part of the functions described above. In addition, the apparatus and the method embodiments provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the apparatus and the method embodiments are detailed in the method embodiments and are not repeated herein.
Referring to fig. 11, a block diagram illustrating a structure of an earphone 1100 according to an exemplary embodiment of the present application is shown. Headphones in the present application may include one or more of the following components: processor 1110, memory 1120, microphone 1130, speaker 1140, VAD hardware 1150, wherein processor 1110 is electrically connected to memory 1120, microphone 1130, speaker 1140, VAD hardware 1150, respectively.
The processor 1110 may include an NPU for human voice separation and may also include an MCU for implementing other functions of the earphone. Using various interfaces and lines to connect the various parts of the entire earphone 1100, the processor 1110 performs the various functions of the earphone 1100 and processes data by running or executing the instructions, programs, code sets or instruction sets stored in the memory 1120 and by invoking the data stored in the memory 1120.
The Memory 1120 may include a random access Memory (Random Access Memory, RAM) or a Read-Only Memory (ROM). Optionally, the memory 1120 includes a non-transitory computer readable medium (non-transitory computer-readable storage medium). Memory 1120 may be used to store instructions, programs, code, sets of codes, or instruction sets. The memory 1120 may include a memory program area for storing instructions (such as a sound playing function) for implementing at least one function, etc., and a memory data area for storing data such as an external audio signal collected by the microphone 1130.
Microphone 1130 is a transducer device for converting acoustic signals into electrical signals for capturing external audio signals.
The speaker 1140 is a transducer device for converting an electrical signal into an acoustic signal for playing out a separated human voice signal.
The VAD hardware 1150 is used to identify the human voice signals contained in the external audio signals.
In addition, those skilled in the art will appreciate that the structures shown in the above figures do not constitute limitations on the headset 1100, and that the headset 1100 may include more or fewer components than shown, or may combine certain components or a different arrangement of components. For example, the earphone 1100 further includes a sensor, an audio circuit, a control circuit, a power source, etc., which are not described herein.
Embodiments of the present application also provide a computer readable storage medium storing at least one program code loaded and executed by a processor to implement the human voice transmission method provided in the above embodiments.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the headset reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the headset to perform the human voice transmission method provided in various alternative implementations of the above aspects.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the examples disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (13)

1. A method of human voice transmission, the method being for headphones, the method comprising:
carrying out voice recognition on the collected external audio signals;
separating the human voice signal from the external audio signal under the condition that the external audio signal is identified to contain the human voice signal;
mixing the voice signals obtained through separation and the noise reduction signals to obtain mixed voice signals, wherein the noise reduction signals are used for actively reducing noise;
and driving a loudspeaker to sound based on the mixed sound signal.
2. The method of claim 1, wherein the voice recognition of the collected external audio signal comprises:
extracting the characteristics of the acquired external audio signals to obtain audio characteristics;
classifying the audio features through a VAD classifier to obtain a classification result, wherein the classification result is used for representing the signal type of the signals contained in the external audio signals;
the separating the human voice signal from the external audio signal when the external audio signal is identified to contain the human voice signal comprises:
separating the voice signal from the external audio signal through a voice separation network under the condition that the classification result indicates that the external audio signal contains the voice signal;
the power consumption of classification by the VAD classifier is lower than that of voice separation by the voice separation network.
3. The method of claim 2, wherein the signal types include at least one of a speaker sound signal, an ambient noise signal, and a musical sound signal;
and when the classification result indicates that the external audio signal contains the voice signal, separating the voice signal from the external audio signal through a voice separation network, including:
and in the case that the classification result indicates that the external audio signal contains the voice signal, separating the voice signal from the external audio signal through the voice separation network.
4. The method of claim 3, wherein the VAD classifier is trained based on a sample audio signal comprising a sample signal type tag, the sample audio signal being derived from a mixture of at least two of a sample speaker sound signal, a sample ambient noise signal, and a sample music sound signal.
5. The method of claim 2, wherein said separating the human voice signal from the external audio signal via a human voice separation network comprises:
performing time-frequency conversion on the external audio signal to obtain an amplitude spectrum and a phase spectrum of the external audio signal;
carrying out voice probability prediction on the amplitude spectrum through the voice separation network to obtain a voice probability matrix;
generating a human voice amplitude spectrum of the human voice signal based on the human voice probability matrix and the amplitude spectrum;
and performing reverse time-frequency conversion based on the human voice amplitude spectrum and the phase spectrum to obtain the human voice signal.
6. The method of claim 5, wherein the human voice separation network adopts a U-Net architecture;
and wherein performing human voice probability prediction on the amplitude spectrum through the human voice separation network to obtain the human voice probability matrix comprises:
performing feature extraction on the amplitude spectrum through n feature extraction layers of the human voice separation network to obtain a downsampled feature map output by each feature extraction layer;
performing feature fusion on upsampled feature maps and the downsampled feature maps through n feature fusion layers of the human voice separation network to obtain a target feature map, wherein each upsampled feature map is obtained by upsampling through the corresponding feature fusion layer;
and activating the target feature map to obtain the human voice probability matrix.
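A toy PyTorch version of such a U-Net with n = 2 is sketched below; the layer count, channel widths, and kernel sizes are all illustrative choices, not taken from the patent.

    import torch
    import torch.nn as nn

    class MaskUNet(nn.Module):
        # Each encoder level downsamples; each decoder level upsamples and
        # fuses the matching encoder feature map via a skip connection; a
        # sigmoid activation yields the human voice probability matrix.
        def __init__(self):
            super().__init__()
            self.enc1 = nn.Sequential(nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU())
            self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
            self.dec2 = nn.Sequential(nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU())
            self.dec1 = nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1)

        def forward(self, amp):                 # amp: (batch, 1, freq, time)
            d1 = self.enc1(amp)                 # downsampled feature map, level 1
            d2 = self.enc2(d1)                  # downsampled feature map, level 2
            u2 = self.dec2(d2)                  # upsampled feature map, level 2
            fused = torch.cat([u2, d1], dim=1)  # feature fusion via skip connection
            return torch.sigmoid(self.dec1(fused))  # probability matrix in (0, 1)

    # Example: a probability matrix for a 256 x 64 amplitude spectrogram.
    # mask = MaskUNet()(torch.randn(1, 1, 256, 64))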
7. The method of claim 5, wherein the training metrics employed in training the human voice separation network include a separation accuracy metric for guiding optimization of the network parameters of the human voice separation network and a separation speed metric for guiding optimization of the network architecture of the human voice separation network.
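The two metrics play different roles, which a sketch makes concrete: an accuracy loss updates the weights by gradient descent, while a measured inference time informs choices about the architecture itself. Both forms below are assumptions; the claim names neither a loss function nor a timing procedure.

    import time
    import torch

    def separation_accuracy_loss(pred_prob, mix_amp, voice_amp):
        # Assumed accuracy metric: squared error between the masked mixture
        # amplitude and the clean human voice amplitude; drives weight updates.
        return torch.mean((pred_prob * mix_amp - voice_amp) ** 2)

    def separation_speed(model, example, runs=20):
        # Assumed speed metric: mean inference time per frame, used to compare
        # candidate architectures rather than to update weights.
        with torch.no_grad():
            start = time.perf_counter()
            for _ in range(runs):
                model(example)
        return (time.perf_counter() - start) / runs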
8. The method according to any one of claims 1 to 7, further comprising:
stopping human voice separation of the external audio signal when the external audio signal is identified as not containing a human voice signal.
9. The method according to any one of claims 1 to 7, wherein:
the earphone is provided with VAD hardware and an NPU, the human voice recognition being performed by the VAD hardware and the human voice separation being performed by the NPU; or
the earphone is provided with an NPU, both the human voice recognition and the human voice separation being performed by the NPU.
10. A human voice transmission device for an earphone, the device comprising:
a voice recognition module, configured to perform human voice recognition on a collected external audio signal;
a separation module, configured to separate a human voice signal from the external audio signal when the external audio signal is identified as containing the human voice signal;
a mixing module, configured to mix the separated human voice signal with a noise reduction signal to obtain a mixed signal, the noise reduction signal being used for active noise reduction;
and a driving module, configured to drive a loudspeaker to produce sound based on the mixed signal.
11. An earphone comprising a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set that is loaded and executed by the processor to implement the human voice transmission method of any one of claims 1 to 9.
12. A computer-readable storage medium having stored therein at least one instruction, at least one program, a code set, or an instruction set that is loaded and executed by a processor to implement the human voice transmission method of any one of claims 1 to 9.
13. A computer program product comprising computer instructions which, when executed by a processor, implement the human voice transmission method of any one of claims 1 to 9.
Priority Applications (1)

Application Number: CN202111582502.6A
Priority Date / Filing Date: 2021-12-22
Title: Human voice transmission method, device, earphone, storage medium and program product

Publications (1)

Publication Number: CN116343756A
Publication Date: 2023-06-27

Family ID: 86889917

Country Status (1)

CN (1) CN116343756A (en), legal status Pending

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116980798A (en) * 2023-09-20 2023-10-31 彼赛芬科技(深圳)有限公司 Permeation mode adjusting device of wireless earphone and wireless earphone
CN117412216A (en) * 2023-12-12 2024-01-16 深圳市雅乐电子有限公司 Earphone, control method and control device thereof

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination