WO2022253003A1 - Speech enhancement method and related device - Google Patents

Speech enhancement method and related device Download PDF

Info

Publication number
WO2022253003A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal
noise
target user
speech
voice
Prior art date
Application number
PCT/CN2022/093969
Other languages
French (fr)
Chinese (zh)
Inventor
魏善义
吴超
邱炎
廖猛
范泛
彭世强
李斌
赵文斌
李江
李海婷
黄雪妍
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202111323211.5A (external priority, CN115482830B)
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority to CN202280038999.1A (CN117480554A)
Publication of WO2022253003A1
Priority to US18/522,743 (US20240096343A1)

Links

Images

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 — Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 — Noise filtering
    • G10L21/0216 — Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 — Processing in the frequency domain
    • G10L21/0272 — Voice signal separating
    • G10L21/0308 — Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques

Definitions

  • the present application relates to the field of speech processing, in particular to a speech enhancement method and related equipment.
  • in general noise reduction, one approach estimates the background noise from the signal collected over a period of time, exploiting the difference in spectral characteristics between the background noise signal and speech/music signals, and then suppresses environmental noise according to the estimated noise characteristics; this approach works well for stationary noise but fails completely for speech interference.
  • another approach exploits differences in inter-channel correlation, such as multi-channel noise suppression or microphone-array beamforming.
  • such techniques can suppress interfering speech from a given direction to some extent, but they often cannot track changes in the direction of the interference source well enough, and they cannot enhance the speech of a specific target person.
  • the embodiment of the present application provides a speech enhancement method, including: after a terminal device enters a personalized noise reduction (PNR) mode, acquiring a noisy speech signal and target-speech-related data, where the noisy speech signal contains an interference noise signal and the speech signal of the target user, and the target-speech-related data indicates the speech characteristics of the target user; and denoising the first noisy speech signal through a trained speech noise reduction model according to the target-speech-related data, to obtain the noise-reduced speech signal of the target user; the speech noise reduction model is implemented based on a neural network.
  • PNR personalized noise reduction
  • the noise-reduced speech signal of the target user is enhanced to obtain the enhanced speech signal of the target user, where the ratio of the amplitude of the target user's enhanced speech signal to the amplitude of the target user's noise-reduced speech signal is the speech enhancement coefficient of the target user.
  • the target user's voice signal can thus be further enhanced, highlighting the target user's voice while suppressing non-target users' voices, and improving the user's experience in voice calls and voice interactions.
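As an illustration of the amplitude relationship above, here is a minimal sketch of applying the speech enhancement coefficient (the function name and list-of-samples representation are ours, not from the application; the real signal comes out of the neural denoising model):

```python
def enhance(denoised, gain):
    """Scale the target user's noise-reduced signal so that the ratio of
    enhanced amplitude to noise-reduced amplitude equals the speech
    enhancement coefficient `gain`."""
    return [gain * s for s in denoised]

# a coefficient of 2.0 doubles every sample of the noise-reduced signal
enhanced = enhance([0.5, -0.25, 0.125], 2.0)  # → [1.0, -0.5, 0.25]
```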
  • the interference noise signal is suppressed based on the interference noise suppression coefficient to obtain an interference noise suppression signal, where the ratio of the amplitude of the interference noise suppression signal to the amplitude of the interference noise signal is the interference noise suppression coefficient;
  • the noise-suppressed signal is fused with the target user's enhanced speech signal to obtain an output signal.
  • the value range of the interference noise suppression coefficient is (0,1).
  • the voice of the non-target user is further suppressed, and the voice of the target user is highlighted indirectly.
  • the interference noise signal is suppressed based on the interference noise suppression coefficient to obtain an interference noise suppression signal, where the ratio of the amplitude of the interference noise suppression signal to the amplitude of the interference noise signal is the interference noise suppression coefficient;
  • the noise suppression signal is fused with the target user's noise reduction speech signal to obtain an output signal.
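The suppress-then-fuse variants above share the same arithmetic. A hedged sketch, assuming sample-wise additive fusion (the function names and the additive choice are our illustrative assumptions, not stated in the application):

```python
def fuse_output(speech, interference, gain=1.0, alpha=0.5):
    """Attenuate the interference noise signal by the suppression
    coefficient alpha (valid range (0, 1)), optionally enhance the
    target speech by `gain`, and fuse the two into the output signal."""
    assert 0.0 < alpha < 1.0, "suppression coefficient must lie in (0, 1)"
    return [gain * s + alpha * n for s, n in zip(speech, interference)]

out = fuse_output([1.0, -0.5], [0.5, 0.25], gain=2.0, alpha=0.5)
# → [2.25, -0.875]
```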
  • the target users include M target users
  • the target voice-related data includes the voice-related data of the M target users
  • the noise-reduced voice signals of the target users include the noise-reduced voice signals of the M target users
  • the enhancement coefficients include the speech enhancement coefficients of the M target users, where M is an integer greater than 1,
  • the first noisy speech signal is denoised by the speech noise reduction model according to the voice-related data of target user A, to obtain the noise-reduced speech signal of target user A; each of the M target users is processed in this way, so that the noise-reduced voice signals of the M target users are obtained;
  • the voice signals of multiple target users can be enhanced in parallel in this way, and for multiple target users the enhanced voice signals can be further adjusted by setting per-user speech enhancement coefficients, solving the speech noise reduction problem in the multi-person case.
  • the first noisy speech signal is denoised through the speech noise reduction model according to the voice-related data of the first of the M target users, to obtain the noise-reduced speech signal of the first target user and a first noisy speech signal that does not contain the first target user's speech signal; that residual signal is then denoised through the speech noise reduction model according to the voice-related data of the second target user, to obtain the noise-reduced speech signal of the second target user and a first noisy speech signal containing neither the first nor the second target user's speech signal; the above process is repeated until, according to the voice-related data of the M-th target user, the first noisy speech signal not containing the speech signals of the 1st to (M-1)-th target users is denoised through the speech noise reduction model, obtaining the M-th target user's noise-reduced speech signal and the interference noise signal; at this point, the noise-reduced speech signals of all M target users have been obtained.
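The serial scheme above can be sketched as a loop that repeatedly peels one user's speech off the residual. The `denoise_one` callable stands in for the neural model, and the dict-of-components "mixture" is a toy stand-in, purely for illustration:

```python
def serial_denoise(noisy, user_profiles, denoise_one):
    """For each target user in turn, split the current residual into that
    user's speech and a new residual; the final residual is the
    interference noise signal."""
    residual, per_user = noisy, []
    for profile in user_profiles:
        speech, residual = denoise_one(residual, profile)
        per_user.append(speech)
    return per_user, residual

# toy stand-in for the model: the 'mixture' is a dict of labelled components
def toy_denoise(residual, user):
    rest = dict(residual)
    return rest.pop(user), rest

mix = {"user1": 1.0, "user2": 2.0, "noise": 3.0}
speech, noise = serial_denoise(mix, ["user1", "user2"], toy_denoise)
# speech == [1.0, 2.0]; noise == {"noise": 3.0}
```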
  • the relevant data of each target user includes the registered voice signal of that target user; the registered voice signal of target user A is the voice signal of target user A collected in an environment where the noise decibel value is lower than a preset value.
  • the speech noise reduction model includes M first encoding networks, a second encoding network, a time convolutional network (TCN), a first decoding network and M third decoding networks; performing noise reduction on the first noisy speech signal through the speech noise reduction model according to the target-speech-related data, to obtain the target user's noise-reduced speech signal and the interference noise signal, includes:
  • the voice signals of multiple target users can be denoised, thereby solving the problem of voice denoising in the case of multiple people.
  • the method of the present application also includes:
  • an interference noise signal is also obtained from the first decoding network and the second feature vector.
  • the relevant data of the target user A includes the registered voice signal of the target user A
  • the registered voice signal of the target user A is the voice of the target user A collected in an environment where the noise decibel value is lower than a preset value signal
  • the speech noise reduction model includes a first encoding network, a second encoding network, a TCN and a first decoding network; performing noise reduction on the first noisy speech signal through the speech noise reduction model according to the voice-related data of target user A, to obtain the noise-reduced voice signal of target user A, includes:
  • the first noise signal is the first noisy speech signal that does not contain the speech signals of the 1st to (i-1)-th target users; a first feature vector is obtained from the feature vector of the i-th target user's registered voice signal and the feature vector of the first noise signal; a second feature vector is obtained from the TCN and the first feature vector; the noise-reduced voice signal of the i-th target user and a second noise signal are obtained from the first decoding network and the second feature vector, where the second noise signal is the first noisy speech signal that does not contain the speech signals of the 1st to i-th target users.
  • by registering the target user's voice signal in advance, the target user's speech signal can be enhanced and interfering speech and noise suppressed during subsequent voice interaction, so that only the target user's speech signal is input during voice wake-up and voice interaction, improving the effectiveness and accuracy of voice wake-up and speech recognition; the speech noise reduction model is built on a TCN causal dilated convolution network, allowing the model to output the speech signal with low latency.
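The low-latency property comes from causal dilated convolution: each output sample depends only on current and past inputs. A minimal scalar sketch of one such layer (the real model stacks many layers with learned weights; this toy uses a fixed kernel):

```python
def causal_dilated_conv(x, kernel, dilation):
    """y[t] = sum_k kernel[k] * x[t - k*dilation]; indices before the
    start of the signal are treated as zero (causal left-padding), so no
    future sample is ever read."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            i = t - k * dilation
            if i >= 0:
                acc += w * x[i]
        y.append(acc)
    return y

# with kernel [1, 1] and dilation 2, each output adds the sample 2 steps back
out = causal_dilated_conv([1.0, 2.0, 3.0, 4.0], [1.0, 1.0], 2)
# → [1.0, 2.0, 4.0, 6.0]
```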
  • the relevant data of the target user includes the VPU signal of the target user
  • the speech noise reduction model includes a preprocessing module, a third encoding network, a gated recurrent unit (GRU), a second decoding network and a post-processing module; performing noise reduction on the first noisy voice signal through the speech noise reduction model according to the target-speech-related data, to obtain the noise-reduced voice signal of the target user, includes:
  • the first frequency domain signal is fused with the second frequency domain signal to obtain a first fused frequency domain signal;
  • the first fused frequency domain signal is processed successively by the third encoding network, the GRU and the second decoding network to obtain a mask of the first frequency domain signal; the post-processing module post-processes the first frequency domain signal according to the mask to obtain a third frequency domain signal; frequency-time transformation is performed on the third frequency domain signal to obtain the noise-reduced voice signal of the target user;
  • both the third encoding network and the second decoding network are implemented based on convolutional layers and frequency transformation blocks (FTB).
  • the post-processing includes mathematical operations, such as element-wise (dot) multiplication.
  • the mask of the first frequency domain signal is obtained by processing the first fused frequency domain signal successively through the third encoding network, the GRU and the second decoding network; the post-processing module post-processes the first frequency domain signal according to the mask to obtain a fourth frequency domain signal of the interference noise signal; frequency-time transformation is performed on the fourth frequency domain signal to obtain the interference noise signal.
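The "dot multiplication" post-processing step is an element-wise product of a predicted mask with the noisy spectrum. A minimal sketch, assuming complex spectral bins and a real-valued mask (both our illustrative choices; the network that predicts the mask is omitted):

```python
def apply_mask(spectrum, mask):
    """Element-wise (dot) multiplication of a frequency-domain signal
    with a mask: a speech mask keeps the target user's bins, a noise
    mask keeps the interference bins (values typically in [0, 1])."""
    return [m * s for s, m in zip(spectrum, mask)]

noisy_bins = [1 + 1j, 2 + 0j, 0 + 4j]
speech_mask = [0.5, 0.0, 1.0]
masked = apply_mask(noisy_bins, speech_mask)
# → [(0.5+0.5j), 0j, 4j]
```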
  • the relevant data of the target user A includes the VPU signal of the target user A
  • the speech noise reduction model includes a preprocessing module, a third encoding network, a GRU, a second decoding network and a post-processing module; performing noise reduction on the first noisy voice signal through the speech noise reduction model according to the voice-related data of target user A, to obtain the noise-reduced voice signal of target user A, includes:
  • time-frequency transformation is performed on the first noisy speech signal and the VPU signal of target user A through the preprocessing module, to obtain the first frequency domain signal of the first noisy speech signal and the ninth frequency domain signal of target user A's VPU signal; the first frequency domain signal is fused with the ninth frequency domain signal to obtain a second fused frequency domain signal; the second fused frequency domain signal is processed successively by the third encoding network, the GRU and the second decoding network to obtain a mask of the tenth frequency domain signal; the post-processing module post-processes the first frequency domain signal according to the mask to obtain the tenth frequency domain signal; frequency-time transformation is performed on the tenth frequency domain signal to obtain the noise-reduced speech signal of target user A; both the third encoding network and the second decoding network are implemented based on convolutional layers and FTBs.
  • the relevant data of the i-th target user among the M target users includes the VPU signal of the i-th target user, where i is an integer greater than 0 and less than or equal to M,
  • Both the first noise signal and the VPU signal of the i-th target user are time-frequency transformed by the preprocessing module to obtain the eleventh frequency domain signal of the first noise signal and the tenth frequency domain signal of the i-th target user's VPU signal.
  • the noise reduction speech signal of the target user is enhanced to obtain the enhanced speech signal of the target user, including:
  • the interference noise suppression signal is fused with the target user's enhanced speech signal to obtain an output signal, including:
  • the magnitudes of the enhanced speech signals of the multiple target users can be adjusted as required.
  • the relevant data of the target user includes the target user's VPU signal
  • the method of the present application further includes: acquiring the target user's in-ear sound signal
  • MVDR minimum variance distortionless response
  • an interference noise signal is obtained according to the noise-reduced speech signal of the target user and the first noisy speech signal.
  • the relevant data of the target user A includes the VPU signal of the target user A
  • the method of the present application further includes: acquiring the in-ear sound signal of the target user A;
  • the first noisy voice signal is denoised through the voice noise reduction model to obtain the denoised voice signal of the target user A, including:
  • the method of the present application also includes:
  • SNR signal-to-noise ratio
  • SPL sound pressure level
  • Obtaining the first noisy speech signal includes:
  • the direction of arrival (DOA) and the sound pressure level (SPL) of the first noise segment are computed from the signal collected from the environment; if the DOA of the first noise segment is greater than a ninth threshold and less than a tenth threshold, and the SPL of the first noise segment is greater than an eleventh threshold, a second temporary feature vector of the first noise segment is extracted; the second noise segment is denoised based on the second temporary feature vector to obtain a fourth denoised noise segment; damage assessment is performed based on the fourth denoised noise segment and the second noise segment to obtain a fourth damage score; if the fourth damage score is not greater than a twelfth threshold, the PNR mode is entered.
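The DOA/SPL gate above reduces to a range-and-level check before any feature extraction happens. A sketch with the thresholds exposed as parameters (the default values are placeholders of ours, not taken from the application):

```python
def passes_enrollment_gate(doa_deg, spl_db,
                           ninth=30.0, tenth=150.0, eleventh=40.0):
    """Extract a temporary feature vector only when the noise segment's
    direction of arrival lies strictly between the ninth and tenth
    thresholds and its sound pressure level exceeds the eleventh."""
    return ninth < doa_deg < tenth and spl_db > eleventh

passes_enrollment_gate(90.0, 60.0)   # → True  (in-sector, loud enough)
passes_enrollment_gate(10.0, 60.0)   # → False (DOA outside the sector)
```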
  • DOA direction of arrival
  • SPL sound pressure level
  • Obtaining the first noisy speech signal includes:
  • a first noisy speech signal is determined from a noise signal generated after the first noise segment; the feature vector of the registered speech signal includes a second temporary feature vector.
  • Time-frequency transform is performed on the signal collected by the microphone array to obtain a nineteenth frequency domain signal, and based on the nineteenth frequency domain signal, DOA and SPL of the first noise segment are calculated.
  • the method of the present application also includes:
  • the fourth prompt message is sent by the terminal device, and the fourth prompt message is used to prompt whether the terminal device enters the PNR mode; the PNR mode is entered only after an operation instruction of the target user agreeing to enter the PNR mode is detected.
  • the auxiliary device may be a device with a microphone array, such as a computer, a tablet computer, and the like.
  • the method of the present application also includes:
  • the method of the present application also includes:
  • if the terminal device has stored a reference temporary voiceprint feature vector, a third noise segment is obtained; the third noise segment is denoised according to the reference temporary voiceprint feature vector to obtain a third denoised noise segment; damage assessment is performed based on the third noise segment and the third denoised noise segment to obtain a third damage score; if the third damage score is greater than a sixth threshold and the SNR of the third noise segment is less than a seventh threshold, or the third damage score is greater than an eighth threshold and the SNR of the third noise segment is not less than the seventh threshold, third prompt information is sent through the terminal device to inform the current user that the device can enter the PNR mode; upon detecting the current user's operation instruction agreeing to enter the PNR mode, the terminal device enters the PNR mode and performs noise reduction on the fourth noisy voice signal; if the current user does not agree to enter the PNR mode, the device does not enter it.
  • the reference temporary voiceprint feature vector is the voiceprint feature vector of the historical user.
  • the seventh threshold may be 10dB or other values
  • the sixth threshold may be 8dB or other values
  • the eighth threshold may be 12dB or other values.
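Combining the damage-score and SNR conditions with the example thresholds gives a small decision function. This is a sketch of the stated logic only; how the damage score itself is computed is not specified here:

```python
def should_prompt_pnr(damage_score, snr_db,
                      sixth=8.0, seventh=10.0, eighth=12.0):
    """Prompt the user that PNR mode is available when either:
    damage > sixth threshold at low SNR (snr < seventh), or
    damage > eighth threshold otherwise (snr >= seventh)."""
    if snr_db < seventh:
        return damage_score > sixth
    return damage_score > eighth

should_prompt_pnr(9.0, 5.0)    # → True  (low SNR, damage above 8)
should_prompt_pnr(11.0, 15.0)  # → False (high SNR, damage below 12)
```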
  • the method of the present application also includes:
  • when it is detected that the terminal device is in a hands-free call state, the PNR mode is entered, where the target user is the owner of the terminal device or the user who is using the terminal device;
  • when it is detected that the terminal device is in a video call state, the PNR mode is entered, where the target user is the owner of the terminal device or the user closest to the terminal device;
  • when it is detected that the terminal device is connected to a headset for a call, the PNR mode is entered, where the target user is the user wearing the headset; the first noisy voice signal and the target-voice-related data are collected through the headset; or,
  • when it is detected that the terminal device is connected to a smart large-screen device, a smart watch or a vehicle-mounted device, the PNR mode is entered, where the target user is the owner of the terminal device or the user who is using the terminal device, and the first noisy voice signal and the target-voice-related data are collected by the audio collection hardware of the smart large-screen device, smart watch or vehicle-mounted device.
  • the method of the present application also includes:
  • if the decibel value of the audio signal in the current environment exceeds a preset decibel value, it is determined whether the PNR function corresponding to the application started by the terminal device is enabled; if it is not enabled, the PNR function corresponding to that application is enabled and the PNR mode is entered.
  • the application program is an application program installed on the terminal device, such as call, video call, video recording application program, WeChat, QQ and so on.
  • the terminal device includes a display screen, and the display screen includes a plurality of display areas, where each display area displays a label and a corresponding function key, and the function key is used to turn on and off the PNR function of the function or application indicated by its corresponding label.
  • the interface displayed on the display screen of the terminal device is set to control the opening and closing of the PNR function of a certain application program (such as calling, recording, etc.) of the terminal device, so that the user can turn on and off the PNR function as required.
  • a certain application program such as calling, recording, etc.
  • when voice data transmission is performed between the terminal device and another terminal device, the method of the present application further includes:
  • receiving a voice enhancement request sent by the other terminal device, where the voice enhancement request instructs the terminal device to enable the PNR function of the call function; in response to the voice enhancement request, sending third prompt information through the terminal device, asking whether to enable the PNR function of the call function; upon detecting an operation instruction confirming enabling of the PNR function of the call function, enabling the PNR function of the call function and entering the PNR mode; and sending a voice enhancement response message to the other terminal device, indicating that the terminal device has enabled the PNR function of the call function.
  • when the terminal device starts the video call or video recording function, the display interface of the terminal device includes a first area and a second area, where the first area is used to display the video call or video recording content, and the second area is used to display M controls and the corresponding M labels.
  • the M controls correspond to the M target users one by one.
  • each of the M controls includes a sliding button and a sliding bar; sliding the button along the bar adjusts the speech enhancement coefficient of the target user indicated by the control's label.
  • the user can adjust the intensity of noise reduction according to his need.
  • the interference noise suppression coefficient can also be adjusted in this way.
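A slider position maps naturally to the coefficient it controls. A minimal UI-side sketch; the linear mapping and the `max_value` ceiling are illustrative choices of ours, not from the application:

```python
def slider_to_coefficient(position, max_value=4.0):
    """Map a slider position in [0, 1] to an enhancement (or suppression)
    coefficient in [0, max_value], clamping out-of-range input."""
    position = min(max(position, 0.0), 1.0)
    return position * max_value

slider_to_coefficient(0.5)   # → 2.0
slider_to_coefficient(1.5)   # → 4.0 (clamped to the top of the bar)
```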
  • when the terminal device starts the video call or video recording function, the display interface of the terminal device includes a first area, and the first area is used to display the video call content or the video recording content;
  • the control corresponding to the object is displayed in the first area; the control includes a sliding button and a sliding bar, and sliding the button along the bar adjusts the speech enhancement coefficient for this object.
  • the user can adjust the intensity of noise reduction according to his need.
  • the interference noise suppression coefficient can also be adjusted in this way.
  • the target voice-related data includes a voice signal including a wake-up word
  • the first noisy voice signal includes an audio signal including a command word
  • the smart interactive devices include devices such as smart speakers, sweeping robots, smart refrigerators, and smart air conditioners.
  • This method is used to perform noise reduction processing on the instruction voice for controlling the intelligent interactive device, so that the intelligent interactive device can quickly obtain accurate instructions, and then complete the actions corresponding to the instructions.
  • an embodiment of the present application provides a terminal device, where the terminal device includes a unit or a module configured to execute the method in the first aspect.
  • an embodiment of the present application provides a terminal device, including a processor and a memory, wherein the processor is connected to the memory, wherein the memory is used to store program codes, and the processor is used to call the program codes to execute the method of the first aspect part or all of.
  • the embodiment of the present application provides a chip system, which is applied to electronic equipment; the chip system includes one or more interface circuits, and one or more processors; the interface circuits and processors are interconnected through lines; the interface The circuit is used to receive a signal from the memory of the electronic device and send a signal to the processor, the signal including computer instructions stored in the memory; when the processor executes the computer instruction, the electronic device executes the method described in the first aspect.
  • an embodiment of the present application provides a computer storage medium, where the computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement the method described in the first aspect.
  • the embodiment of the present application further provides a computer program product, including computer instructions, which, when the computer instructions are run on the terminal device, enable the terminal device to implement part of the method described in the first aspect or all.
  • FIG. 1 is a schematic diagram of an application scenario provided by an embodiment of the present application
  • FIG. 2a is a schematic diagram of a speech noise reduction processing principle provided by an embodiment of the present application.
  • FIG. 2b is a schematic diagram of another speech noise reduction processing principle provided by the embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a speech enhancement method provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a speech noise reduction model provided in an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a speech noise reduction model provided in an embodiment of the present application.
  • FIG. 6a is a schematic diagram of the framework structure of the TCN model;
  • FIG. 6b is a schematic diagram of the structure of the causal dilated convolutional layer unit;
  • FIG. 7 is a schematic structural diagram of another speech noise reduction model provided by the embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of the neural network in FIG. 7;
  • FIG. 9 is a schematic diagram of a speech noise reduction process provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of another speech noise reduction process provided by the embodiment of the present application.
  • FIG. 11 is a schematic diagram of a multi-person speech noise reduction process provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of a multi-person speech noise reduction process provided by an embodiment of the present application.
  • FIG. 13 is a schematic diagram of a multi-person speech noise reduction process provided by an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of another speech noise reduction model provided by the embodiment of the present application.
  • FIG. 15 is a schematic diagram of a UI interface provided by the embodiment of the present application.
  • FIG. 16 is a schematic diagram of another UI interface provided by the embodiment of the present application.
  • FIG. 17 is a schematic diagram of another UI interface provided by the embodiment of the present application.
  • FIG. 18 is a schematic diagram of another UI interface provided by the embodiment of the present application.
  • FIG. 19 is a schematic diagram of a UI interface in a call scenario provided by an embodiment of the present application.
  • FIG. 20 is a schematic diagram of a UI interface in another call scenario provided by the embodiment of the present application.
  • FIG. 21 is a schematic diagram of a video recording UI interface provided by an embodiment of the present application.
  • FIG. 22 is a schematic diagram of a video call UI interface provided by an embodiment of the present application.
  • FIG. 23 is a schematic diagram of another video call UI interface provided by the embodiment of the present application.
  • FIG. 24 is a schematic structural diagram of a terminal device provided in an embodiment of the present application.
  • FIG. 25 is a schematic structural diagram of another terminal device provided by an embodiment of the present application.
  • FIG. 26 is a schematic structural diagram of another terminal device provided by an embodiment of the present application.
  • Multiple means two or more.
  • "And/or" describes the association relationship of associated objects, indicating that three relationships may exist; for example, A and/or B may indicate: A exists alone, A and B exist simultaneously, or B exists alone.
  • the character "/" generally indicates that the associated objects are in an "or" relationship.
  • FIG. 1 is a schematic diagram of an application scenario provided by an embodiment of the present application.
  • the application scenario includes an audio collection device 102 and a terminal device 101.
  • the terminal device can be a smart phone, a smart watch, a TV, a smart vehicle/vehicle terminal, a headset, a PC, a tablet, a notebook computer, a smart speaker, a robot, a recording collection device, etc.
  • On terminal equipment that needs to collect sound signals, such as a mobile phone performing voice enhancement, the noisy voice signal collected by the microphone is processed and the noise-reduced voice signal of the target user is output, either as the uplink signal of a voice call or as the input signal to the voice wake-up and voice recognition engines.
  • the sound signal can also be collected by an audio collection device 102 connected to the terminal device in a wired or wireless manner.
  • the audio collection device 102 and the terminal device 101 are integrated together.
  • Fig. 2a and Fig. 2b schematically illustrate the principle of speech noise reduction processing.
  • the noisy speech signal and the registered speech of the target user are input into the speech noise reduction model for processing to obtain the noise-reduced speech signal of the target user; alternatively, as shown in Figure 2b, the noisy speech signal and the VPU signal of the target user are input into the speech noise reduction model for processing to obtain the noise-reduced speech signal of the target user.
  • the enhanced voice signal can be used for voice calls or voice wake-up and voice recognition functions.
  • private devices, such as mobile phones, PCs and various personal wearable products.
  • the target user is fixed, and only the voice information of the target user is kept as the registered voice or VPU signal during calls and voice interaction; voice enhancement is then performed in the above-mentioned manner, which can greatly improve the user experience.
  • voice enhancement can be performed through multi-user voice registration (as shown in Figure 2a), which can improve the experience of multi-user scenarios.
  • FIG. 3 is a schematic flowchart of a speech enhancement method provided by an embodiment of the present application. As shown in Figure 3, the method includes:
  • After the terminal device enters the PNR mode, it acquires the first noisy voice signal and target voice-related data, wherein the first noisy voice signal includes an interference noise signal and the voice signal of the target user, and the target voice-related data is used to indicate the voice characteristics of the target user.
  • the target voice-related data may be the target user's registered voice signal, or the target user's VPU signal, or the target user's voiceprint features, or the target user's video lip movement information.
  • the voice signal of the target user with a preset duration, collected by the microphone in a quiet scene, is the registered voice signal of the target user; the sampling frequency of the microphone may be 16000 Hz, and assuming that the preset duration is 6 s, the registered voice signal of the target user includes 96000 sampling points.
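The figures above are consistent: the number of sampling points is the sampling frequency times the preset duration, as this small check shows.

```python
# Relationship between the sampling frequency, preset duration and the
# number of sampling points cited in the text (16000 Hz, 6 s, 96000).
sampling_frequency_hz = 16000
preset_duration_s = 6
num_sampling_points = sampling_frequency_hz * preset_duration_s
print(num_sampling_points)  # 96000
```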
  • the quiet scene specifically means that the sound level of the scene is not higher than a preset decibel; optionally, the preset decibel may be 1dB, 2dB, 5dB, 10dB or other values.
  • the target user's VPU signal is acquired through a device with a bone voiceprint sensor, and the VPU sensor in the bone voiceprint sensor can pick up the target user's voice signal through bone conduction.
  • the VPU signal differs in that it only picks up the voice of the target user and can only pick up low-frequency components (generally below 4 kHz).
  • the first noisy speech signal includes the target user's speech signal and other noise signals
  • the other noise signals include other user's speech signals and/or noise signals generated by non-human beings, such as noise signals generated by automobiles and construction site machines.
  • for different types of target voice-related data, the speech noise reduction model has different network structures; that is to say, the speech noise reduction model adopts different processing methods for different target voice-related data.
  • when the target voice-related data is the registered voice of the target user or the video lip movement information of the target user, the voice noise reduction model corresponding to mode 1 can be used to perform noise reduction processing on the target voice-related data and the first noisy voice signal;
  • when the target voice-related data includes the VPU signal of the target user, the voice noise reduction model corresponding to mode 2 or mode 3 may be used to perform noise reduction processing on the target voice-related data and the first noisy voice signal.
  • the first method is specifically described by taking the target voice-related data as the registered voice signal of the target user as an example.
  • Mode 1: As shown in Figure 4, performing noise reduction processing on the first noisy voice signal through the voice noise reduction model according to the target voice-related data, to obtain the noise-reduced voice signal of the target user, specifically includes the following steps:
  • the speech noise reduction model includes a first encoding network, a second encoding network, a TCN, and a first decoding network.
  • the first encoding network includes a convolutional layer, layer normalization (256), an activation function PReLU (256) and an averaging layer, and the size of the convolution kernel of the convolutional layer may be 1*1; the registered voice with 96000 sampling points is input with 40 sampling points as a frame, and a feature matrix with a size of 4800*256 is obtained through the convolutional layer, layer normalization and the activation function PReLU, where the overlap rate of the sampling points of two adjacent frames may be 50% (the overlap rate may of course take other values); the feature matrix is then averaged in the time dimension by the averaging layer to obtain the feature vector of the registered speech signal with a size of 1*256.
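The framing and time-averaging above can be sketched as follows. This is a minimal shape check, not the actual network: the learned layers (1*1 convolution, layer normalization, PReLU) are stood in for by a fixed random projection, and 40-sample frames with a 20-sample hop (50% overlap) give roughly the 4800 frames cited in the text.

```python
import numpy as np

# Sketch of the first encoding network's framing and time-averaging.
# Only the shapes are meant to match the text; feature values are fake.
rng = np.random.default_rng(0)
signal = rng.standard_normal(96000)               # registered voice, 96000 samples
frame_len, hop = 40, 20                           # 50% overlap between adjacent frames
n_frames = (len(signal) - frame_len) // hop + 1   # 4799, i.e. ~4800 frames
frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
proj = rng.standard_normal((frame_len, 256))      # stand-in for the learned 256-dim features
features = frames @ proj                          # ~4800 x 256 feature matrix
embedding = features.mean(axis=0, keepdims=True)  # averaged over time: 1 x 256
print(embedding.shape)  # (1, 256)
```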
  • the second encoding network includes a convolutional layer, layer normalization and an activation function; specifically, the noisy speech is passed through the convolutional layer, layer normalization and the activation function to obtain the speech feature vector of each frame; a mathematical operation, such as dot multiplication, is performed on the target speech feature vector and the speech feature vector of each frame in the first noisy speech, so as to obtain the first feature vector.
  • the above mathematical operations may be dot multiplication or other operations.
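The fusion step can be sketched as below: the 1 x 256 feature vector of the registered voice is combined with the per-frame 256-dim feature vectors of the first noisy speech by element-wise (dot) multiplication, broadcast over the time axis. Shapes follow the text; the feature values are random placeholders.

```python
import numpy as np

# Element-wise (dot) multiplication of the registered-voice feature
# vector with each frame's feature vector, broadcast over time.
target_vec = np.random.randn(1, 256)              # registered-voice feature vector
noisy_feats = np.random.randn(4800, 256)          # per-frame features of noisy speech
first_feature_vector = noisy_feats * target_vec   # broadcast dot multiplication
print(first_feature_vector.shape)  # (4800, 256)
```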
  • the TCN model uses a causal atrous convolution model.
  • Figure 6a shows the framework structure of the TCN model.
  • the TCN model includes M blocks, and each block consists of N causal atrous convolutional layer units.
  • Figure 6b shows the structure of the causal dilated convolutional layer unit; the convolution dilation rate corresponding to the n-th layer is 2^(n-1).
  • the TCN model includes 5 blocks, and each block includes 4 causal dilated convolutional layer units, so the dilation rates corresponding to layers 1, 2, 3, and 4 in each block are 1, 2, 4, and 8 respectively.
  • the convolution kernel is 3x1.
  • the first feature vector is passed through the TCN model to obtain the second feature vector, and the dimension of the second feature vector is 1x256.
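The causal dilated (atrous) convolution that the TCN layers use can be sketched as follows: the output at time t depends only on inputs at t, t-d, t-2d, ... where d is the dilation rate (2^(n-1) for the n-th layer), so no future samples are needed and latency stays low. The 3x1 kernel matches the text; the kernel values here are arbitrary.

```python
import numpy as np

def causal_dilated_conv(x, kernel, dilation):
    """1-D causal dilated convolution: left-pad with zeros so the output
    at time t only sees inputs at t, t-dilation, t-2*dilation, ..."""
    k = len(kernel)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([sum(kernel[j] * xp[t + pad - j * dilation] for j in range(k))
                     for t in range(len(x))])

x = np.arange(8, dtype=float)
kernel = np.array([1.0, 1.0, 1.0])  # 3x1 kernel as in the text; weights are arbitrary
y = causal_dilated_conv(x, kernel, dilation=2)  # dilation of layer 2 is 2**(2-1) = 2
print(y)  # [ 0.  1.  2.  4.  6.  9. 12. 15.]
```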
  • the first decoding network includes an activation function PReLU (256) and a deconvolution layer (256x20x2); the second feature vector passes through the activation function and the deconvolution layer to obtain the voice signal of the target user.
  • for the structure of the second encoding network, refer to the structure of the first encoding network; the second encoding network, however, does not perform averaging in the time dimension.
  • the target user's video lip movement information includes multiple frames of images containing the target user's lip movements. If the target voice-related data is the target user's video lip movement information, the registered voice signal of the target user is replaced with the video lip movement information of the target user, the feature vector of the video lip movement information is extracted through the first encoding network, and the subsequent processing is then performed according to mode 1 as described above.
  • By registering the voice signal of the target user in advance, the voice signal of the target user can be enhanced and interfering voices and noise suppressed in subsequent voice interaction, so that only the voice signal of the target user is input during voice wake-up and voice interaction, improving the effect and accuracy of voice wake-up and speech recognition; moreover, the TCN causal dilated convolution network is used to build the speech noise reduction model, which enables the speech noise reduction model to output the speech signal with low latency.
  • Mode 2 and Mode 3 are specifically described by taking the target voice-related data as the VPU signal of the target user as an example.
  • Mode 2: As shown in Figure 7, performing noise reduction processing on the VPU signal of the target user and the first noisy voice signal using the voice noise reduction model to obtain the noise-reduced voice signal of the target user specifically includes the following steps:
  • the frequency domain signal of the VPU signal is fused with the frequency domain signal of the first noisy speech signal to obtain the first fused frequency domain signal;
  • the first fused frequency domain signal is processed through the third encoding network, the GRU and the second decoding network in turn to obtain the mask of the frequency domain signal of the voice signal of the target user;
  • the frequency domain signal of the first noisy voice signal is post-processed through the post-processing module according to the mask of the frequency domain signal of the voice signal of the target user, for example by a mathematical operation such as dot multiplication, to obtain the frequency domain signal of the voice signal of the target user, and frequency-time transformation is performed on the frequency domain signal of the voice signal of the target user to obtain the noise-reduced voice signal of the target user.
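The mask-based post-processing above can be sketched as follows: the noisy frame is taken to the frequency domain (FFT), multiplied element-wise by the mask the network would predict (a random stand-in here), and transformed back (IFFT) to the time domain.

```python
import numpy as np

# Sketch of mode 2's post-processing: spectrum * mask, then back to time.
frame = np.random.randn(512)                  # one frame of the noisy speech signal
spectrum = np.fft.rfft(frame)                 # frequency domain signal (FFT)
mask = np.random.rand(spectrum.shape[0])      # stand-in for the model's predicted mask
target_spectrum = mask * spectrum             # dot multiplication (post-processing)
denoised = np.fft.irfft(target_spectrum, n=frame.shape[0])  # back to time domain (IFFT)
print(denoised.shape)  # (512,)
```

With an all-ones mask, the round trip recovers the original frame, which is a useful sanity check on the transform pair.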
  • fast Fourier transform (FFT) is performed on the VPU signal of the target user and on the first noisy voice signal through the preprocessing module to obtain the frequency domain signal of the VPU signal of the target user and the frequency domain signal of the first noisy voice signal.
  • the first fused frequency domain signal is input into the third encoding network for feature extraction to obtain the feature vector of the first fused frequency domain signal; then the eigenvector of the first fused frequency domain signal is input into the GRU performing processing to obtain a third eigenvector; inputting the third eigenvector into a second decoding network for processing to obtain a mask of a frequency domain signal of the voice signal of the target user.
  • both the third encoding network and the second decoding network include 2 convolutional layers and 1 FTB, where the size of the convolution kernel of the convolutional layers is 3x3.
  • the mask of the frequency domain signal of the speech signal of the target user is dot-multiplied with the frequency domain signal of the first noisy speech signal by the post-processing module to obtain the frequency domain signal of the speech signal of the target user; inverse fast Fourier transform (IFFT) is then performed on the frequency domain signal of the speech signal of the target user to obtain the noise-reduced speech signal of the target user.
  • the VPU signal of the target user is used to extract the voice features of the target user in real time, and these features are fused with the first noisy voice signal collected by the microphone to guide the enhancement of the target user's voice and the suppression of interference such as the voices of non-target users; this embodiment also proposes a new speech noise reduction model, based on the FTB and GRU, for enhancing the target user's speech and suppressing interference such as the speech of non-target users. It can be seen that, by adopting the scheme of this embodiment, the user is not required to register voice feature information in advance, and the real-time VPU signal can be used as auxiliary information to obtain the enhanced voice of the target user while suppressing non-target voice interference.
  • Mode 3: Time-frequency transformation is performed on the first noisy speech signal and the target user's in-ear sound signal respectively to obtain the frequency domain signal of the first noisy speech signal and the frequency domain signal of the target user's in-ear sound signal;
  • the covariance matrices of the first noisy speech signal and of the target user's in-ear sound signal are obtained based on the VPU signal, the frequency domain signal of the first noisy speech signal and the frequency domain signal of the target user's in-ear sound signal;
  • the first MVDR weight is obtained from the covariance matrices of the first noisy speech signal and of the target user's in-ear sound signal; based on the first MVDR weight, the frequency domain signal of the first speech signal and the frequency domain signal of the second speech signal are obtained from the frequency domain signal of the first noisy speech signal and the frequency domain signal of the target user's in-ear sound signal, wherein the frequency domain signal of the first speech signal is related to the first noisy speech signal and the frequency domain signal of the second speech signal is related to the target user's in-ear sound signal; the noise-reduced speech signal of the target user is then obtained according to these frequency domain signals.
  • an earphone device with a bone voiceprint sensor: the device includes a bone voiceprint sensor, an in-ear microphone and an out-of-ear microphone, and the VPU sensor in the bone voiceprint sensor can pick up the sound signal of the speaker through bone conduction;
  • the in-ear microphone is used to pick up the sound signal in the ear;
  • the out-of-ear microphone is used to pick up the sound signal outside the ear, which is the first noisy voice signal in this application;
  • the VPU signal of the target user is processed by a voice activity detection (VAD) algorithm to obtain a processing result; according to the processing result, it is judged whether the target user is speaking; if it is judged that the target user is speaking, the first flag is set to a first value (such as 1 or true); if it is judged that the target user is not speaking, the first flag is set to a second value (such as 0 or false);
  • updating the covariance matrices specifically includes: respectively performing time-frequency transformation, such as FFT, on the first noisy speech signal and the target user's in-ear sound signal to obtain the frequency domain signal of the first noisy speech signal and the frequency domain signal of the target user's in-ear sound signal; the covariance matrices are then calculated based on the frequency domain signal of the first noisy speech signal and the frequency domain signal of the target user's in-ear sound signal.
  • X^H(f) is the Hermitian transform of X(f), that is, the conjugate transpose of X(f); f denotes a frequency point;
  • the MVDR weight is then obtained based on the covariance matrices, where the MVDR weight can be expressed as:
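The excerpt states the MVDR weight "can be expressed as:" without reproducing the formula. As a hedged illustration, the textbook MVDR beamformer weight at one frequency bin f is w(f) = Phi^(-1)(f) a(f) / (a^H(f) Phi^(-1)(f) a(f)), with Phi the covariance matrix and a the steering vector; the application's actual expression may differ.

```python
import numpy as np

# Textbook MVDR weight at one frequency bin (an assumption, not the
# application's exact formula).
def mvdr_weight(Phi, a):
    Phi_inv_a = np.linalg.solve(Phi, a)         # Phi^{-1} a
    return Phi_inv_a / (a.conj() @ Phi_inv_a)   # normalize by a^H Phi^{-1} a

Phi = np.array([[2.0, 0.5], [0.5, 1.0]], dtype=complex)  # example covariance matrix
a = np.array([1.0, 1.0], dtype=complex)                  # example steering vector
w = mvdr_weight(Phi, a)
print(abs(w.conj() @ a))  # 1.0: distortionless response toward the target direction
```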
  • based on the first MVDR weight, the frequency domain signal of the first speech signal and the frequency domain signal of the second speech signal are obtained from the frequency domain signal of the first noisy speech signal and the frequency domain signal of the target user's in-ear sound signal; the frequency domain signal of the first speech signal is related to the first noisy speech signal, and the frequency domain signal of the second speech signal is related to the target user's in-ear sound signal; specifically, the frequency domain signal of the first noisy speech signal and the frequency domain signal of the target user's in-ear sound signal are each multiplied by the weight vectors to obtain the frequency domain signal of the first speech signal and the frequency domain signal of the second speech signal;
  • the locked covariance matrix is not updated, that is to say, the historical covariance matrix is used for calculating the first MVDR weight.
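The VAD-gated update and lock described above can be sketched as follows. The covariance accumulates X(f) X^H(f); while the flag indicates the matrix is locked, the historical value is reused for the MVDR weight. Which flag value triggers the lock, and the exponential smoothing factor, are assumed implementation details not stated in the text.

```python
import numpy as np

# Sketch of a VAD-gated covariance update: locked -> keep history;
# otherwise blend in the instantaneous outer product X(f) X^H(f).
def update_covariance(Phi, X, locked, alpha=0.95):
    if locked:                            # locked: reuse the historical matrix
        return Phi
    return alpha * Phi + (1 - alpha) * np.outer(X, X.conj())

Phi0 = np.eye(2, dtype=complex)
X = np.array([1.0 + 1.0j, 0.5], dtype=complex)
print(np.allclose(update_covariance(Phi0, X, locked=True), Phi0))  # True
```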
  • the user does not need to register the voice feature information in advance, and the real-time VPU signal can be used as the auxiliary information to obtain the enhanced voice signal while suppressing the interference noise.
  • the speech enhancement coefficient of the target user is obtained, and the noise-reduced speech signal of the target user is enhanced based on the speech enhancement coefficient to obtain the enhanced speech signal of the target user, wherein the ratio of the amplitude of the enhanced speech signal of the target user to the amplitude of the noise-reduced speech signal of the target user is the speech enhancement coefficient of the target user.
  • an interference noise signal is added on the basis of the voice signal of the target user, thereby improving the user experience.
  • the decoding network (including the first decoding network and the second decoding network) can output the interference noise signal in addition to the enhanced speech signal.
  • the interference noise signal can be obtained by subtracting the noise-reduced speech signal of the target user from the first noisy speech signal.
  • the second decoding network of the speech noise reduction model also outputs the mask of the frequency domain signal of the first noisy speech signal; the post-processing module also post-processes the frequency domain signal of the first noisy speech signal according to this mask, for example by dot multiplication, to obtain the frequency domain signal of the interference noise, and frequency-time transformation, such as IFFT, is then performed on the frequency domain signal of the interference noise to obtain the interference noise signal.
  • the first noisy speech signal is processed according to the noise-reduced speech signal of the target user to obtain an interference noise signal.
  • the interference noise signal can be obtained by subtracting the noise-reduced speech signal of the target user from the first noisy speech signal.
  • the interference noise signal is fused with the enhanced voice signal of the target user to obtain an output signal; the output signal is the enhanced voice signal of the target user and It is obtained by mixing the interference noise signal.
  • the interference noise suppression coefficient is obtained, and the interference noise signal is suppressed based on the interference noise suppression coefficient to obtain the interference noise suppression signal, wherein the ratio of the amplitude of the interference noise suppression signal to the amplitude of the interference noise signal is the interference noise suppression coefficient; the interference noise suppression signal is then fused with the enhanced speech signal of the target user to obtain an output signal; the output signal is obtained by mixing the enhanced speech signal of the target user and the interference noise suppression signal.
  • the interference noise suppression coefficient is acquired, and the interference noise signal is suppressed based on the interference noise suppression coefficient to obtain an interference noise suppression signal; then the interference noise suppression signal is fused with the target user's noise-reduced voice signal to obtain an output signal.
  • the output signal is obtained by mixing the target user's noise-reduced voice signal and the interference noise-suppressed signal.
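The mixing described above can be sketched directly from the ratio definitions in the text: |enhanced speech| / |noise-reduced speech| equals the speech enhancement coefficient, and |suppressed noise| / |interference noise| equals the interference noise suppression coefficient; the two scaled signals are then summed.

```python
import numpy as np

# Sketch of output assembly from the amplitude-ratio definitions above.
def mix_output(denoised_speech, interference_noise, enhance_coeff, suppress_coeff):
    enhanced = enhance_coeff * denoised_speech      # |enhanced| / |denoised| = coeff
    suppressed = suppress_coeff * interference_noise  # |suppressed| / |noise| = coeff
    return enhanced + suppressed

speech = np.ones(4)   # toy noise-reduced speech signal
noise = np.ones(4)    # toy interference noise signal
out = mix_output(speech, noise, enhance_coeff=1.5, suppress_coeff=0.2)
print(out)  # [1.7 1.7 1.7 1.7]
```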
  • the target users may include M target users; in this case, the target voice-related data includes the voice-related data of the M target users, the noise-reduced voice signals of the target users include the noise-reduced voice signals of the M target users, the voice enhancement coefficients of the target users include the voice enhancement coefficients of the M target users, and the first noisy speech signal includes the speech signals of the M target users and the interference noise signal.
  • Mode 4: As shown in Figure 11, the voice-related data of the first target user among the M target users and the first noisy speech signal are input into the speech noise reduction model for noise reduction processing to obtain the noise-reduced speech signal of the first target user and the first noisy speech signal not containing the speech signal of the first target user; the voice-related data of the second target user and the first noisy speech signal not containing the speech signal of the first target user are input into the speech noise reduction model for noise reduction processing to obtain the noise-reduced speech signal of the second target user and the first noisy speech signal not containing the speech signals of the first and second target users; the above steps are repeated until the voice-related data of the M-th target user and the first noisy speech signal not containing the speech signals of the 1st to (M-1)-th target users are input into the speech noise reduction model for noise reduction processing, obtaining the noise-reduced speech signal of the M-th target user and the interference noise signal, the interference noise signal being the first noisy speech signal not containing the speech signals of the 1st to M-th target users;
  • based on the speech enhancement coefficients of the M target users, the noise-reduced speech signals of the M target users are respectively enhanced to obtain the enhanced speech signals of the M target users; for any target user O among the M target users, the ratio of the amplitude of the enhanced speech signal of target user O to the amplitude of the noise-reduced speech signal of target user O is the speech enhancement coefficient of target user O; based on the interference noise suppression coefficient, the interference noise signal is suppressed to obtain the interference noise suppression signal, wherein the ratio of the amplitude of the interference noise suppression signal to the amplitude of the interference noise signal is the interference noise suppression coefficient.
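The sequential scheme of mode 4 can be sketched as follows: each pass extracts one target user's speech and hands the remainder to the next pass, and what is left after M passes is the interference noise signal. `denoise` is a hypothetical stand-in for the speech noise reduction model, not the real model.

```python
import numpy as np

# Cascade of mode 4: each pass removes one user's speech from the mix.
def denoise(user_data, noisy):
    user_speech = 0.5 * noisy          # placeholder attribution, not a real model
    return user_speech, noisy - user_speech

noisy = np.ones(4)                     # toy first noisy speech signal
remainder = noisy
speech_signals = []
for user_data in ["user_1", "user_2", "user_3"]:   # M = 3 target users
    speech, remainder = denoise(user_data, remainder)
    speech_signals.append(speech)
interference_noise = remainder         # the noisy signal minus all M users' speech
print(interference_noise)  # [0.125 0.125 0.125 0.125]
```

Note that by construction the M extracted speech signals plus the interference noise reconstruct the input, which mirrors the decomposition described in the text.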
  • regarding the structure of the voice noise reduction model in mode 4: when the voice-related data of the M target users are registered voice signals or video lip movement information, the structure can be that described in mode 1; when the voice-related data of the M target users are VPU signals, the structure can be that described in mode 2, or the voice noise reduction model in mode 4 realizes the function described in mode 3.
  • when the noise-reduced speech signals of the M target users and the interference noise signal are obtained according to mode 4, the noise-reduced speech signals of the M target users and the interference noise signal may be directly fused to obtain an output signal.
  • the output signal is obtained by mixing noise-reduced speech signals and interference noise signals of M target users.
  • Mode 5: When the target users include M target users, as shown in Figure 12, the voice-related data of the first target user among the M target users and the first noisy voice signal are input into the voice noise reduction model for noise reduction processing to obtain the noise-reduced voice signal of the first target user; the voice-related data of the second target user and the first noisy voice signal are input into the voice noise reduction model for noise reduction processing to obtain the noise-reduced voice signal of the second target user; the above steps are repeated until the voice-related data of the M-th target user and the first noisy voice signal are input into the voice noise reduction model for noise reduction processing to obtain the noise-reduced voice signal of the M-th target user;
  • based on the speech enhancement coefficients of the M target users, the noise-reduced speech signals of the M target users are respectively enhanced to obtain the enhanced speech signals of the M target users; for any target user O among the M target users, the ratio of the amplitude of the enhanced speech signal of target user O to the amplitude of the noise-reduced speech signal of target user O is the speech enhancement coefficient of target user O; the enhanced speech signals of the M target users are fused to obtain an output signal.
  • the output signal is obtained by mixing the enhanced speech signals of M target users.
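Mode 5 can be sketched as below: unlike mode 4, every pass sees the same first noisy speech signal, so the M passes are independent (and can run in parallel, as the text notes), and the output mixes the M enhanced signals. `denoise` is again a hypothetical stand-in for the model.

```python
import numpy as np

# Parallel scheme of mode 5: independent per-user passes over the same input.
def denoise(user_data, noisy):
    return 0.5 * noisy                 # placeholder for the model's per-user output

noisy = np.ones(4)                     # toy first noisy speech signal
users = ["user_1", "user_2", "user_3"]             # M = 3
coeffs = [1.0, 2.0, 0.5]                           # per-user speech enhancement coefficients
enhanced = [c * denoise(u, noisy) for u, c in zip(users, coeffs)]
output = np.sum(enhanced, axis=0)                  # mix the M enhanced signals
print(output)  # [1.75 1.75 1.75 1.75]
```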
  • the voice-related data of the above M target users and the first noisy voice signal may be input into the voice noise reduction model in parallel, so the above actions may be processed in parallel.
  • regarding the structure of the voice noise reduction model in mode 5: when the voice-related data of the M target users are registered voice signals or video lip movement information, the structure can be that described in mode 1; when the voice-related data of the M target users are VPU signals, the structure can be that described in mode 2, or the voice noise reduction model in mode 5 realizes the function described in mode 3.
  • the enhanced voice signals of the M target users may be directly fused to obtain the above output signal.
  • the output signal is obtained by mixing the enhanced speech signals of M target users.
  • Mode 6: As shown in Figure 13, the voice-related data of the M target users and the first noisy voice signal are input into the voice noise reduction model for noise reduction processing to obtain the noise-reduced voice signals of the M target users; based on the speech enhancement coefficients of the M target users, the noise-reduced speech signals of the M target users are respectively enhanced to obtain the enhanced speech signals of the M target users; for any target user O among the M target users, the ratio of the amplitude of the enhanced speech signal of target user O to the amplitude of the noise-reduced speech signal of target user O is the speech enhancement coefficient of target user O; based on the interference noise suppression coefficient, the interference noise signal is suppressed to obtain the interference noise suppression signal, wherein the ratio of the amplitude of the interference noise suppression signal to the amplitude of the interference noise signal is the interference noise suppression coefficient; the enhanced speech signals of the M target users are fused with the interference noise suppression signal to obtain an output signal.
  • the output signal is obtained by mixing the enhanced speech signals of the M target users and the interference noise suppression signals.
  • the speech noise reduction model in mode 6 is shown in Figure 14; the speech noise reduction model includes M first encoding networks, a second encoding network, a TCN and a first decoding network; the M first encoding networks are used to respectively perform feature extraction on the registered voice signals of the M target users to obtain the feature vectors of the registered voice signals of the M target users; the second encoding network is used to extract the features of the first noisy voice signal to obtain the feature vector of the first noisy voice signal; a mathematical operation, such as dot multiplication, is performed on the feature vectors of the registered voice signals of the M target users and the feature vector of the first noisy voice signal to obtain the first feature vector; the TCN is used to process the first feature vector to obtain the second feature vector; and the first decoding network is used for processing to obtain the noise-reduced voice signals of the target users and the interference noise signal.
  • the VPU signal of each person can be collected, and noise reduction processing can then be performed according to the above-described VPU-signal-based noise reduction scheme.
  • the interference noise suppression coefficient may be a default value, or may be set by the target user based on their own needs; for example, as shown in the left figure in Figure 15, after the PNR function is enabled on the terminal device, the terminal device enters the PNR mode, and the display interface of the terminal device displays the stepless sliding control shown in the right figure in Figure 15.
  • the target user can adjust the interference noise suppression coefficient by moving the gray knob on the stepless sliding control, where the value range of the interference noise suppression coefficient is [0,1]; when the gray knob is slid to the far left, the interference noise suppression coefficient is 0, indicating that the PNR mode is not entered and the interference noise is not suppressed; when the gray knob is slid to the far right, the interference noise suppression coefficient is 1, which means that the interference noise is completely suppressed; when the gray knob is slid to a middle position, the interference noise is partially suppressed.
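The knob semantics just described can be sketched as follows: the knob position maps to a suppression coefficient in [0, 1], where 0 leaves the interference noise untouched and 1 removes it entirely, so the kept fraction of noise is (1 - coefficient) under this reading. Note that this knob convention is complementary to the amplitude-ratio definition of the suppression coefficient given earlier in the text.

```python
# Fraction of interference noise kept, under the knob convention
# (0 -> not suppressed, 1 -> completely suppressed).
def residual_noise_fraction(suppression_coeff):
    if not 0.0 <= suppression_coeff <= 1.0:
        raise ValueError("coefficient must be in [0, 1]")
    return 1.0 - suppression_coeff

print(residual_noise_fraction(0.0))  # 1.0 -> interference noise untouched
print(residual_noise_fraction(1.0))  # 0.0 -> interference noise removed
```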
  • the stepless sliding control may be in the shape of a disk as shown in FIG. 15, or in a bar shape, or in other shapes, which is not limited here.
  • the speech enhancement coefficient may also be adjusted in the above manner.
  • the following method can be used to determine whether to use the traditional noise reduction algorithm or the noise reduction method disclosed in the present application for noise reduction.
  • the method of the present application also includes:
  • acquiring the first noise segment and the second noise segment of the environment where the terminal device is located, wherein the first noise segment and the second noise segment are continuous in time; acquiring the SNR and SPL of the first noise segment; if the SNR of the first noise segment is greater than the first threshold and the SPL of the first noise segment is greater than the second threshold, extracting the first temporary feature vector of the first noise segment; performing noise reduction processing on the second noise segment based on the first temporary feature vector of the first noise segment to obtain the second noise-reduced noise segment; performing damage assessment based on the second noise-reduced noise segment and the first noise segment to obtain the first damage score; if the first damage score is not greater than the third threshold, entering the PNR mode, determining the first noisy speech signal from the noise signal generated after the first noise segment, and using the first temporary feature vector as the feature vector of the registered speech signal.
  • the terminal device sends first prompt information to the target user, the first prompt information being used to prompt the target user whether to make the terminal device enter the PNR mode; the PNR mode is entered only after an operation instruction in which the target user agrees that the terminal device enters the PNR mode is detected.
  • the default microphone of the terminal device collects a voice signal, and the collected voice signal is processed through a traditional noise reduction algorithm to obtain the user's noise-reduced voice signal; periodically (for example, every 10 minutes), the first noise segment (such as the 6 s voice signal currently collected by the microphone) and the second noise segment (such as the 10 s voice signal following the 6 s voice signal currently collected by the microphone) of the environment where the terminal device is located are acquired, and the SNR and SPL of the first noise segment are obtained; it is judged whether the SNR of the first noise segment is greater than 20 dB and whether the SPL is greater than 40 dB; if the SNR of the first noise segment is greater than the first threshold (such as 20 dB) and the SPL is greater than the second threshold (such as 40 dB), the first temporary feature vector of the first noise segment is extracted; the first temporary feature vector is used to perform noise reduction processing on the second noise segment to obtain the second noise-reduced noise segment; damage assessment is then performed based on the second noise-reduced noise segment and the first noise segment.
  • determining the first noisy speech signal from the noise signal generated after the first noise segment may be understood as meaning that the first noisy speech signal is part or all of the noise signal generated after the first noise segment.
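The threshold gating above can be sketched as follows; the concrete threshold values and the helper itself are illustrative assumptions (the patent leaves the actual thresholds open, and its example embodiment uses 20 dB SNR and 40 dB SPL):

```python
# Hypothetical threshold values; the patent leaves the actual numbers open
# (its example embodiment uses 20 dB SNR and 40 dB SPL).
FIRST_THRESHOLD_SNR_DB = 20.0
SECOND_THRESHOLD_SPL_DB = 40.0
THIRD_THRESHOLD_DAMAGE = 0.5  # assumed scale for the damage score

def should_enter_pnr(snr_db, spl_db, damage_score):
    """Mirror the gating above: the first noise segment must be clean and
    loud enough for feature extraction, and denoising the second segment
    with the extracted features must not damage it too much."""
    if snr_db <= FIRST_THRESHOLD_SNR_DB or spl_db <= SECOND_THRESHOLD_SPL_DB:
        return False  # segment unusable for temporary feature extraction
    return damage_score <= THIRD_THRESHOLD_DAMAGE

print(should_enter_pnr(25.0, 45.0, 0.3))  # True: clean, loud, low damage
print(should_enter_pnr(15.0, 45.0, 0.3))  # False: SNR below threshold
```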
  • the damage score may be a signal-to-distortion ratio (SDR) value or a perceptual evaluation of speech quality (PESQ) value.
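For the SDR variant of the damage score, one common definition compares the power of a reference signal with the power of the residual after processing; the sketch below uses that conventional definition, which the patent itself does not fix:

```python
import numpy as np

def sdr_db(reference, estimate):
    """Signal-to-distortion ratio in dB: power of the reference over the
    power of the residual (reference minus estimate)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    residual = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(residual ** 2))

ref = np.sin(np.linspace(0.0, 2.0 * np.pi, 1000))
est = ref + 0.01 * np.random.default_rng(0).standard_normal(1000)
print(round(sdr_db(ref, est), 1))  # roughly 37 dB for 1% additive noise
```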
  • the method of the present application also includes:
  • if the terminal device has stored the reference temporary voiceprint feature vector, obtain the third noise segment; perform noise reduction on the third noise segment according to the reference temporary voiceprint feature vector to obtain the third noise-reduced noise segment; perform damage assessment based on the third noise segment and the third noise-reduced noise segment to obtain the third damage score; if the third damage score is greater than the sixth threshold and the SNR of the third noise segment is less than the seventh threshold, or the third damage score is greater than the eighth threshold and the SNR of the third noise segment is not less than the seventh threshold, send the third prompt information through the terminal device, where the third prompt information prompts the current user that the terminal device can enter the PNR mode.
  • if an operation instruction indicating that the current user agrees to enable the PNR function of the terminal device is detected, the terminal device enters the PNR mode and performs noise reduction on the fourth noisy voice signal, which is acquired after the third noise segment; if it is detected that the current user does not agree to enable the PNR function, the conventional noise reduction algorithm is maintained to perform noise reduction on the fourth noisy voice signal.
  • the method of the present application also includes:
  • the second noisy voice signal is obtained and processed with the conventional noise reduction algorithm (that is, in the non-PNR mode) to obtain the current user's noise-reduced voice signal; at the same time, judge whether the SNR of the second noisy speech signal is lower than the fourth threshold; when it is, perform speech noise reduction on the second noisy speech signal according to the first temporary feature vector to obtain the current user's noise-reduced voice signal; perform damage assessment based on the current user's noise-reduced voice signal and the second noisy voice signal to obtain the second damage score; when the second damage score is not greater than the fifth threshold, a second prompt message is sent to the current user through the terminal device, prompting that the terminal device can enter the PNR mode; after an operation instruction indicating that the current user agrees that the terminal device enter the PNR mode is detected, the PNR mode is entered to perform noise reduction processing.
  • the default microphone of the terminal device collects the second noisy voice signal, processes it with a conventional noise reduction algorithm, and outputs the current user's noise-reduced voice signal; at the same time, determine whether the current environment is noisy, specifically by judging whether the SNR of the second noisy speech signal is less than the fourth threshold; when the SNR of the second noisy speech signal is less than the fourth threshold (for example, less than 10 dB), the current environment is noisy; the noise reduction algorithm of the present application then uses the previously stored speech feature (i.e., the first temporary feature vector) to perform noise reduction on the second noisy speech signal to obtain the current user's noise-reduced speech signal, and performs damage assessment based on the current user's noise-reduced speech signal and the second noisy speech signal to obtain the second damage score; the specific process can refer to the above method and is not repeated here; if the second damage score is lower than the fifth threshold, the current user matches the voice features represented by the stored first temporary feature vector; a second prompt message is then sent to the current user through the terminal device, prompting the current user to enable the PNR call function of the terminal device.
  • if an operation instruction indicating that the current user agrees to enable the PNR function of the terminal device is detected, the terminal device enters the PNR mode and performs noise reduction on the third noisy voice signal, which is acquired after the second noisy voice signal; if it is detected that the current user does not agree to enable the PNR function, the conventional noise reduction algorithm is maintained to perform noise reduction on the third noisy voice signal.
  • the following method can be used to determine whether to use the traditional noise reduction algorithm or the noise reduction method disclosed in the present application for noise reduction.
  • the method of the present application also includes:
  • use the signal collected from the environment to calculate the DOA and SPL of the first noise segment; if the DOA of the first noise segment is greater than the ninth threshold and less than the tenth threshold, and the SPL of the first noise segment is greater than the eleventh threshold, extract the second temporary feature vector of the first noise segment; perform noise reduction on the second noise segment based on the second temporary feature vector to obtain the fourth noise-reduced noise segment; perform damage assessment based on the fourth noise-reduced noise segment and the second noise segment to obtain the fourth damage score; if the fourth damage score is not greater than the twelfth threshold, enter the PNR mode.
  • Obtaining the first noisy speech signal includes:
  • a first noisy speech signal is determined from a noise signal generated after the first noise segment; the feature vector of the registered speech signal includes a second temporary feature vector.
  • the DOA and SPL of the first noise segment are calculated by using the collected signal, which may specifically include:
  • Time-frequency transform is performed on the signal collected by the microphone array to obtain a nineteenth frequency domain signal, and based on the nineteenth frequency domain signal, DOA and SPL of the first noise segment are calculated.
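As an illustration of DOA estimation, the sketch below uses a two-microphone time-domain variant that finds the inter-microphone delay at the cross-correlation peak; the patent computes DOA from the microphone array's frequency-domain signal, so this simplified method, the sampling rate, and the microphone spacing are all assumptions:

```python
import numpy as np

def estimate_doa_deg(mic_a, mic_b, fs, mic_distance_m, c=343.0):
    """Estimate direction of arrival from the time delay between two
    microphones, found at the cross-correlation peak; the angle is
    measured from the broadside direction of the microphone pair."""
    corr = np.correlate(mic_a, mic_b, mode="full")
    lag = np.argmax(corr) - (len(mic_b) - 1)  # delay of mic_a vs mic_b, samples
    tdoa = lag / fs                           # seconds
    sin_theta = np.clip(tdoa * c / mic_distance_m, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))

fs = 16000
sig = np.random.default_rng(1).standard_normal(1024)
delayed = np.roll(sig, 3)  # simulate a 3-sample inter-microphone delay
angle = estimate_doa_deg(sig, delayed, fs, mic_distance_m=0.1)
```

A 3-sample delay at 16 kHz over a 10 cm spacing corresponds to an angle of about 40 degrees off broadside.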
  • the method of the present application also includes:
  • the fourth prompt message is sent through the terminal device to prompt whether the terminal device should enter the PNR mode; the PNR mode is entered only after an operation instruction indicating that the target user agrees to enter the PNR mode is detected.
  • the terminal device is connected to a computer (a case of an auxiliary device) in a wired or wireless manner, and the microphone array of the computer collects the signal of the environment where the terminal device is located; then the terminal device obtains the The signals collected by the microphone array are then processed in the manner described above, which will not be described here.
  • the terminal device stores the first temporary feature vector or the second temporary feature vector and directly retrieves it when it is needed later; storing the first or second temporary feature vector avoids the situation where the current user's speech characteristics cannot be obtained in a high-noise scene, which would make damage assessment impossible.
  • multiple noise reduction methods are disclosed in this application. For different scenarios, it is possible to judge whether to enter the PNR mode based on the scene information, automatically identify the target user or object, and select the corresponding noise reduction method:
  • when it is detected that the terminal device is in the hands-free call state, it enters the PNR mode, and the owner who has registered a voiceprint feature is taken as the target user; obtain t seconds of the current user's voice signal during the call for voiceprint recognition and compare the recognition result with the registered voiceprint features; if it is determined that the current user is not the owner, use the acquired t seconds of the current user's voice signal during the call as the current user's registered voice signal, take the current user as the target user, and perform noise reduction in the manner described above; t can be 3 or another value.
  • when it is detected that the terminal device is in a video call state, it enters the PNR mode; during the video call, face recognition is performed on the image collected by the camera to determine the identity of the current user in the image; if the image contains multiple people, the person closest to the camera is taken as the current user, and the distance between a person in the image and the camera can be determined through sensors such as a depth sensor on the terminal device; after the current user is determined, the terminal device detects whether it has stored the registered voice of the current user or the voice characteristics of the current user; if so, the current user is determined as the target user, and the registered voice or voice characteristics of the current user are used as the current user's voice-related data; if the terminal device has stored neither, it detects whether the current user is speaking through lip-shape detection, and when it detects that the current user is speaking, it intercepts the current user's voice signal from the collected voice signal and uses it as the current user's registered voice signal.
  • when it is detected that the terminal device is connected to a headset and the terminal device is in a call state, it enters the PNR mode; the terminal device detects whether the headset has a bone voiceprint sensor; if so, the target user's VPU signal is collected through the bone voiceprint sensor of the headset, and noise reduction is performed using the methods described in methods 2, 3 and 4; if the headset does not have a bone voiceprint sensor, the user whose voice signal has been registered in the headset is taken as the target user by default, and the user's registered voice and the first noisy voice signal collected by the headset are sent to the terminal device, and the terminal device performs noise reduction using the methods described in methods 1 and 4;
  • the microphone acquires the call voice of the user currently wearing the headset, uses part of the voice as the user's registered voice, and sends the registered voice and the first noisy voice signal collected by the headset to the terminal device; the terminal device then performs noise reduction in the same manner as method 1.
  • when it is detected that the terminal device is connected to a smart device (such as a smart large-screen device, a smart watch or a car Bluetooth device) and is in a video call state, it enters the PNR mode and determines whether the current user's registered voice signal has been stored in the terminal device; if it has, the first noisy voice signal is collected through the smart device and sent to the terminal device, and the terminal device performs noise reduction using the methods described in method 1 and method 4.
  • since PNR is mainly used in environments with relatively strong noise, and the user may not always be in such an environment, an interface can be provided for the user to set the PNR function of a specific function or an application.
  • applications can be various applications that require specific voice enhancement functions, such as calls, the voice assistant, Changlian, the recorder, etc.; specific functions can be various functions that require local voice recording, such as answering calls, video recording, using the voice assistant, etc.
  • the display interface of the terminal device displays 3 function labels and 3 PNR control buttons corresponding to the 3 function labels; the user can turn the PNR function of the 3 functions off and on through the 3 PNR control buttons; as shown in the left figure of Figure 16, the PNR functions corresponding to the call and the voice assistant are turned on, and the PNR function of video recording is turned off; as shown in the right figure of Figure 16, the display interface of the terminal device displays 5 application labels and 5 PNR control buttons corresponding to the 5 application labels.
  • the user can turn the PNR function of the 5 applications off and on through the 5 PNR control buttons; as shown in the right figure in Figure 16, the PNR functions of Changba, Recorder and Changlian are turned on, and the PNR functions of calls and WeChat are turned off. It should be pointed out that when, for example, the PNR function of calls is enabled, the terminal device directly enters the PNR mode when the user uses it to make a call. In this way, the user can flexibly set whether to enable the PNR function for the different voice functions of the terminal device.
  • taking the "Call" application / "Answer a call" function as an example, a switch for enabling the PNR function is provided on the corresponding display interface of the terminal device, such as the "Enable PNR" function button in Figure 17; the left figure in Figure 17 is a schematic diagram of the display interface when a call comes in, which displays the caller's information and the "Enable PNR", "Hang up" and "Answer" function buttons; the right figure in Figure 17 is a schematic diagram of the display interface when answering a call, which displays the caller's information, the "Enable PNR" function button and the "Hang up" function button.
  • the display interface of the terminal device jumps to the interface shown in the left figure in Figure 15; the target user can adjust the noise reduction intensity by controlling the gray knob in Figure 15, that is, adjust the size of the interference noise suppression coefficient.
  • target users can flexibly enable or disable the PNR function of specific functions or applications according to their needs.
  • the present application further includes: judging whether the decibel value of the current ambient sound exceeds a preset decibel value (such as 50 dB), or detecting whether the current ambient sound contains the voice of a non-target user; if the preset decibel value is exceeded, or a non-target user's voice is detected, the PNR function is enabled.
  • when the target user uses the terminal device and noise reduction is needed, the PNR mode is entered directly; in other words, for a specific function or application of the terminal device, the corresponding PNR function can be enabled in the above manner.
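A minimal sketch of the automatic-enable check, assuming normalized microphone samples; the 50 dB preset follows the example in the text, while the full-scale calibration value `full_scale_spl_db` and the voice-detection flag are hypothetical:

```python
import numpy as np

def ambient_spl_db(samples, full_scale_spl_db=94.0):
    """Rough ambient sound level from normalized microphone samples;
    full_scale_spl_db is an assumed device calibration (the SPL that a
    full-scale digital signal corresponds to)."""
    rms = np.sqrt(np.mean(np.square(samples))) + 1e-12
    return full_scale_spl_db + 20.0 * np.log10(rms)

def should_auto_enable_pnr(samples, non_target_voice_detected, preset_db=50.0):
    """Enable the PNR function when the ambient sound exceeds the preset
    decibel value (e.g. 50 dB) or a non-target user's voice is detected."""
    return bool(ambient_spl_db(samples) > preset_db or non_target_voice_detected)
```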
  • when the target user clicks PNR as shown in a in Figure 18 and enters the PNR setting interface, the target user can turn on the "Smart Open" function of PNR through the "Smart Open" switch key shown in b in Figure 18; after the smart activation function of PNR is enabled, the PNR function can be enabled in the above-mentioned way for specific functions or applications of the terminal device.
  • the display interface of the terminal device then displays the content shown in c in Figure 18; through the PNR function key corresponding to a specific function or application, the target user can turn the PNR function of that function or application on or off as needed.
  • Enabling the smart PNR function as described above makes the terminal device more intelligent, reduces user operations, and makes user experience better.
  • the terminal device is also referred to below as the local device.
  • only the opposite-end user knows the effect of the call after the PNR function is enabled, so it is difficult for the target user to judge whether the enabled function or the set noise reduction strength allows the peer user to hear clearly; therefore, whether the PNR function of the terminal device is enabled, and the noise reduction strength, can be set by the peer device.
  • after the peer device (that is, another terminal device) detects that its user activates the PNR function of the terminal device, the peer device sends a voice enhancement request to the terminal device, where the voice enhancement request is used to request that the PNR function of the call function of the terminal device be turned on; after the terminal device receives the voice enhancement request, it responds by displaying a reminder label, which is the third prompt message, on its display interface.
  • the reminder label is used to remind the target user that the peer device requests to enable the PNR function of the call function of the local device, and to ask whether to enable it; the reminder label also includes a confirmation function button; after the terminal device detects the target user's operation on the confirmation function button, the terminal device turns on the PNR function of the call function, enters the PNR mode, and sends a response message to the peer device.
  • the response message responds to the above voice enhancement request and informs the peer device that the PNR function has been enabled; after receiving the response message, the peer device displays a prompt label on its display interface to remind the user of the peer device that the target user's voice has been enhanced.
  • the peer device sends the interference noise suppression coefficient to the terminal device to adjust the noise reduction strength of the terminal device; alternatively, the voice enhancement request sent by the peer device carries the interference noise suppression coefficient.
  • when the peer device sends the interference noise suppression coefficient to the terminal device, the peer device may also send the target user's speech enhancement coefficient to the terminal device.
  • the terminal device of user A (the above-mentioned terminal device, also the local device) and the terminal device of user B (the peer device) conduct a voice call through the base station; data is transmitted through the base station to realize the call between user A and user B.
  • the environment where user A is located is very noisy, and user B cannot hear what user A is saying; user B clicks the "enhance the other party's voice" function button displayed on the display interface of user B's terminal device, as shown in a in Figure 20, to enhance user A's voice; after detecting that user B presses this button, user B's terminal device sends a voice enhancement request to user A's terminal device, requesting that the PNR function of the call function of user A's terminal device be turned on; after user A's terminal device receives the voice enhancement request, a reminder label is displayed on its display interface, as shown in b in Figure 20, reading "the other party requests to enhance your voice, do you accept", to remind user A that user B requests to enhance his voice; if user A agrees, user A clicks the "accept" function button displayed on the display interface of his terminal device; after detecting user A's operation on the "accept" function button, user A's terminal device turns on the PNR function of the call function, enters the PNR mode, and feeds back a response message informing user B's terminal device that the PNR function of the call function of user A's terminal device has been turned on; after user B's terminal device receives the response message fed back through the base station, it displays a prompt label "the other party's voice is being enhanced" on its display interface to inform user B that user A's voice has been enhanced, as shown in c in Figure 20.
  • the terminal device may also control the peer device to enable the PNR function of the call function in the above manner.
  • the data transmitted between the terminal device and the peer device (including voice enhancement requests, response messages, etc.) is transmitted through the communication link established based on the phone numbers of the terminal device and the peer device.
  • the user of the peer device can decide, according to the voice quality of the target user that he hears, whether to control the local device to enable the PNR function of the call function; of course, the target user can likewise decide, according to the voice quality of the peer user, whether to control the peer device to enable the PNR function of the call function, thereby improving the call efficiency of both parties.
  • in a video recording scene, for example, when a parent records a video of a child, the child is far away from the terminal device (such as the shooting terminal) while the parent is relatively close, so that in the recorded video the child's voice is small while the parent's voice is loud, whereas what is actually wanted is a video in which the child's voice is loud and the parent's voice is weakened or even absent.
  • when recording a video or during a video call, the display interface of the terminal device includes a first area and a second area, where the first area is used to display the video recording result or the content of the video call in real time, and the second area is used to display controls and corresponding labels for adjusting the voice enhancement coefficients of multiple objects (or target users); based on the operation instructions on the controls for the voice enhancement coefficients, the voice enhancement coefficients of the multiple objects are obtained, and the noise-reduced voice signals of the multiple objects are then respectively enhanced according to their voice enhancement coefficients to obtain the enhanced voice signals of the multiple objects; an output signal is then obtained based on the enhanced voice signals of the multiple objects.
  • the output signal is obtained by mixing the enhanced speech signals of multiple objects.
  • the speech enhancement coefficients of the multiple objects are obtained according to the above-mentioned method, and the noise-reduced speech signal of each object is then enhanced according to its speech enhancement coefficient to obtain the enhanced speech signals of the multiple objects; the output signal is then obtained based on the enhanced speech signals of the multiple objects and the interference noise signal.
  • the output signal is a mixture of enhanced speech signals and interference noise signals of multiple objects.
  • the above-mentioned second area is also used to display a control for adjusting the interference noise suppression coefficient; based on the target user's operation instructions on the controls for adjusting the speech enhancement coefficients of the multiple objects and on the control for adjusting the interference noise suppression coefficient, the speech enhancement coefficients of the multiple objects and the interference noise suppression coefficient are acquired; the noise-reduced speech signal of each object is then enhanced to obtain the enhanced speech signals of the multiple objects; the interference noise signal is suppressed according to the interference noise suppression coefficient to obtain the interference noise suppression signal; the output signal is then obtained based on the enhanced speech signals of the multiple objects and the interference noise suppression signal.
  • the output signal is obtained by mixing the enhanced speech signals of multiple objects and the interference noise suppression signals.
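The mixing described above can be sketched as a weighted sum, assuming the enhancement and suppression coefficients are simple amplitude gains; the concrete signals and coefficient values below are hypothetical stand-ins for the slider settings:

```python
import numpy as np

def mix_output(enhanced_speech_signals, interference_noise_suppressed):
    """Mix the enhanced speech signals of multiple objects with the
    interference-noise-suppressed signal (a simple sum; the text does
    not mandate a particular mixing rule)."""
    return np.sum(enhanced_speech_signals, axis=0) + interference_noise_suppressed

denoised = np.array([[0.2, 0.4],    # object 1's noise-reduced speech
                     [0.1, -0.3]])  # object 2's noise-reduced speech
coeffs = np.array([2.0, 0.5])       # per-object speech enhancement coefficients
noise = np.array([0.05, -0.05])     # interference noise signal
suppression = 0.1                   # interference noise suppression coefficient

out = mix_output(coeffs[:, None] * denoised, suppression * noise)
```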
  • the display interface of the terminal device includes an area for displaying the video recording result (image 1) and controls for adjusting the speech enhancement coefficients of object 1 and object 2; each control includes a bar-shaped slide bar and a slide button; object 2 can adjust the speech enhancement coefficient of object 1 by dragging object 1's slide button along the slide bar, and can adjust the size of the speech enhancement coefficient of object 2 by dragging object 2's slide button along the slide bar, so as to adjust the sound volume of object 1 and object 2 during video recording.
  • object 2 is the photographer and is therefore not shown in FIG. 21.
  • object 1 can increase the voice enhancement coefficient of object 2 by dragging object 2's slide button along the slide bar, thereby increasing the voice of object 2, that is, the mother's voice.
  • the controls for adjusting the speech enhancement coefficients of Object 1 and Object 2 are not displayed when the speech enhancement coefficients do not need to be adjusted.
  • when the terminal device detects that object 1 needs to adjust the speech enhancement coefficient of object 1 or object 2, the control for adjusting that speech enhancement coefficient is displayed on the display interface of the terminal device; as shown in the right figure in Figure 23, object 1 needs to adjust the voice enhancement coefficient of object 2.
  • Object 1 long presses or clicks the display area of object 2 on the display interface of the terminal device. Of course, it can also be other operations.
  • if the terminal device does not detect any operation on the control for adjusting the voice enhancement coefficient of object 2, it hides that control.
  • the terminal device determines the voice signal features of object 2 from the database storing the voice signal features corresponding to each object, and then performs noise reduction according to the noise reduction method of the present application.
  • when the terminal device detects a click, long press or other operation on the display interface, it first recognizes the object displayed in the operated area, then determines the speech signal that needs to be enhanced based on the pre-recorded correspondence between objects and voice signals, and then sets the corresponding speech enhancement coefficient.
  • the target voice-related data includes a voice signal including a wake-up word
  • the noisy voice signal includes an audio signal including a command word
  • the above-mentioned smart interactive device is a device capable of voice interaction with the user, such as a sweeping robot, a smart speaker, a smart refrigerator, and the like.
  • the microphone collects audio signals
  • the voice wake-up module analyzes the collected audio signals to determine whether to wake up the device; it first detects the collected signals and segments out the voice portion, then performs wake-up word recognition on the voice segment to determine whether it contains the set wake-up word. For example, when controlling a smart speaker by voice commands, the user generally needs to speak the wake-up word first, such as "little A little A".
  • the audio signal containing the wake-up word obtained by the voice wake-up module is used as the registered voice signal of the target user; the microphone collects the audio signal containing the user's voice command.
  • the user speaks specific commands after waking up the device, such as "what's the weather like tomorrow?" or "please play Where Is Spring".
  • the enhanced voice signal or the output signal enhances the voice of the target user who spoke the wake-up word, and effectively suppresses other interfering speakers and background noise.
  • the new voice signal containing the wake-up word is used as the registered voice signal of the new target user, and the user who speaks the new voice signal containing the wake-up word becomes the target user.
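The replace-on-new-wake-word behavior can be sketched as follows; `extract_voiceprint` is a toy placeholder for the real voiceprint extractor, and the wake-word audio is assumed to be already segmented by the wake-up module:

```python
import numpy as np

class TargetSpeakerState:
    """Keep the most recent wake-word utterance as the registered voice of
    the current target user: each new wake-word utterance replaces the
    previous registration, as described above."""

    def __init__(self):
        self.registered_voiceprint = None

    def extract_voiceprint(self, audio):
        # Toy placeholder for a real voiceprint feature extractor.
        return np.mean(np.reshape(audio, (-1, 4)), axis=0)

    def on_wake_word(self, wake_audio):
        self.registered_voiceprint = self.extract_voiceprint(wake_audio)

state = TargetSpeakerState()
state.on_wake_word(np.ones(16))   # first speaker registers
state.on_wake_word(np.zeros(16))  # a new wake-word replaces the registration
```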
  • this embodiment provides a solution that enhances the voice of the target person and suppresses other background noise and interfering voices without registering a voice in advance and without relying on images or other sensor information; it is suitable for multi-user-oriented devices such as smart speakers and smart robots, whose users may be temporary.
  • through the introduction of the voice enhancement coefficient and the interference noise suppression coefficient, the user's need to adjust the noise reduction intensity is met; a voice noise reduction model based on the TCN or FTB+GRU structure is used for noise reduction, with a small delay in voice or video calls and a good subjective listening experience; the noise reduction method of this application can also be used in multi-person scenarios, meeting the need for multi-user noise reduction; targeted noise reduction in video scenes can automatically identify the target user and retrieve the corresponding voiceprint information from the database for noise reduction, improving the user experience; in a call or video call scene, enabling the PNR function based on the peer user's noise reduction requirement can improve the call quality of both parties; and automatically enabling the PNR function with the method of this application improves usability.
  • FIG. 24 is a schematic structural diagram of a terminal device provided in an embodiment of the present application. As shown in Figure 24, the terminal device 2400 includes:
  • the acquiring unit 2401 is configured to acquire a noisy voice signal and target voice-related data after the terminal device enters the PNR mode, wherein the noisy voice signal includes an interference noise signal and the target user's voice signal, and the target voice-related data is used to indicate the target user's voice characteristics;
  • the noise reduction unit 2402 is configured to perform noise reduction processing on the first noisy speech signal through the trained speech noise reduction model according to the target speech-related data, to obtain the noise-reduced speech signal of the target user, wherein the speech noise reduction model is implemented based on a neural network.
  • the acquiring unit 2401 is also configured to acquire the speech enhancement coefficient of the target user
  • the noise reduction unit 2402 is further configured to perform enhancement processing on the target user's noise-reduced speech signal based on the target user's speech enhancement coefficient to obtain the target user's enhanced speech signal, wherein the ratio of the amplitude of the target user's enhanced speech signal to the amplitude of the target user's noise-reduced speech signal is the target user's speech enhancement coefficient.
  • the obtaining unit 2401 is also configured to obtain the interference noise suppression coefficient after obtaining the interference noise signal through the noise reduction processing;
  • the noise reduction unit 2402 is further configured to suppress the interference noise signal based on the interference noise suppression coefficient to obtain the interference noise suppression signal, wherein the ratio of the amplitude of the interference noise suppression signal to the amplitude of the interference noise signal is the interference noise suppression coefficient;
  • the interference noise suppression signal is fused with the enhanced speech signal of the target user to obtain an output signal.
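The coefficient-based enhancement and fusion described above can be sketched as follows. This is an illustrative interpretation, not the patented implementation: scaling by a coefficient realizes the stated amplitude ratios, and fusion is taken to be a simple sum.

```python
import numpy as np

def pnr_fuse(denoised_speech, interference_noise, speech_coeff, noise_coeff):
    """Scale the target user's denoised speech by the speech enhancement
    coefficient and the interference noise by the suppression coefficient,
    then fuse (here: sum) the two to form the output signal."""
    enhanced_speech = speech_coeff * np.asarray(denoised_speech, dtype=float)
    suppressed_noise = noise_coeff * np.asarray(interference_noise, dtype=float)
    return enhanced_speech + suppressed_noise

# Example: boost the target voice, attenuate interference to 10%.
out = pnr_fuse([1.0, -2.0], [0.5, 0.5], speech_coeff=2.0, noise_coeff=0.1)
```
With `speech_coeff > 1` the target voice is amplified, and with `noise_coeff < 1` the residual interference is attenuated rather than removed entirely, which matches the adjustable noise reduction intensity described earlier.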
  • the obtaining unit 2401 is further configured to obtain the interference noise suppression coefficient after obtaining the interference noise signal through the noise reduction processing;
  • the noise reduction unit 2402 is further configured to suppress the interference noise signal based on the interference noise suppression coefficient to obtain the interference noise suppression signal, wherein the ratio of the amplitude of the interference noise suppression signal to the amplitude of the interference noise signal is the interference noise suppression coefficient;
  • the interference noise suppressed signal is fused with the noise-reduced speech signal of the target user to obtain an output signal.
  • the target users include M target users; the target voice-related data includes the voice-related data of the M target users; the noise-reduced voice signals of the target users include the noise-reduced voice signals of the M target users; the speech enhancement coefficient includes the speech enhancement coefficients of the M target users, where M is an integer greater than 1. In the aspect of performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the target speech-related data to obtain the noise-reduced speech signals of the target users, the noise reduction unit 2402 is specifically configured to:
  • perform noise reduction on the first noisy voice signal through the voice noise reduction model according to the voice-related data of target user A, so as to obtain the noise-reduced voice signal of target user A;
  • the noise reduction unit 2402 is specifically used for:
  • the noise-reduced speech signal of target user A is enhanced based on the speech enhancement coefficient of target user A to obtain the enhanced speech signal of target user A, wherein the ratio of the amplitude of the enhanced speech signal of target user A to the amplitude of the noise-reduced speech signal of target user A is the speech enhancement coefficient of target user A. The noise-reduced speech signal of each of the M target users is processed in this manner to obtain the enhanced speech signals of the M target users;
  • the noise reduction unit 2402 is further configured to obtain an output signal based on the enhanced voice signals of the M target users.
  • the target users include M target users; the target speech-related data includes the speech-related data of the M target users; the noise-reduced speech signals of the target users include the noise-reduced speech signals of the M target users; M is an integer greater than 1;
  • the noise reduction unit 2402 is specifically used for:
  • the first noisy voice signal is denoised through the voice noise reduction model according to the voice-related data of the first target user among the M target users, to obtain the noise-reduced voice signal of the first target user and a first noisy speech signal that does not contain the speech signal of the first target user; the first noisy speech signal that does not contain the speech signal of the first target user is denoised through the speech noise reduction model according to the speech-related data of the second target user among the M target users, to obtain the noise-reduced speech signal of the second target user and a first noisy speech signal that does not contain the speech signal of the first target user or the speech signal of the second target user; the above process is repeated until, according to the voice-related data of the M-th target user, noise reduction processing is performed through the voice noise reduction model on the first noisy voice signal that does not contain the voice signals of the 1st to (M-1)-th target users, obtaining the noise-reduced speech signal of the M-th target user and the interference noise signal; at this point, the noise-reduced speech signals of all M target users have been obtained.
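The sequential per-user scheme above can be sketched as a loop. This is an illustrative sketch only: `model(residual, feat)` is a hypothetical callable standing in for one pass of the speech noise reduction model, returning that user's denoised speech and the residual with that user's speech removed.

```python
import numpy as np

def sequential_denoise(noisy, user_features, model):
    """Apply the noise reduction model once per target user. Each pass
    extracts one user's speech; after the final pass the residual is the
    interference noise signal."""
    residual = np.asarray(noisy, dtype=float)
    denoised = []
    for feat in user_features:
        speech, residual = model(residual, feat)
        denoised.append(speech)
    return denoised, residual  # residual == interference noise signal
```
A design consequence worth noting: because each pass consumes the previous pass's residual, the scheme never attributes the same signal component to two users, at the cost of M sequential model invocations.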
  • the target users include M target users; the target speech-related data includes the speech-related data of the M target users; the noise-reduced speech signals of the target users include the noise-reduced speech signals of the M target users; M is an integer greater than 1;
  • the noise reduction unit 2402 is specifically used for:
  • the first noisy voice signal is denoised through the voice noise reduction model, so as to obtain the noise-reduced voice signals of the M target users and the interference noise signal.
  • the target users include M target users; the relevant data of the target users includes the registered voice signals of the target users; a registered voice signal of a target user is a voice signal of the target user collected in an environment where the noise decibel value is lower than a preset value; the speech noise reduction model includes a first encoding network, a second encoding network, a TCN and a first decoding network;
  • the noise reduction unit 2402 is specifically used for:
  • the noise reduction unit 2402 is also configured to obtain an interference noise signal according to the first decoding network and the second feature vector.
  • the relevant data of the target user A includes the registered voice signal of the target user A
  • the registered voice signal of target user A is the voice signal of target user A collected in an environment where the noise decibel value is lower than a preset value;
  • the speech noise reduction model includes a first encoding network, a second encoding network, a TCN and a first decoding network; in the aspect of performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the speech-related data of target user A,
  • the noise reduction unit 2402 is specifically used for:
  • the first encoding network and the second encoding network are used to extract the features of the registered speech signal of target user A and of the first noisy speech signal, so as to obtain the feature vector of the registered speech signal of target user A and the feature vector of the first noisy speech signal; the first feature vector is obtained according to the feature vector of the registered voice signal of target user A and the feature vector of the first noisy voice signal; the second feature vector is obtained according to the TCN and the first feature vector; the noise-reduced speech signal of target user A is obtained according to the first decoding network and the second feature vector.
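The encoder/TCN/decoder data flow above can be sketched with stand-in components. Everything here is an assumption for illustration: the dimensions, the linear projections standing in for the encoding/decoding networks, the feature fusion by summation, and the moving average standing in for the TCN (a real TCN uses stacked dilated causal convolutions).

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT, FRAMES = 8, 16  # illustrative feature size and frame count

def encode(x, w):
    """Stand-in for the first/second encoding networks: linear + tanh."""
    return np.tanh(w @ x)

def tcn(x):
    """Stand-in for the TCN: a causal average of each frame with the
    previous frame, so each output depends only on past and present."""
    y = np.copy(x)
    y[:, 1:] = 0.5 * (x[:, 1:] + x[:, :-1])
    return y

def decode(x, w):
    """Stand-in for the first decoding network."""
    return w @ x

w_enc1 = rng.standard_normal((FEAT, FEAT))
w_enc2 = rng.standard_normal((FEAT, FEAT))
w_dec = rng.standard_normal((FEAT, FEAT))

registered = rng.standard_normal((FEAT, FRAMES))  # registered speech features
noisy = rng.standard_normal((FEAT, FRAMES))       # first noisy speech signal

# First feature vector: fusion of the two encoder outputs (sum, for brevity).
first_vec = encode(registered, w_enc1) + encode(noisy, w_enc2)
second_vec = tcn(first_vec)           # second feature vector via the TCN
denoised = decode(second_vec, w_dec)  # target user A's denoised features
```
The point of the sketch is the wiring: the registration signal conditions the network on *whose* voice to keep, the TCN models temporal context causally (hence the low call delay claimed earlier), and the decoder maps back to the signal domain.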
  • the relevant data of the i-th target user among the M target users includes the registration voice signal of the i-th target user, i is an integer greater than 0 and less than or equal to M, and the voice noise reduction model includes A coding network, a second coding network, a TCN and a first decoding network, the noise reduction unit 2402 is specifically used for:
  • the first noise signal is the first noisy speech signal that does not contain the speech signals of the 1st to (i-1)-th target users; the first feature vector is obtained according to the feature vector of the registered speech signal of the i-th target user and the feature vector of the first noise signal; the second feature vector is obtained according to the TCN and the first feature vector; the noise-reduced voice signal of the i-th target user and the second noise signal are obtained according to the first decoding network and the second feature vector, wherein the second noise signal is the first noisy speech signal that does not contain the speech signals of the 1st to i-th target users.
  • the relevant data of each target user includes the registered voice signal of that target user; the registered voice signal of target user A is the voice signal of target user A collected when the noise decibel value is lower than the preset value; the voice noise reduction model includes M first encoding networks, a second encoding network, a TCN, a first decoding network and M third decoding networks; in the aspect of performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the target voice-related data, so as to obtain the noise-reduced speech signals of the target users and the interference noise signal,
  • the noise reduction unit 2402 is specifically used for:
  • the relevant data of the target user includes the VPU signal of the target user
  • the speech noise reduction model includes a preprocessing module, a third encoding network, a GRU, a second decoding network and a postprocessing module,
  • the noise reduction unit 2402 is specifically used for:
  • time-frequency transformation is performed on the first noisy speech signal and the VPU signal of the target user through the preprocessing module to obtain the first frequency domain signal of the first noisy speech signal and the second frequency domain signal of the VPU signal; the first frequency domain signal is fused with the second frequency domain signal to obtain a first fused frequency domain signal;
  • the first fused frequency domain signal is successively processed through the third encoding network, the GRU and the second decoding network to obtain a mask of the third frequency domain signal of the target user's voice signal;
  • the first frequency domain signal is post-processed by the post-processing module according to the mask of the third frequency domain signal to obtain the third frequency domain signal;
  • frequency-time transformation is performed on the third frequency domain signal to obtain the noise-reduced speech signal of the target user; wherein, the third encoding network and the second decoding network are both implemented based on convolutional layers and frequency transformation blocks (FTB).
  • the noise reduction unit 2402 is specifically used for:
  • the first fused frequency domain signal is successively processed through the third encoding network, the GRU and the second decoding network to obtain a mask of the first frequency domain signal; the first frequency domain signal is post-processed through the post-processing module according to the mask of the first frequency domain signal to obtain a fourth frequency domain signal of the interference noise signal; and frequency-time transformation is performed on the fourth frequency domain signal to obtain the interference noise signal.
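The mask-based post-processing in the two passages above can be sketched as follows. The complementary-mask split (speech mask for the target voice, its complement for the interference noise) is a common masking scheme and is assumed here; the application's post-processing module may differ.

```python
import numpy as np

def mask_postprocess(noisy_fd, speech_mask):
    """Apply a predicted speech mask to the frequency-domain noisy signal
    to obtain the speech spectrum, and its complement to obtain the
    interference-noise spectrum."""
    speech_mask = np.clip(speech_mask, 0.0, 1.0)
    speech_fd = noisy_fd * speech_mask
    noise_fd = noisy_fd * (1.0 - speech_mask)
    return speech_fd, noise_fd
```
Note that the two outputs sum back to the noisy spectrum, which is what makes the later fusion of the enhanced speech and the suppressed noise a consistent decomposition of the input.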
  • the relevant data of the target user A includes the VPU signal of the target user A
  • the voice noise reduction model includes a preprocessing module, a third encoding network, a GRU, a second decoding network and a postprocessing module.
  • in the aspect of performing noise reduction processing on the first noisy voice signal through the voice noise reduction model according to the voice-related data of target user A to obtain the noise-reduced voice signal of target user A,
  • the noise reduction unit 2402 is specifically used for:
  • time-frequency transformation is performed on the first noisy speech signal and the VPU signal of target user A through the preprocessing module to obtain the first frequency domain signal of the first noisy speech signal and the ninth frequency domain signal of the VPU signal of target user A; the first frequency domain signal and the ninth frequency domain signal are fused to obtain a second fused frequency domain signal; the second fused frequency domain signal is successively processed through the third encoding network, the GRU and the second decoding network to obtain a mask of the tenth frequency domain signal of the voice signal of target user A;
  • the first frequency domain signal is post-processed through the post-processing module according to the mask of the tenth frequency domain signal to obtain the tenth frequency domain signal;
  • frequency-time transformation is performed on the tenth frequency domain signal to obtain the noise-reduced speech signal of target user A;
  • both the third encoding network and the second decoding network are implemented based on convolutional layers and FTB.
  • the relevant data of the i-th target user among the M target users includes the VPU signal of the i-th target user, i is an integer greater than 0 and less than or equal to M, and the noise reduction unit 2402 is specifically used to :
  • time-frequency transformation is performed on both the first noise signal and the VPU signal of the i-th target user through the preprocessing module to obtain the eleventh frequency domain signal of the first noise signal and the twelfth frequency domain signal of the VPU signal of the i-th target user; the eleventh frequency domain signal and the twelfth frequency domain signal are fused to obtain a third fused frequency domain signal; the third fused frequency domain signal is successively processed through the third encoding network, the GRU and the second decoding network to obtain a mask of the thirteenth frequency domain signal of the voice signal of the i-th target user and a mask of the eleventh frequency domain signal; the eleventh frequency domain signal is post-processed through the post-processing module according to the mask of the thirteenth frequency domain signal and the mask of the eleventh frequency domain signal, to obtain the thirteenth frequency domain signal and the fourteenth frequency domain signal of the second noise signal; frequency-time transformation is performed on the thirteenth frequency domain signal and the fourteenth frequency domain signal to obtain the noise-reduced voice signal of the i-th target user and the second noise signal, wherein the second noise signal is the first noisy speech signal that does not contain the speech signals of the 1st to i-th target users.
  • the noise reduction unit 2402 is specifically configured to:
  • the noise-reduced voice signal of target user A is enhanced based on the voice enhancement coefficient of target user A to obtain the enhanced voice signal of target user A, wherein the ratio of the amplitude of the enhanced voice signal of target user A to the amplitude of the noise-reduced voice signal of target user A is the voice enhancement coefficient of target user A;
  • the noise reduction unit 2402 is specifically used for:
  • the enhanced speech signals of the M target users are fused with the interference noise suppression signal to obtain an output signal.
  • the relevant data of the target user includes the VPU signal of the target user, and the acquiring unit 2401 is further configured to: acquire the in-ear sound signal of the target user;
  • the noise reduction unit 2402 is specifically used for:
  • the covariance matrix of the first noisy speech signal and the in-ear sound signal is obtained according to the first frequency domain signal and the fifth frequency domain signal; a first minimum variance distortionless response (MVDR) weight is obtained based on the covariance matrix; the sixth frequency domain signal of the first noisy speech signal and the seventh frequency domain signal of the in-ear sound signal are obtained based on the first MVDR weight, the first frequency domain signal and the fifth frequency domain signal; the eighth frequency domain signal of the noise-reduced speech signal is obtained according to the sixth frequency domain signal and the seventh frequency domain signal; frequency-time transformation is performed on the eighth frequency domain signal to obtain the noise-reduced speech signal of the target user.
  • noise reduction unit 2402 is also used for:
  • An interference noise signal is obtained for the first noisy speech signal according to the noise-reduced speech signal of the target user.
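The MVDR weight computation referenced above can be sketched with the standard closed form, w = R⁻¹d / (dᴴR⁻¹d), where R is the covariance matrix of the channels (here, the microphone and in-ear signals) and d a steering vector toward the target. This is the textbook MVDR formula, offered as one plausible realization; the application does not spell out its exact computation.

```python
import numpy as np

def mvdr_weights(cov, steering, diag_load=1e-6):
    """Standard MVDR weights: minimize output power subject to a
    distortionless response in the steering direction. Diagonal loading
    keeps the covariance inversion well conditioned."""
    n = cov.shape[0]
    r_inv = np.linalg.inv(cov + diag_load * np.eye(n))
    num = r_inv @ steering
    return num / (steering.conj() @ num)
```
With an identity covariance (uncorrelated, equal-power channels) the weights reduce to the steering vector itself, a useful sanity check.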
  • the relevant data of the target user A includes the VPU signal of the target user A, and the acquiring unit 2401 is also used to acquire the in-ear sound signal of the target user A;
  • the noise reduction unit 2402 is specifically used for:
  • the obtaining unit 2401 is also used to:
  • obtain the first noise segment and the second noise segment of the environment where the terminal device is located, wherein the first noise segment and the second noise segment are temporally consecutive noise segments; obtain the signal-to-noise ratio (SNR) and sound pressure level (SPL) of the first noise segment;
  • the terminal device 2400 also includes:
  • the obtaining unit 2401 is specifically used to:
  • a first noisy speech signal is determined from the noise signal generated after the first noise segment; the feature vector of the registered speech signal includes a first temporary feature vector.
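The SNR and SPL quantities used throughout these threshold checks can be computed as follows. These are common textbook definitions (energy-ratio SNR; SPL referenced to 20 µPa), assumed here since the application does not define them explicitly.

```python
import numpy as np

def snr_db(speech, noise):
    """SNR in dB as the ratio of speech energy to noise energy."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    return 10.0 * np.log10(np.sum(np.square(speech)) / np.sum(np.square(noise)))

def spl_db(pressure, p_ref=20e-6):
    """Sound pressure level in dB re 20 uPa from RMS pressure samples."""
    rms = np.sqrt(np.mean(np.square(np.asarray(pressure, dtype=float))))
    return 20.0 * np.log10(rms / p_ref)
```
For example, a segment whose speech and noise energies are equal has an SNR of 0 dB, and a constant pressure of 200 µPa corresponds to an SPL of 20 dB.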
  • the determining unit 2403 is further configured to:
  • the first prompt message is sent by the terminal device, and the first prompt message is used to prompt whether to enable the terminal device to enter the PNR mode; the PNR mode is entered only after an operation instruction of the target user agreeing to enter the PNR mode is detected.
  • the obtaining unit 2401 is further configured to obtain the second noisy speech signal when it is detected that the terminal device is used again;
  • the noise reduction unit 2402 is further configured to: when the SNR of the second noisy speech signal is lower than the fourth threshold, perform noise reduction processing on the second noisy speech signal according to the first temporary feature vector, so as to obtain the noise reduction of the current user. noisy speech signal;
  • the determining unit 2403 is further configured to perform damage assessment based on the current user's noise-reduced voice signal and the second noisy voice signal to obtain a second damage score; when the second damage score is not greater than the fifth threshold, second prompt information is sent through the terminal device, the second prompt information being used to remind the current user that the terminal device can enter the PNR mode; after an operation instruction of the current user agreeing to enter the PNR mode is detected, the terminal device enters the PNR mode to perform noise reduction processing on a third noisy voice signal, wherein the third noisy voice signal is obtained after the second noisy voice signal; after an operation instruction of the current user not agreeing to enter the PNR mode is detected, the non-PNR mode is used to perform noise reduction processing on the third noisy voice signal.
  • the acquiring unit 2401 is further configured to: if the SNR of the first noise segment is not greater than the first threshold or the SPL of the first noise segment is not greater than the second threshold, and the terminal device has stored a reference temporary voiceprint feature vector, obtain the third noise segment;
  • the noise reduction unit 2402 is further configured to perform noise reduction processing on the third noise segment according to the reference temporary voiceprint feature vector to obtain a third noise reduction noise segment;
  • the determining unit 2403 is further configured to perform damage assessment according to the third noise segment and the third denoising noise segment to obtain a third damage score; if the third damage score is greater than the sixth threshold and the SNR of the third noise segment is less than the seventh threshold , or the third damage score is greater than the eighth threshold and the SNR of the third noise segment is not less than the seventh threshold, then a third prompt message is sent through the terminal device, and the third prompt message is used to prompt the current user that the terminal device can enter the PNR mode; After detecting the operation instruction of the current user agreeing to enter the PNR mode, the terminal device enters the PNR mode to perform noise reduction processing on the fourth noisy voice signal; after detecting the operation instruction of the current user not agreeing to enter the PNR mode , using a non-PNR mode to perform noise reduction processing on the fourth noisy speech signal; wherein, the fourth noisy speech signal is determined from the noise signal generated after the third noise segment.
  • the acquisition unit 2401 is also configured to acquire the first noise segment and the second noise segment of the environment where the terminal device 2400 is located, the first noise segment and the second noise segment being temporally continuous noise segments, and to acquire the signal collected by the microphone array of the auxiliary device of the terminal device 2400 for the environment where the terminal device 2400 is located;
  • the terminal device 2400 also includes:
  • the determining unit 2403 is configured to calculate, using the collected signal, the direction of arrival (DOA) and the SPL of the first noise segment; if the DOA of the first noise segment is greater than the ninth threshold and less than the tenth threshold, and the SPL of the first noise segment is greater than the eleventh threshold, extract the second temporary feature vector of the first noise segment, and perform noise reduction processing on the second noise segment based on the second temporary feature vector to obtain the third noise reduction noise segment; perform damage assessment based on the third noise reduction noise segment and the second noise segment to obtain a fourth damage score; if the fourth damage score is greater than the twelfth threshold, enter the PNR mode;
  • the obtaining unit 2401 is specifically used to:
  • a first noisy speech signal is determined from a noise signal generated after the first noise segment; the feature vector of the registered speech signal includes a second temporary feature vector.
  • the determining unit 2403 is further configured to:
  • the fourth prompt message is sent by the terminal device 2400, and the fourth prompt message is used to prompt whether to make the terminal device 2400 enter the PNR mode; the terminal device 2400 enters the PNR mode only after detecting an operation instruction of the target user agreeing to enter the PNR mode.
  • the terminal device 2400 also includes:
  • the detection unit 2404 is configured to not enter the PNR mode when it is detected that the terminal device is in the handset talking state;
  • when it is detected that the terminal device is in the hands-free call state, enter the PNR mode, wherein the target user is the owner of the terminal device or the user who is using the terminal device;
  • when it is detected that the terminal device is in a video call, enter the PNR mode, wherein the target user is the owner of the terminal device or the user closest to the terminal device;
  • when it is detected that the terminal device is connected to a headset for a call, enter the PNR mode, wherein the target user is the user wearing the headset, and the first noisy voice signal and the target voice-related data are collected through the headset; or,
  • when it is detected that the terminal device is connected to a smart large-screen device, a smart watch or a vehicle-mounted device, enter the PNR mode, wherein the target user is the owner of the terminal device or the user who is using the terminal device, and the first noisy voice signal and the target voice-related data are collected by the audio collection hardware of the smart large-screen device, smart watch or vehicle-mounted device.
  • the acquiring unit 2401 is also configured to: acquire the decibel value of the audio signal in the current environment,
  • the terminal device 2400 also includes:
  • the control unit 2405 is configured to: if the decibel value of the audio signal in the current environment exceeds the preset decibel value, determine whether the PNR function corresponding to the application program activated by the terminal device is enabled; if it is not enabled, enable the PNR function corresponding to the application program activated by the terminal device, and enter the PNR mode.
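The auto-enable decision above reduces to a small piece of control logic. A minimal sketch, assuming a 65 dB preset purely for illustration (the application leaves the preset value unspecified):

```python
def should_enable_pnr(ambient_db, pnr_enabled, preset_db=65.0):
    """Return True if the PNR function should be on: either it already is,
    or the ambient audio level exceeds the preset decibel value."""
    if ambient_db > preset_db and not pnr_enabled:
        return True  # enable the active app's PNR function, enter PNR mode
    return pnr_enabled
```
Keeping the check idempotent (an already-enabled function stays enabled) avoids repeatedly re-triggering the mode switch as the ambient level fluctuates around the preset.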
  • the terminal device 2400 includes a display screen 2408, and the display screen 2408 includes multiple display areas,
  • each display area in the plurality of display areas displays a label and a corresponding function key, and the function key is used to control the opening and closing of the PNR function of the application program indicated by the corresponding label.
  • when voice data transmission is performed between the terminal device 2400 and another terminal device, the terminal device 2400 further includes:
  • the receiving unit 2406 is configured to receive a voice enhancement request sent by another terminal device, where the voice enhancement request is used to instruct the terminal device to enable the PNR function of the call function;
  • the control unit 2405 is configured to send third prompt information through the terminal device in response to the voice enhancement request, the third prompt information being used to prompt whether to enable the PNR function of the call function on the terminal device; after an operation instruction agreeing to enable the PNR function of the call function is detected, turn on the PNR function of the call function and enter the PNR mode;
  • the sending unit 2407 is configured to send a voice enhancement response message to another terminal device, where the voice enhancement response message is used to indicate that the terminal device has enabled the PNR function of the call function.
  • when the terminal device starts the video call or video recording function, the display interface of the terminal device includes a first area and a second area, wherein the first area is used to display the content of the video call or video recording, and the second area is used to display M controls and the corresponding M labels.
  • the M controls correspond to the M target users one by one.
  • Each control in the M controls includes a sliding button and a sliding bar. By controlling the sliding button to slide on the sliding bar, to adjust the speech enhancement coefficient of the target user indicated by the label corresponding to the control.
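The per-user sliders above ultimately map a slider position to a speech enhancement coefficient. A minimal sketch of such a mapping, with the coefficient range [0, 2] chosen as an illustrative assumption (the application does not specify the range):

```python
def slider_to_coefficient(position, max_coeff=2.0):
    """Map a sliding-button position in [0, 1] to a speech enhancement
    coefficient in [0, max_coeff], clamping out-of-range positions."""
    position = min(1.0, max(0.0, position))
    return position * max_coeff
```
A mid-range slider then leaves the voice unchanged (coefficient 1.0), the left end mutes that user, and the right end doubles the amplitude, which matches the per-user adjustment the controls are described as providing.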
  • when the terminal device starts the video call or video recording function, the display interface of the terminal device includes a first area, the first area being used to display the content of the video call or video recording; the terminal device 2400 also includes:
  • the control unit 2405 is configured to display a control corresponding to any object in the video call content or video recording content in the first area when an operation on that object is detected, wherein the control includes a sliding button and a sliding bar, and the sliding button is controlled to slide on the sliding bar to adjust the voice enhancement coefficient of the object.
  • the target voice related data is the target user's voice signal including the wake-up word
  • the noisy voice signal is the target user's audio signal including the command word
  • the terminal device 2400 is presented in the form of a unit.
  • the "unit” here may refer to an application-specific integrated circuit (ASIC), a processor and memory executing one or more software or firmware programs, an integrated logic circuit, and/or other devices that can provide the above functions .
  • the acquisition unit 2401 , noise reduction unit 2402 , determination unit 2403 , detection unit 2404 and control unit 2405 above may be implemented by the processor 2601 of the terminal device shown in FIG. 26 .
  • FIG. 25 is a schematic structural diagram of another terminal device provided by an embodiment of the present application.
  • the terminal device 2500 includes:
  • the sensor collection unit 2501 is configured to collect noisy speech signals, registered speech signals of the target user, VPU signals, video images, depth images and other information that can be used to determine the target user.
  • the storage unit 2502 is configured to store noise reduction parameters (including target user's speech enhancement coefficient and interference noise suppression coefficient), registered target users and their speech feature information.
  • the UI interaction unit 2504 is configured to receive user interaction information and send it to the noise reduction control unit 2506, and feed back the information fed back by the noise reduction control unit 2506 to the local user.
  • the communication unit 2505 is configured to send and receive interaction information with the peer user, and optionally, transmit a noisy voice signal of the peer and voice registration information of the peer user.
  • the processing unit 2503 includes a noise reduction control unit 2506 and a PNR processing unit 2507, wherein,
  • the noise reduction control unit 2506 is configured to configure the PNR noise reduction parameters according to the interaction information received by the local end and the peer end and the information stored in the storage unit, including but not limited to determining the user or target user for voice enhancement, voice enhancement coefficient and interference noise suppression coefficient, whether to enable the noise reduction function and the noise reduction method.
  • the PNR processing unit 2507 is configured to process the noisy speech signal collected by the sensor collection unit according to the configured noise reduction parameters to obtain an enhanced audio signal, that is, an enhanced speech signal of the target user.
  • the terminal device 2600 can be implemented with the structure in FIG. 26 , and the terminal device 2600 includes at least one processor 2601 , at least one memory 2602 , at least one display screen 2604 and at least one communication interface 2603 .
  • the processor 2601 , the memory 2602 , the display screen 2604 and the communication interface 2603 are connected through the communication bus and complete mutual communication.
  • the processor 2601 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the programs of the above solutions.
  • the communication interface 2603 is used to communicate with other devices or communication networks, such as Ethernet, radio access network (RAN), wireless local area network (Wireless Local Area Networks, WLAN), etc.
  • the memory 2602 may be a read-only memory (ROM) or other type of static storage device that can store static information and instructions, or a random access memory (RAM) or other type of dynamic storage device that can store information and instructions; it may also be an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store the desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • the memory may exist independently and be connected to the processor through the bus, or the memory may be integrated with the processor.
  • the display screen 2604 may be an LCD display screen, an LED display screen, an OLED display screen, a 3D display screen or other display screens.
  • the memory 2602 is used to store the application program code for executing the above solutions, with execution controlled by the processor 2601; the function buttons, labels, and so on described in the above method embodiments are displayed on the display screen.
  • the processor 2601 is configured to execute application program codes stored in the memory 2602 .
  • the code stored in the memory 2602 can execute any of the speech enhancement methods provided above, for example: after the terminal device enters the PNR mode, obtaining a first noisy speech signal and target speech related data, where the first noisy speech signal contains an interference noise signal and the target user's speech signal, and the target speech related data is used to indicate the speech characteristics of the target user; and performing, according to the target speech related data, noise reduction processing on the first noisy speech signal through a trained speech noise reduction model to obtain the target user's noise-reduced speech signal, where the speech noise reduction model is implemented based on a neural network.
  • an embodiment of the present application also provides a computer storage medium, where the computer storage medium can store a program, and the program, when executed, performs some or all of the steps of any speech enhancement method described in the above method embodiments.
  • the disclosed device can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division into units is only a logical function division; in actual implementation there may be other division manners. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed.
  • the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces; the indirect coupling or communication connection between devices or units may be electrical or in other forms.
  • the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable memory.
  • the technical solution of the present application, in essence, or the part contributing to the prior art, or the whole or part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a memory and executed by a computer device (which may be a personal computer, a server, a network device, etc.).
  • the aforementioned memory includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.


Abstract

A speech enhancement method and a related device. The method comprises: after a terminal device enters a PNR mode, obtaining a first noisy speech signal and target speech related data, the first noisy speech signal comprising an interference noise signal and a speech signal of a target user, and the target speech related data being used for indicating a speech feature of the target user (S301); and performing, according to the target speech related data, noise reduction processing on the first noisy speech signal by means of a speech noise reduction model to obtain a noise-reduced speech signal of the target user, wherein the speech noise reduction model is implemented on the basis of a neural network (S302). Speech enhancement for the target user and suppression of interference are thereby realized.

Description

Speech Enhancement Method and Related Device
This application claims priority to the following Chinese patent applications filed with the China National Intellectual Property Administration: application No. 202110611024.0, entitled "Speech Enhancement Method and Related Device", filed on May 31, 2021; application No. 202110694849.3, entitled "Speech Enhancement Method and Related Device", filed on June 22, 2021; and application No. 202111323211.5, entitled "Speech Enhancement Method and Related Device", filed on November 9, 2021; the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of speech processing, and in particular to a speech enhancement method and a related device.
Background
In recent years, smart devices have greatly enriched people's lives. When a device operates in a quiet environment, voice call quality and voice interaction functions (wake-up and recognition rates) can already satisfy users' needs. However, when the device operates under environmental noise or speech interference, the perceived voice call quality, wake-up rate, and recognition rate all degrade, and speech enhancement algorithms are needed to enhance the target speech and filter out the interference.
Environmental noise suppression and speech interference suppression have long been active research topics. Among general noise reduction methods, one approach estimates the background noise from the signal collected over a period of time, based on the difference in spectral characteristics between background noise and speech/music signals, and then suppresses the environmental noise according to the estimated background noise characteristics. This approach works well for stationary noise but fails completely for speech interference. Another approach exploits, in addition to these spectral differences, differences in inter-channel correlation, for example multi-channel noise suppression or microphone-array beamforming. Such methods can suppress speech interference arriving from a particular direction to some extent, but their ability to track changes in the direction of the interference source often falls short, and they cannot enhance the speech of a specific target person.
At present, speech enhancement and interference suppression are mainly implemented with conventional or artificial intelligence (AI)-based general-purpose noise reduction, source separation, and similar algorithms. These methods can generally improve voice calls and voice interaction, but under speech-interference conditions it is difficult to highlight the target speech while suppressing the interfering speech, resulting in a poor experience.
Summary
Embodiments of this application provide a speech enhancement method and a related device. With the embodiments of this application, in various environmental-noise and speech-interference scenarios, all interference noise other than the target user's speech can be suppressed and the target user's voice highlighted, improving the user's experience in voice calls, voice interaction, and the like.
According to a first aspect, an embodiment of this application provides a speech enhancement method, including: after a terminal device enters a personalized noise reduction (PNR) mode, obtaining a first noisy speech signal and target speech related data, where the first noisy speech signal contains an interference noise signal and the target user's speech signal, and the target speech related data is used to indicate the speech characteristics of the target user; and performing, according to the target speech related data, noise reduction processing on the first noisy speech signal through a trained speech noise reduction model to obtain the target user's noise-reduced speech signal, where the speech noise reduction model is implemented based on a neural network.
The interference noise signal includes speech signals of non-target users, environmental noise signals (such as car horns or the sound of operating machinery), and the like.
Optionally, the target speech related data may be the target user's registered speech signal, the target user's voice pickup (VPU) signal, the target user's voiceprint features, the target user's video lip-movement information, or the like.
The target speech related data guides the speech noise reduction model to extract the target user's speech signal from the noisy speech signal, suppressing all interference noise other than the target user's speech and highlighting the target user's voice, which improves the user's experience in voice calls, voice interaction, and the like.
In a feasible embodiment, the method of this application further includes:
obtaining the target user's speech enhancement coefficient; and performing enhancement processing on the target user's noise-reduced speech signal based on the target user's speech enhancement coefficient to obtain the target user's enhanced speech signal, where the ratio of the amplitude of the target user's enhanced speech signal to the amplitude of the target user's noise-reduced speech signal is the target user's speech enhancement coefficient.
By introducing the target user's speech enhancement coefficient, the target user's speech signal can be further enhanced, thereby further highlighting the target user's voice and suppressing non-target users' voices, improving the user's experience in voice calls, voice interaction, and the like.
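The amplitude-ratio relationship above can be sketched as a single scaling step. This is an illustrative sketch only, not the patent's implementation; the array values and the coefficient 1.5 are made up:

```python
import numpy as np

def enhance(denoised: np.ndarray, coeff: float) -> np.ndarray:
    """Scale the target user's noise-reduced speech by the speech
    enhancement coefficient: the enhanced signal's amplitude divided by
    the noise-reduced signal's amplitude equals `coeff`."""
    return coeff * denoised

denoised = np.array([0.10, -0.20, 0.05])
enhanced = enhance(denoised, 1.5)
```

A coefficient greater than 1 boosts the target speech relative to everything else in the final mix.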
Further, the noise reduction processing also yields the interference noise signal, and the method of this application further includes:
obtaining an interference noise suppression coefficient; performing suppression processing on the interference noise signal based on the interference noise suppression coefficient to obtain an interference noise suppressed signal, where the ratio of the amplitude of the interference noise suppressed signal to the amplitude of the interference noise signal is the interference noise suppression coefficient; and fusing the interference noise suppressed signal with the target user's enhanced speech signal to obtain an output signal.
Optionally, the value range of the interference noise suppression coefficient is (0, 1).
Introducing the interference noise suppression coefficient further suppresses non-target users' voices and thereby indirectly highlights the target user's voice.
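The suppression-and-fusion step can be sketched as follows. The additive fusion and the specific coefficient 0.2 are illustrative assumptions; the patent only fixes the amplitude ratio and the (0, 1) range:

```python
import numpy as np

def fuse_output(enhanced_speech: np.ndarray,
                interference: np.ndarray,
                beta: float) -> np.ndarray:
    """Attenuate the interference noise signal by the suppression
    coefficient beta (0 < beta < 1), then fuse it with the target
    user's enhanced speech to form the output signal."""
    assert 0.0 < beta < 1.0
    suppressed = beta * interference      # amplitude ratio equals beta
    return suppressed + enhanced_speech   # simple additive fusion

speech = np.array([0.5, -0.5])
noise = np.array([0.1, 0.2])
out = fuse_output(speech, noise, 0.2)
```

Keeping a small residual of the interference (rather than beta = 0) matches the comfort-noise rationale discussed below.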
In a feasible embodiment, the noise reduction processing also yields the interference noise signal, and the method of this application further includes:
obtaining an interference noise suppression coefficient; performing suppression processing on the interference noise signal based on the interference noise suppression coefficient to obtain an interference noise suppressed signal, where the ratio of the amplitude of the interference noise suppressed signal to the amplitude of the interference noise signal is the interference noise suppression coefficient; and fusing the interference noise suppressed signal with the target user's noise-reduced speech signal to obtain an output signal.
In practical applications, hearing only the target user's voice with no noise at all is disconcerting for the user. By introducing the interference noise suppression coefficient and the interference noise signal, the interference noise is suppressed while some noise remains audible during the call, which improves the user experience.
In a feasible embodiment, there are M target users, the target speech related data includes the speech related data of the M target users, the target user's noise-reduced speech signal includes the noise-reduced speech signals of the M target users, the target user's speech enhancement coefficient includes the speech enhancement coefficients of the M target users, and M is an integer greater than 1.
Performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the target speech related data to obtain the target user's noise-reduced speech signal includes:
for any target user A among the M target users, performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to target user A's speech related data to obtain target user A's noise-reduced speech signal; processing each of the M target users in this manner yields the noise-reduced speech signals of the M target users.
Performing enhancement processing on the target user's noise-reduced speech signal based on the target user's speech enhancement coefficient to obtain the target user's enhanced speech signal includes:
processing target user A's noise-reduced speech signal based on target user A's speech enhancement coefficient to obtain target user A's enhanced speech signal, where the ratio of the amplitude of target user A's enhanced speech signal to the amplitude of target user A's noise-reduced speech signal is target user A's speech enhancement coefficient; processing the noise-reduced speech signal of each of the M target users in this manner yields the enhanced speech signals of the M target users.
The method of this application further includes: obtaining an output signal based on the enhanced speech signals of the M target users.
With the above parallel approach, the speech signals of multiple target users can be enhanced, and for each of the multiple target users the enhanced speech signal can be further adjusted by setting the speech enhancement coefficient, thereby solving the problem of speech noise reduction when multiple people are present.
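The parallel multi-user flow can be sketched as below, taking each user's noise-reduced signal as already produced independently by the model from the same noisy input. Summing the scaled per-user signals is an assumption about how "obtaining an output signal based on the enhanced speech signals" might be realized; the patent does not fix the combination rule:

```python
import numpy as np

def parallel_output(denoised_signals, coeffs):
    """Scale each target user's noise-reduced signal by that user's
    speech enhancement coefficient and sum the results into a single
    output signal (assumed additive combination)."""
    out = np.zeros_like(denoised_signals[0], dtype=float)
    for sig, c in zip(denoised_signals, coeffs):
        out += c * sig
    return out

sigs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # two users' denoised speech
out = parallel_output(sigs, [2.0, 0.5])              # per-user enhancement coefficients
```

Per-user coefficients make it possible, for example, to emphasize one speaker over another in a multi-talker call.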
In a feasible embodiment, there are M target users, the target speech related data includes the speech related data of the M target users, the target user's noise-reduced speech signal includes the noise-reduced speech signals of the M target users, and M is an integer greater than 1.
Performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the target speech related data to obtain the target user's noise-reduced speech signal and the interference noise signal includes:
performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the speech related data of the first of the M target users to obtain the first target user's noise-reduced speech signal and a first noisy speech signal that no longer contains the first target user's speech signal; performing noise reduction processing on that signal through the speech noise reduction model according to the second target user's speech related data to obtain the second target user's noise-reduced speech signal and a first noisy speech signal containing neither the first nor the second target user's speech signal; and repeating this process until, according to the M-th target user's speech related data, noise reduction processing is performed through the speech noise reduction model on the first noisy speech signal that does not contain the speech signals of the 1st to (M-1)-th target users, yielding the M-th target user's noise-reduced speech signal and the interference noise signal. At this point, the noise-reduced speech signals of the M target users and the interference noise signal have been obtained.
With the above serial approach, the speech signals of multiple target users can be enhanced, thereby solving the problem of speech noise reduction when multiple people are present.
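The serial flow above can be sketched as a loop that peels off one target user per pass; `model` is a hypothetical stand-in for the speech noise reduction model, and the toy model used in the demo simply subtracts a known clean component:

```python
import numpy as np

def serial_denoise(noisy, user_data, model):
    """Each pass takes the residual left by the previous pass, extracts
    that user's speech, and returns a new residual; the final residual
    is the interference noise signal."""
    residual = noisy
    per_user = []
    for data in user_data:
        speech, residual = model(residual, data)
        per_user.append(speech)
    return per_user, residual

# Toy stand-in for the model: each "user datum" is that user's clean
# signal, and the model subtracts it from the residual.
toy_model = lambda x, clean: (clean, x - clean)
u1, u2 = np.array([1.0, 0.0]), np.array([0.0, 2.0])
noise = np.array([0.1, 0.1])
speeches, leftover = serial_denoise(u1 + u2 + noise, [u1, u2], toy_model)
```

After both passes, `leftover` is what the text calls the interference noise signal.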
In a feasible embodiment, there are M target users, the target speech related data includes the speech related data of the M target users, the target user's noise-reduced speech signal includes the noise-reduced speech signals of the M target users, and M is an integer greater than 1. Performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the target speech related data to obtain the target user's noise-reduced speech signal and the interference noise signal includes:
performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the speech related data of the M target users to obtain the noise-reduced speech signals of the M target users and the interference noise signal.
In a feasible embodiment, for the speech related data of the M target users, each target user's related data includes that target user's registered speech signal, and target user A's registered speech signal is target user A's speech signal collected in an environment whose noise decibel level is below a preset value. The speech noise reduction model includes M first encoding networks, a second encoding network, a time convolution network (TCN), a first decoding network, and M third decoding networks. Performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the target speech related data to obtain the target user's noise-reduced speech signal and the interference noise signal includes:
performing feature extraction on the registered speech signals of the M target users with the M first encoding networks, respectively, to obtain feature vectors of the M target users' registered speech signals; performing feature extraction on the first noisy speech signal with the second encoding network to obtain a feature vector of the first noisy speech signal; obtaining a first feature vector from the feature vectors of the M target users' registered speech signals and the feature vector of the first noisy speech signal; obtaining a second feature vector from the TCN and the first feature vector; obtaining the noise-reduced speech signals of the M target users from each of the M third decoding networks, the second feature vector, and the feature vector output by the first encoding network corresponding to that third decoding network; and obtaining the interference noise signal from the first decoding network, the second feature vector, and the feature vector of the first noisy speech signal.
In this manner, the speech signals of multiple target users can be denoised, thereby solving the problem of speech noise reduction when multiple people are present.
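One common way to realize the "first feature vector" above is to tile each registered-speech embedding across time and concatenate it with the noisy-speech features before the TCN. The shapes and the concatenation scheme below are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def build_first_feature(noisy_feats, reg_embeddings):
    """noisy_feats: (T, F) frame features of the noisy speech.
    reg_embeddings: list of M embeddings, each of shape (E,), one per
    registered target user.
    Returns (T, F + M*E): each frame's features with every target
    user's embedding appended, ready to feed to the TCN."""
    T = noisy_feats.shape[0]
    tiled = [np.tile(e, (T, 1)) for e in reg_embeddings]
    return np.concatenate([noisy_feats] + tiled, axis=1)

feats = np.zeros((4, 8))                # T=4 frames, F=8 features (made up)
embs = [np.ones(3), 2 * np.ones(3)]     # M=2 users, E=3 (made up)
first = build_first_feature(feats, embs)
```

The conditioning embeddings are constant over time, which is why tiling per frame is the natural fusion for frame-wise networks.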
In a feasible embodiment, there are M target users, the target user's related data includes the target user's registered speech signal, the target user's registered speech signal is the target user's speech signal collected in an environment whose noise decibel level is below a preset value, and the speech noise reduction model includes a first encoding network, a second encoding network, a TCN, and a first decoding network.
Performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the target speech related data to obtain the target user's noise-reduced speech signal includes:
performing feature extraction on the target user's registered speech signal and the first noisy speech signal with the first encoding network and the second encoding network, respectively, to obtain a feature vector of the target user's registered speech signal and a feature vector of the first noisy speech signal; obtaining a first feature vector from the feature vector of the target user's registered speech signal and the feature vector of the noisy speech signal; obtaining a second feature vector from the TCN and the first feature vector; and obtaining the target user's noise-reduced speech signal from the first decoding network and the second feature vector.
Further, the method of this application also includes:
obtaining the interference noise signal as well from the first decoding network and the second feature vector.
In a feasible embodiment, target user A's related data includes target user A's registered speech signal, which is target user A's speech signal collected in an environment whose noise decibel level is below a preset value; the speech noise reduction model includes a first encoding network, a second encoding network, a TCN, and a first decoding network; and performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to target user A's speech related data to obtain target user A's noise-reduced speech signal includes:
performing feature extraction on target user A's registered speech signal and the first noisy speech signal with the first encoding network and the second encoding network, respectively, to obtain a feature vector of target user A's registered speech signal and a feature vector of the first noisy speech signal; obtaining a first feature vector from the feature vector of target user A's registered speech signal and the feature vector of the first noisy speech signal; obtaining a second feature vector from the TCN and the first feature vector; and obtaining target user A's noise-reduced speech signal from the first decoding network and the second feature vector.
In a feasible embodiment, the related data of the i-th of the M target users includes the i-th target user's registered speech signal, where i is an integer greater than 0 and less than or equal to M, and the speech noise reduction model includes a first encoding network, a second encoding network, a TCN, and a first decoding network.
Feature extraction is performed on the i-th target user's registered speech signal and a first noise signal with the first encoding network and the second encoding network, respectively, to obtain a feature vector of the i-th target user's registered speech signal and a feature vector of the first noise signal, where the first noise signal is the first noisy speech signal that does not contain the speech signals of the 1st to (i-1)-th target users; a first feature vector is obtained from the feature vector of the i-th target user's registered speech signal and the feature vector of the first noise signal; a second feature vector is obtained from the TCN and the first feature vector; and the i-th target user's noise-reduced speech signal and a second noise signal are obtained from the first decoding network and the second feature vector, where the second noise signal is the first noisy speech signal that does not contain the speech signals of the 1st to i-th target users.
By registering the target user's speech signal in advance, the target user's speech signal can be enhanced and interfering speech and noise suppressed during subsequent voice interaction, ensuring that only the target user's speech signal is input during voice wake-up and voice interaction and improving the effectiveness and accuracy of voice wake-up and speech recognition. Moreover, the speech noise reduction model is built on a TCN causal dilated convolutional network, enabling the model to output the speech signal with low latency.
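The low-latency property comes from causality: the output at frame t depends only on current and past frames. A minimal numpy sketch of one causal dilated convolution, the building block of a TCN (the weights, length, and dilation here are arbitrary illustrative values):

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """1-D causal dilated convolution: y[t] = sum_k w[k] * x[t - k*d],
    so no future frames are needed -- this is what lets a TCN-based
    model emit each output frame with low latency."""
    K = len(w)
    pad = dilation * (K - 1)
    xp = np.concatenate([np.zeros(pad), np.asarray(x, dtype=float)])
    y = np.zeros(len(x))
    for t in range(len(x)):
        for k in range(K):
            y[t] += w[k] * xp[pad + t - k * dilation]
    return y

y = causal_dilated_conv([1.0, 2.0, 3.0, 4.0], [1.0, 1.0], dilation=2)
```

Stacking such layers with growing dilation gives a large past-only receptive field at constant per-frame cost.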
In a feasible embodiment, the target user's related data includes the target user's VPU signal, and the speech noise reduction model includes a preprocessing module, a third encoding network, a gated recurrent unit (GRU), a second decoding network, and a post-processing module. Performing noise reduction on the first noisy speech signal through the speech noise reduction model according to the target speech related data, to obtain the noise-reduced speech signal of the target user, includes:

The preprocessing module performs time-frequency transformation on the first noisy speech signal and on the target user's VPU signal respectively, to obtain a first frequency-domain signal of the first noisy speech signal and a second frequency-domain signal of the VPU signal; the first frequency-domain signal and the second frequency-domain signal are fused to obtain a first fused frequency-domain signal; the first fused frequency-domain signal is processed successively by the third encoding network, the GRU and the second decoding network, to obtain a mask of a third frequency-domain signal, the third frequency-domain signal being the frequency-domain signal of the target user's speech; the post-processing module post-processes the first frequency-domain signal according to the mask of the third frequency-domain signal, to obtain the third frequency-domain signal; frequency-time transformation is performed on the third frequency-domain signal to obtain the noise-reduced speech signal of the target user. The third encoding network and the second decoding network are both implemented based on convolutional layers and frequency transformation blocks (FTB).

Here, the post-processing includes mathematical operations such as point-wise multiplication.

Further, processing the first fused frequency-domain signal successively through the third encoding network, the GRU and the second decoding network also yields a mask of the first frequency-domain signal; the post-processing module post-processes the first frequency-domain signal according to that mask, to obtain a fourth frequency-domain signal of the interference noise signal; frequency-time transformation is performed on the fourth frequency-domain signal to obtain the interference noise signal.

Optionally, since the first noisy speech signal consists of the target user's speech signal and the interference noise signal, after the target user's noise-reduced speech signal is obtained, the interference noise signal can also be obtained by subtracting the target user's noise-reduced speech signal from the first noisy speech signal.
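As an illustrative, non-limiting sketch of the processing chain above, the following assumes Python with NumPy; the naive `stft`/`istft` pair stands in for the time-frequency and frequency-time transformations, and `mask_net` stands in for the third encoding network, GRU and second decoding network (the fusion here is a simple magnitude concatenation, and the network itself is not implemented):

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    # Naive time-frequency transform: Hann-windowed frames -> rFFT.
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.stack(frames), axis=-1)

def istft(spec, n_fft=256, hop=128):
    # Overlap-add inverse of the naive STFT above (frequency-time transform).
    frames = np.fft.irfft(spec, n=n_fft, axis=-1)
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    norm = np.zeros_like(out)
    win = np.hanning(n_fft)
    for k, f in enumerate(frames):
        out[k * hop:k * hop + n_fft] += f * win
        norm[k * hop:k * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

def denoise(noisy, vpu, mask_net):
    # 1) time-frequency transform of the mic signal and the VPU signal
    S_mic, S_vpu = stft(noisy), stft(vpu)
    # 2) fuse the two frequency-domain signals (here: magnitude concat)
    fused = np.concatenate([np.abs(S_mic), np.abs(S_vpu)], axis=-1)
    # 3) encoder/GRU/decoder stand-in predicts a [0, 1] mask per T-F bin
    speech_mask = mask_net(fused)                # shape == S_mic.shape
    # 4) post-processing: point-wise multiply, then frequency-time transform
    speech = istft(speech_mask * S_mic)          # target user's speech
    noise = istft((1.0 - speech_mask) * S_mic)   # complementary mask -> noise
    return speech, noise
```

With an all-pass mask the chain reconstructs the input (away from the frame edges), which is a quick sanity check that the transform pair is consistent.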
In a feasible embodiment, target user A's related data includes target user A's VPU signal, and the speech noise reduction model includes a preprocessing module, a third encoding network, a GRU, a second decoding network and a post-processing module. Performing noise reduction on the first noisy speech signal through the speech noise reduction model according to target user A's speech-related data, to obtain the noise-reduced speech signal of target user A, includes:

The preprocessing module performs time-frequency transformation on the first noisy speech signal and on target user A's VPU signal respectively, to obtain the first frequency-domain signal of the first noisy speech signal and a ninth frequency-domain signal of target user A's VPU signal; the first frequency-domain signal and the ninth frequency-domain signal are fused to obtain a second fused frequency-domain signal; the second fused frequency-domain signal is processed successively by the third encoding network, the GRU and the second decoding network, to obtain a mask of a tenth frequency-domain signal, the tenth frequency-domain signal being the frequency-domain signal of target user A's speech; the post-processing module post-processes the first frequency-domain signal according to the mask of the tenth frequency-domain signal, to obtain the tenth frequency-domain signal; frequency-time transformation is performed on the tenth frequency-domain signal to obtain the noise-reduced speech signal of target user A. The third encoding network and the second decoding network are both implemented based on convolutional layers and FTBs.
In a feasible embodiment, the related data of the i-th target user among the M target users includes the i-th target user's VPU signal, where i is an integer greater than 0 and less than or equal to M.

The preprocessing module performs time-frequency transformation on a first noise signal and on the i-th target user's VPU signal, to obtain an eleventh frequency-domain signal of the first noise signal and a twelfth frequency-domain signal of the i-th target user's VPU signal; the eleventh frequency-domain signal and the twelfth frequency-domain signal are fused to obtain a third fused frequency-domain signal, where the first noise signal is the first noisy speech signal with the speech signals of the 1st to (i-1)-th target users removed. The third fused frequency-domain signal is processed successively by the third encoding network, the GRU and the second decoding network, to obtain a mask of a thirteenth frequency-domain signal (the frequency-domain signal of the i-th target user's speech) and a mask of the eleventh frequency-domain signal; the post-processing module post-processes the eleventh frequency-domain signal according to these two masks, to obtain the thirteenth frequency-domain signal and a fourteenth frequency-domain signal of a second noise signal; frequency-time transformation is performed on the thirteenth and fourteenth frequency-domain signals, to obtain the i-th target user's noise-reduced speech signal and the second noise signal, the second noise signal being the first noisy speech signal with the speech signals of the 1st to i-th target users removed. The third encoding network and the second decoding network are both implemented based on convolutional layers and FTBs.
By using the target user's VPU signal as auxiliary information, the target user's speech features are extracted in real time and fused with the noisy speech signal picked up by the microphone, guiding the enhancement of the target user's speech and the suppression of interference such as non-target users' speech. This embodiment also proposes a new FTB- and GRU-based speech noise reduction model for these tasks. As can be seen, with the solution of this embodiment the user does not need to register speech feature information in advance: the real-time VPU signal serves as auxiliary information to obtain enhanced target-user speech while suppressing non-target speech interference.
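The multi-user embodiment above sequentially "peels off" one target user at a time: the residual noise signal left after extracting users 1 to i-1 is the input for user i. A minimal sketch of that loop, where `denoise_fn` is any single-user model returning the user's speech and the residual (its implementation is assumed, not specified by this sketch):

```python
def separate_users(noisy, vpu_signals, denoise_fn):
    # Sequential extraction: user i is denoised from the residual that
    # excludes the speech of users 1..i-1, as described in the embodiment.
    residual = noisy
    speeches = []
    for vpu in vpu_signals:
        speech, residual = denoise_fn(residual, vpu)
        speeches.append(speech)
    # `residual` is now the first noisy signal minus all M users' speech.
    return speeches, residual
```

The loop is agnostic to the signal representation; any `denoise_fn` with the `(residual, vpu) -> (speech, new_residual)` contract fits.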
In a feasible embodiment, enhancing the target user's noise-reduced speech signal based on the target user's speech enhancement coefficient, to obtain the target user's enhanced speech signal, includes:

for any target user A among the M target users, enhancing target user A's noise-reduced speech signal based on target user A's speech enhancement coefficient, to obtain target user A's enhanced speech signal, where the ratio of the amplitude of target user A's enhanced speech signal to the amplitude of target user A's noise-reduced speech signal is target user A's speech enhancement coefficient.

Fusing the interference noise suppression signal with the target users' enhanced speech signals to obtain an output signal includes:

fusing the enhanced speech signals of the M target users with the interference noise suppression signal to obtain the output signal.

For the noise-reduced speech signals of multiple target users, introducing a speech enhancement coefficient per target user allows the magnitude of each user's enhanced speech signal to be adjusted as needed.
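A minimal sketch of this fusion step, assuming Python with NumPy: each user's noise-reduced speech is scaled by that user's speech enhancement coefficient (defined in the text as the amplitude ratio of enhanced to noise-reduced speech), then summed with the interference noise suppression signal:

```python
import numpy as np

def fuse_output(denoised_speeches, gains, noise_suppressed):
    # output = sum_i (g_i * s_i) + suppressed interference noise,
    # where g_i is the i-th target user's speech enhancement coefficient.
    out = np.asarray(noise_suppressed, dtype=float).copy()
    for s, g in zip(denoised_speeches, gains):
        out += g * np.asarray(s, dtype=float)
    return out
```

Setting a gain above 1 boosts that user relative to the residual noise floor; a gain of 0 removes the user from the output entirely.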
In a feasible embodiment, the target user's related data includes the target user's VPU signal, and the method of this application further includes: acquiring the target user's in-ear sound signal.

Performing noise reduction on the first noisy speech signal through the speech noise reduction model according to the target speech related data, to obtain the noise-reduced speech signal of the target user, includes:

performing time-frequency transformation on the first noisy speech signal and the in-ear sound signal respectively, to obtain the first frequency-domain signal of the first noisy speech signal and a fifth frequency-domain signal of the in-ear sound signal; obtaining a covariance matrix of the first noisy speech signal and the in-ear sound signal according to the target user's VPU signal, the first frequency-domain signal and the fifth frequency-domain signal; obtaining a first minimum variance distortionless response (MVDR) weight based on the covariance matrix; obtaining a sixth frequency-domain signal of the first noisy speech signal and a seventh frequency-domain signal of the target user's in-ear sound signal based on the first MVDR weight, the first frequency-domain signal and the fifth frequency-domain signal; obtaining an eighth frequency-domain signal of the target user's noise-reduced speech signal according to the sixth and seventh frequency-domain signals; and performing frequency-time transformation on the eighth frequency-domain signal to obtain the target user's noise-reduced speech signal.

Further, the interference noise signal is obtained according to the target user's noise-reduced speech signal and the first noisy speech signal.
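The MVDR weight in this embodiment follows the standard closed form per frequency bin, w = R⁻¹d / (dᴴR⁻¹d). The sketch below assumes Python with NumPy; how the covariance matrix R and steering vector d are estimated from the VPU and in-ear signals is not shown here:

```python
import numpy as np

def mvdr_weights(R, d):
    # Per-bin MVDR weights: w = R^{-1} d / (d^H R^{-1} d).
    # R: channel covariance matrix for this frequency bin;
    # d: steering vector toward the target user.
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / (d.conj() @ Rinv_d)

def mvdr_filter(w, X):
    # Apply the weights to channels-by-frames frequency-domain snapshots.
    return w.conj() @ X
```

The defining property is the distortionless constraint: the target direction passes with unit gain (wᴴd = 1) while the output variance is minimized.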
In a feasible embodiment, target user A's related data includes target user A's VPU signal, and the method of this application further includes: acquiring target user A's in-ear sound signal.

Performing noise reduction on the first noisy speech signal through the speech noise reduction model according to target user A's speech-related data, to obtain target user A's noise-reduced speech signal, includes:

performing time-frequency transformation on the first noisy speech signal and target user A's in-ear sound signal respectively, to obtain the first frequency-domain signal of the first noisy speech signal and a fifteenth frequency-domain signal of target user A's in-ear sound signal; obtaining a covariance matrix of the first noisy speech signal and target user A's in-ear sound signal according to target user A's VPU signal, the first frequency-domain signal and the fifteenth frequency-domain signal; obtaining a second MVDR weight based on that covariance matrix; obtaining a sixteenth frequency-domain signal of the first noisy speech signal and a seventeenth frequency-domain signal of target user A's in-ear sound signal based on the second MVDR weight, the first frequency-domain signal and the fifteenth frequency-domain signal; obtaining an eighteenth frequency-domain signal of target user A's noise-reduced speech signal according to the sixteenth and seventeenth frequency-domain signals; and performing frequency-time transformation on the eighteenth frequency-domain signal to obtain target user A's noise-reduced speech signal.

With this method, the target user does not need to register speech feature information in advance; the real-time VPU signal serves as auxiliary information to obtain the enhanced speech signal of the target user (or of target user A) and to suppress interference such as non-target users' speech.
In a feasible embodiment, the method of this application further includes:

acquiring a first noise segment and a second noise segment of the environment in which the terminal device is located, the first and second noise segments being temporally consecutive; acquiring the signal-to-noise ratio (SNR) and sound pressure level (SPL) of the first noise segment; if the SNR of the first noise segment is greater than a first threshold and its SPL is greater than a second threshold, extracting a first temporary feature vector from the first noise segment; performing noise reduction on the second noise segment based on the first temporary speech feature vector, to obtain a second noise-reduced segment; performing damage assessment based on the second noise-reduced segment and the second noise segment, to obtain a first damage score; and if the first damage score is not greater than a third threshold, entering the PNR mode.

Acquiring the first noisy speech signal includes:

determining the first noisy speech signal from a noise signal generated after the first noise segment, where the feature vector of the registered speech signal includes the first temporary feature vector.
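The decision logic above reduces to a two-stage gate; a minimal sketch (threshold values are parameters, not specified by the embodiment):

```python
def enter_pnr(first_snr, first_spl, damage_score,
              snr_th, spl_th, damage_th):
    # Stage 1: the first noise segment must be clean (SNR > first
    # threshold) and loud (SPL > second threshold) before a temporary
    # feature vector is even extracted.
    if first_snr <= snr_th or first_spl <= spl_th:
        return False  # segment unusable for temporary feature extraction
    # Stage 2: enter PNR mode only if denoising the second segment with
    # that vector causes acceptable damage (score <= third threshold).
    return damage_score <= damage_th
```

In the embodiment the positive result additionally triggers the first prompt information, with PNR mode entered only after the target user agrees.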
Further, if the first damage score is not greater than the third threshold, the method of this application also includes:

issuing first prompt information through the terminal device, the first prompt information prompting whether the terminal device should enter the PNR mode; the PNR mode is entered only after an operation instruction of the target user agreeing to enter the PNR mode is detected.

With this method it can be determined whether the solution of this application should be used for speech noise reduction, avoiding situations where noise reduction is needed but not performed, achieving flexible automatic noise reduction, and improving the user experience.
In a feasible embodiment, the target user's related data includes a microphone array signal of an auxiliary device, and the method of this application further includes:

acquiring a first noise segment and a second noise segment of the environment in which the terminal device is located, the first and second noise segments being temporally consecutive; acquiring a signal collected from that environment by the microphone array of the terminal device's auxiliary device, and computing from the collected signal the direction of arrival (DOA) and sound pressure level (SPL) of the first noise segment; if the DOA of the first noise segment is greater than a ninth threshold and less than a tenth threshold, and its SPL is greater than an eleventh threshold, extracting a second temporary feature vector from the first noise segment; performing noise reduction on the second noise segment based on the second temporary speech feature vector, to obtain a fourth noise-reduced segment; performing damage assessment based on the fourth noise-reduced segment and the second noise segment, to obtain a fourth damage score; and if the fourth damage score is not greater than a twelfth threshold, entering the PNR mode.

Acquiring the first noisy speech signal includes:

determining the first noisy speech signal from a noise signal generated after the first noise segment, where the feature vector of the registered speech signal includes the second temporary feature vector.

Computing the DOA and SPL of the first noise segment from the collected signal may specifically include:

performing time-frequency transformation on the signal collected by the microphone array to obtain a nineteenth frequency-domain signal, and computing the DOA and SPL of the first noise segment based on the nineteenth frequency-domain signal.
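Of the two quantities, SPL has a simple closed form; a sketch assuming Python with NumPy (mapping digital samples to pascals is a device-calibration step assumed to be done, and DOA estimation from the array geometry is not shown):

```python
import numpy as np

def spl_db(samples, p_ref=20e-6):
    # Sound pressure level from the RMS of calibrated noise-segment
    # samples; p_ref is the standard 20 micropascal reference.
    rms = np.sqrt(np.mean(np.square(np.asarray(samples, dtype=float))))
    return 20.0 * np.log10(max(rms, 1e-12) / p_ref)
```

For example, a constant pressure of 0.02 Pa yields 20·log10(0.02 / 20e-6) = 60 dB SPL.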
Further, if the fourth damage score is not greater than the twelfth threshold, the method of this application also includes:

issuing fourth prompt information through the terminal device, the fourth prompt information prompting whether the terminal device should enter the PNR mode; the PNR mode is entered only after an operation instruction of the target user agreeing to enter the PNR mode is detected.

Optionally, the auxiliary device may be a device with a microphone array, such as a computer or a tablet.
In a feasible embodiment, the method of this application further includes:

when it is detected that the terminal device is being used again, acquiring a second noisy speech signal, and performing noise reduction on it with a conventional noise reduction algorithm, i.e. in non-PNR mode, to obtain the current caller's noise-reduced speech signal;

when the SNR of the second noisy speech signal is lower than a fourth threshold, performing noise reduction on the second noisy speech signal according to the first temporary feature vector, to obtain the current user's noise-reduced speech signal; performing damage assessment based on the current user's noise-reduced speech signal and the second noisy speech signal, to obtain a second damage score; when the second damage score is not greater than a fifth threshold, issuing second prompt information through the terminal device, the second prompt information informing the current user that the terminal device can enter the PNR mode; after detecting an operation instruction agreeing to enter the PNR mode, having the terminal device enter the PNR mode and perform noise reduction on a third noisy speech signal, the third noisy speech signal being acquired after the second noisy speech signal; and after detecting an operation instruction of the current user not agreeing to enter the PNR mode, performing noise reduction on the third noisy speech signal in non-PNR mode.

It should be noted here that after temporary speech feature extraction is performed on the first noise segment to obtain its temporary feature vector, the terminal device stores that vector and retrieves it directly when needed later. This avoids the situation where the current user's speech features cannot be obtained in a high-noise scene, making damage assessment impossible. The temporary feature vector of the first noise segment here may be the first temporary feature vector or the second temporary feature vector.

Optionally, the fourth threshold may or may not be the same as the first threshold, and the fifth threshold may or may not be the same as the third threshold.
In a feasible embodiment, the method of this application further includes:

if the SNR of the first noise segment is not greater than the first threshold or its SPL is not greater than the second threshold, and the terminal device has stored a reference temporary voiceprint feature vector, acquiring a third noise segment; performing noise reduction on the third noise segment according to the reference temporary voiceprint feature vector, to obtain a third noise-reduced segment; performing damage assessment based on the third noise segment and the third noise-reduced segment, to obtain a third damage score; if the third damage score is greater than a sixth threshold and the SNR of the third noise segment is less than a seventh threshold, or the third damage score is greater than an eighth threshold and the SNR of the third noise segment is not less than the seventh threshold, issuing third prompt information through the terminal device, the third prompt information informing the current user that the terminal device can enter the PNR mode; after detecting an operation instruction of the current user agreeing to enter the PNR mode, having the terminal device enter the PNR mode and perform noise reduction on a fourth noisy speech signal; and after detecting an operation instruction of the current user not agreeing to enter the PNR mode, performing noise reduction on the fourth noisy speech signal in non-PNR mode. The fourth noisy speech signal is determined from a noise signal generated after the third noise segment.

The reference temporary voiceprint feature vector is the voiceprint feature vector of a historical user.

Optionally, the seventh threshold may be 10 dB or another value, the sixth threshold may be 8 dB or another value, and the eighth threshold may be 12 dB or another value.
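The two-branch prompt condition above can be written compactly; a sketch using the optional example values named in the text (sixth threshold 8 dB, seventh 10 dB, eighth 12 dB) as defaults:

```python
def prompt_pnr(damage_score, segment_snr, th6=8.0, th7=10.0, th8=12.0):
    # Prompt the user about PNR mode when the damage score exceeds the
    # lower bar (sixth threshold) at low SNR, or the higher bar (eighth
    # threshold) when SNR is at or above the seventh threshold.
    if segment_snr < th7:
        return damage_score > th6
    return damage_score > th8
```

The intent is that in noisier conditions a smaller measured damage already justifies offering personalized noise reduction.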
With this method it can be determined whether the solution of this application should be used for speech noise reduction, avoiding situations where noise reduction is needed but not performed, achieving flexible automatic noise reduction, and improving the user experience.
In a feasible embodiment, the method of this application further includes:

not entering the PNR mode when it is detected that the terminal device is in a handheld call state;

entering the PNR mode when it is detected that the terminal device is in a hands-free call state, where the target user is the owner of the terminal device or the user currently using it;

entering the PNR mode when it is detected that the terminal device is in a video call state, where the target user is the owner of the terminal device or the user closest to it;

entering the PNR mode when it is detected that the terminal device is connected to an earphone for a call, where the target user is the user wearing the earphone, and the first noisy speech signal and the target speech related data are collected through the earphone; or

entering the PNR mode when it is detected that the terminal device is connected to a smart large-screen device, a smart watch or an in-vehicle device, where the target user is the owner of the terminal device or the user currently using it, and the first noisy speech signal and the target speech related data are collected by the audio acquisition hardware of the smart large-screen device, smart watch or in-vehicle device.

Deciding whether to enable the PNR noise reduction function based on the application scenario achieves flexible automatic noise reduction and improves the user experience.
In a feasible embodiment, the method of this application further includes:

acquiring the decibel value of the audio signal of the current environment; if it exceeds a preset decibel value, determining whether the PNR function corresponding to the application started on the terminal device is enabled; and if not, enabling the PNR function corresponding to that application and entering the PNR mode.

The application is one installed on the terminal device, such as calling, video calling, video recording, WeChat or QQ.

Deciding whether to enable the PNR function based on the level of the audio signal in the current environment achieves flexible automatic noise reduction and improves the user experience.
In a feasible embodiment, the terminal device includes a display screen with multiple display areas, each of which displays a label and a corresponding function key; the function key is used to switch on and off the PNR function of the function or application indicated by its label.

Providing, on the interface displayed on the terminal device's screen, controls for switching the PNR function of a given application (such as calling or video recording) on and off allows the user to enable and disable the PNR function as needed.
In a feasible embodiment, when voice data is transmitted between the terminal device and another terminal device, the method of this application further includes:

receiving a speech enhancement request sent by the other terminal device, the request instructing the terminal device to enable the PNR function of the call function; in response to the request, issuing third prompt information through the terminal device, prompting whether to enable the PNR function of the call function; after detecting an operation instruction confirming this, enabling the PNR function of the call function and entering the PNR mode; and sending a speech enhancement response message to the other terminal device, indicating that the terminal device has enabled the PNR function of the call function.

During a call, when one party finds the other in a noisy environment, it can send the other party a request to enable the PNR function of the call function on the other party's terminal device, improving call quality for both parties. This embodiment can of course also be applied to video calls and the like.
In a feasible embodiment, when the terminal device starts a video call or video recording function, its display interface includes a first area and a second area: the first area displays the video call or recording content, and the second area displays M controls and M corresponding labels, the M controls corresponding one-to-one to the M target users. Each control includes a slider button and a slider bar; sliding the button along the bar adjusts the speech enhancement coefficient of the target user indicated by the control's label.

By letting the user adjust the speech enhancement coefficients as needed, the strength of noise reduction can be tuned on demand. The interference noise suppression coefficient can of course be adjusted in the same way.

In a feasible embodiment, when the terminal device starts a video call or video recording function, its display interface includes a first area that displays the video call or recording content;

when an operation on any object in the video call or recording content is detected, a control corresponding to that object is displayed in the first area; the control includes a slider button and a slider bar, and sliding the button along the bar adjusts that object's speech enhancement coefficient.

By letting the user adjust the speech enhancement coefficients as needed, the strength of noise reduction can be tuned on demand. The interference noise suppression coefficient can of course be adjusted in the same way.
In a feasible embodiment, when the terminal device is an intelligent interactive device, the target-speech-related data includes a voice signal containing a wake-up word, and the first noisy speech signal includes an audio signal containing a command word.
Optionally, intelligent interactive devices include devices such as smart speakers, sweeping robots, smart refrigerators, and smart air conditioners.
Performing noise reduction on the instruction speech that controls the intelligent interactive device in this manner enables the device to obtain accurate instructions quickly and then complete the actions corresponding to those instructions.
In a second aspect, an embodiment of the present application provides a terminal device, where the terminal device includes units or modules configured to execute the method of the first aspect.
In a third aspect, an embodiment of the present application provides a terminal device including a processor and a memory, where the processor is connected to the memory, the memory is configured to store program code, and the processor is configured to call the program code to execute part or all of the method of the first aspect.
In a fourth aspect, an embodiment of the present application provides a chip system applied to an electronic device. The chip system includes one or more interface circuits and one or more processors, where the interface circuits and the processors are interconnected through lines. The interface circuit is configured to receive a signal from the memory of the electronic device and send the signal to the processor, the signal including computer instructions stored in the memory; when the processor executes the computer instructions, the electronic device executes the method of the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer storage medium, where the computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement the method of the first aspect.
In a sixth aspect, an embodiment of the present application further provides a computer program product including computer instructions which, when run on a terminal device, cause the terminal device to execute part or all of the method of the first aspect.
These and other aspects of the present application will be more concise and easier to understand in the description of the following embodiments.
Description of Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application or in the prior art, the following briefly introduces the drawings used in the description of the embodiments or the prior art. Obviously, the drawings in the following description are merely some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic diagram of an application scenario provided by an embodiment of the present application;
FIG. 2a is a schematic diagram of a speech noise reduction processing principle provided by an embodiment of the present application;
FIG. 2b is a schematic diagram of another speech noise reduction processing principle provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of a speech enhancement method provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a speech noise reduction model provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a specific structure of a speech noise reduction model provided by an embodiment of the present application;
FIG. 6a illustrates the framework structure of a TCN model;
FIG. 6b illustrates the structure of a causal dilated convolutional layer unit;
FIG. 7 is a schematic structural diagram of another speech noise reduction model provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of the specific structure of the neural network in FIG. 7;
FIG. 9 is a schematic diagram of a speech noise reduction process provided by an embodiment of the present application;
FIG. 10 is a schematic diagram of another speech noise reduction process provided by an embodiment of the present application;
FIG. 11 is a schematic diagram of a multi-person speech noise reduction process provided by an embodiment of the present application;
FIG. 12 is a schematic diagram of a multi-person speech noise reduction process provided by an embodiment of the present application;
FIG. 13 is a schematic diagram of a multi-person speech noise reduction process provided by an embodiment of the present application;
FIG. 14 is a schematic structural diagram of another speech noise reduction model provided by an embodiment of the present application;
FIG. 15 is a schematic diagram of a UI interface provided by an embodiment of the present application;
FIG. 16 is a schematic diagram of another UI interface provided by an embodiment of the present application;
FIG. 17 is a schematic diagram of another UI interface provided by an embodiment of the present application;
FIG. 18 is a schematic diagram of another UI interface provided by an embodiment of the present application;
FIG. 19 is a schematic diagram of a UI interface in a call scenario provided by an embodiment of the present application;
FIG. 20 is a schematic diagram of a UI interface in another call scenario provided by an embodiment of the present application;
FIG. 21 is a schematic diagram of a video recording UI interface provided by an embodiment of the present application;
FIG. 22 is a schematic diagram of a video call UI interface provided by an embodiment of the present application;
FIG. 23 is a schematic diagram of another video call UI interface provided by an embodiment of the present application;
FIG. 24 is a schematic structural diagram of a terminal device provided by an embodiment of the present application;
FIG. 25 is a schematic structural diagram of another terminal device provided by an embodiment of the present application;
FIG. 26 is a schematic structural diagram of another terminal device provided by an embodiment of the present application.
Detailed Description of Embodiments
Each of these is described in detail below.
The terms "first", "second", "third", and "fourth" in the specification, claims, and drawings of the present application are used to distinguish different target users, not to describe a specific order. Furthermore, the terms "include" and "have", as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, product, or device comprising a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent in the process, method, product, or device.
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The occurrences of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor are they separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
"Multiple" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate three cases: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates an "or" relationship between the associated objects.
Embodiments of the present application are described below with reference to the accompanying drawings.
Referring to FIG. 1, FIG. 1 is a schematic diagram of an application scenario provided by an embodiment of the present application. The application scenario includes an audio collection device 102 and a terminal device 101. The terminal device may be a smart phone, a smart watch, a television, a smart vehicle or in-vehicle terminal, a headset, a PC, a tablet, a notebook computer, a smart speaker, a robot, a recording collection device, or any other terminal device that needs to collect sound signals. For example, for speech enhancement on a mobile phone, the noisy speech signal collected by the microphone is processed, and the noise-reduced speech signal of the target user is output as the uplink signal of a voice call, or as the input signal of a voice wake-up or speech recognition engine.
Of course, the sound signal may also be collected by an audio collection device 102 connected to the terminal device in a wired or wireless manner; the audio collection device may be a smart watch, a television, a smart vehicle or in-vehicle terminal, a headset, a PC, a tablet, a notebook computer, a recording collection device, or the like.
Optionally, the audio collection device 102 and the terminal device 101 are integrated together.
FIG. 2a and FIG. 2b illustrate the principle of speech noise reduction processing. As shown in FIG. 2a, after a noisy speech signal obtained by mixing the speech of the target user, the speech of interfering speakers, and other noise is collected, the noisy speech signal and the registered speech of the target user are input into the speech noise reduction model for processing to obtain the noise-reduced speech signal of the target user. Alternatively, as shown in FIG. 2b, the noisy speech signal and the VPU signal of the target user are input into the speech noise reduction model for processing to obtain the noise-reduced speech signal of the target user.
The enhanced speech signal can be used for voice calls or for voice wake-up and speech recognition functions. For private devices (such as mobile phones, PCs, and various personal wearable products), the target user is fixed; during calls and voice interaction, only the voice information of the target user is kept as the registered speech or the VPU signal, and speech enhancement is then performed in the above manner, which can greatly improve the user experience. On limited public devices (such as smart home, in-vehicle, and conference room scenarios), the users are also relatively fixed, and speech enhancement can be performed through multi-user voice registration (the manner shown in FIG. 2a), which can improve the experience in multi-user scenarios.
Referring to FIG. 3, FIG. 3 is a schematic flowchart of a speech enhancement method provided by an embodiment of the present application. As shown in FIG. 3, the method includes:
S301. After the terminal device enters the PNR mode, acquire a first noisy speech signal and target-speech-related data, where the first noisy speech signal contains an interference noise signal and the speech signal of the target user, and the target-speech-related data is used to indicate the speech features of the target user.
Optionally, the target-speech-related data may be the registered speech signal of the target user, the VPU signal of the target user, the voiceprint features of the target user, or the video lip movement information of the target user.
In one example, a speech signal of a preset duration, collected by a microphone from the target user in a quiet scene, serves as the registered speech signal of the target user. The sampling frequency of the microphone may be 16000 Hz; assuming the preset duration is 6 s, the registered speech signal of the target user includes 96000 sampling points. A quiet scene specifically means that the sound level of the scene is not higher than a preset decibel level; optionally, the preset decibel level may be 1 dB, 2 dB, 5 dB, 10 dB, or another value.
In another example, the VPU signal of the target user is acquired through a device with a bone voiceprint sensor; the VPU sensor in the bone voiceprint sensor can pick up the sound signal of the target user transmitted through bone conduction. Compared with the signal collected by a microphone, the VPU signal differs in that it picks up only the speech of the target user and only low-frequency components (generally below 4 kHz).
The first noisy speech signal contains the speech signal of the target user and other noise signals, where the other noise signals include speech signals of other users and/or noise signals not produced by humans, such as noise produced by automobiles, construction site machinery, and the like.
S302. Perform noise reduction processing on the first noisy speech signal through a speech noise reduction model according to the target-speech-related data to obtain the noise-reduced speech signal of the target user, where the speech noise reduction model is implemented based on a neural network.
For different target-speech-related data, the speech noise reduction model has different network structures; that is, the speech noise reduction model adopts different processing manners for different target-speech-related data. When the target-speech-related data is the registered speech of the target user or the video lip movement information of the target user, the speech noise reduction model corresponding to Manner 1 may be used to perform noise reduction processing on the target-speech-related data and the first noisy speech signal. When the target-speech-related data includes the VPU signal of the target user, the speech noise reduction model corresponding to Manner 2 or Manner 3 may be used. The processing procedures of Manner 1, Manner 2, and Manner 3 are described in detail below.
Manner 1 is described in detail below, taking as an example the case where the target-speech-related data is the registered speech signal of the target user.
Manner 1: As shown in FIG. 4, performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the target-speech-related data to obtain the noise-reduced speech signal of the target user specifically includes the following steps:
Use the first encoding network to extract the feature vector of the registered speech signal from the registered speech signal of the target user; use the second encoding network to extract the feature vector of the noisy speech signal from the noisy speech signal; obtain a first feature vector from the feature vector of the registered speech signal and the feature vector of the noisy speech signal, specifically by performing a mathematical operation, such as a dot product, on the feature vector of the registered speech signal and the feature vector of the noisy speech signal; use the TCN to process the first feature vector to obtain a second feature vector; and then use the first decoding network to process the second feature vector to obtain the noise-reduced speech signal of the target user. From the above description, in Manner 1 the speech noise reduction model includes the first encoding network, the second encoding network, the TCN, and the first decoding network.
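The fusion of the two embeddings can be sketched as follows. Here the "dot product" is read as element-wise (Hadamard) multiplication — an assumption, since the text permits other operations, and the result must remain a vector that can be fed into the TCN:

```python
def fuse_embeddings(reg_vec, frame_vec):
    # Element-wise product of the registered-speech feature vector
    # (1x256 in the text) with one frame's feature vector extracted
    # from the noisy speech.
    assert len(reg_vec) == len(frame_vec)
    return [a * b for a, b in zip(reg_vec, frame_vec)]
```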
Specifically, as shown in part a of FIG. 5, the first encoding network includes a convolutional layer, layer normalization (256), a PReLU activation function (256), and an averaging layer; the size of the convolution kernel of the convolutional layer may be 1×1. The registered speech, with 96000 sampling points, is input with 40 sampling points per frame and passes through the convolutional layer, layer normalization, and the PReLU activation function to obtain a feature matrix of size 4800×256, where the overlap rate of the sampling points of two adjacent frames may be 50% (the overlap rate may of course be another value); the averaging layer then averages this feature matrix over the time dimension to obtain the feature vector of the registered speech signal, of size 1×256. The first noisy speech signal collected by the microphone is framed with 20 sampling points per frame and input frame by frame into the second encoding network for feature extraction to obtain the speech feature vector of each frame. As shown in part b of FIG. 5, the second encoding network includes a convolutional layer, layer normalization, and an activation function; specifically, the noisy speech, with 20 sampling points per frame, passes through the convolutional layer, layer normalization, and the activation function to obtain the speech feature vector of each frame. A mathematical operation, such as a dot product, is performed on the target speech feature vector and the speech feature vector of each frame of the first noisy speech to obtain the first feature vector. Optionally, the mathematical operation may be a dot product or another operation. The TCN model adopts a causal dilated convolution model. FIG. 6a illustrates the framework structure of the TCN model: as shown in FIG. 6a, the TCN model includes M blocks, and each block consists of N causal dilated convolutional layer units. FIG. 6b illustrates the structure of a causal dilated convolutional layer unit; the convolution dilation rate corresponding to the n-th layer is 2^(n−1). In this embodiment, the TCN model includes 5 blocks, and each block includes 4 causal dilated convolutional layer units; therefore, the dilation rates corresponding to layers 1, 2, 3, and 4 in each block are 1, 2, 4, and 8, respectively, and the convolution kernel is 3×1. The first feature vector passes through the TCN model to obtain the second feature vector, whose dimension is 1×256. As shown in part c of FIG. 5, the first decoding network includes a PReLU activation function (256) and a deconvolution layer (256×20×2); the second feature vector passes through the activation function and the deconvolution layer to obtain the speech signal of the target user. For the structure of the second encoding network, refer to the structure of the first encoding network; compared with the first encoding network, the second encoding network lacks the function of averaging over the time dimension.
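The framing of the registered speech described above (40-sample frames with 50% overlap) can be sketched as follows. Note that without edge padding, 96000 samples yield 4799 full frames; obtaining exactly the 4800 rows of the 4800×256 feature matrix implies padding at the tail, which is an assumption here:

```python
def frame_signal(x, frame_len=40, hop=20):
    # Split a signal into overlapping frames; hop = frame_len // 2 gives
    # the 50% overlap between adjacent frames described in the text.
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]
```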
It should be noted here that the 256 in "layer normalization (256)" and in "PReLU activation function (256)" indicates the number of feature dimensions output by the layer normalization and the activation function, and the 256×20×2 in "deconvolution layer (256×20×2)" indicates the size of the convolution kernel used by the deconvolution layer. The above description is only an illustrative explanation, not a limitation of the present application.
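A minimal sketch of a causal dilated convolution of the kind used in the TCN blocks above (kernel size 3, dilation rates 1, 2, 4, 8 per block). This is illustrative only, not the patented model; it demonstrates the causality property (each output depends only on current and past inputs) and the receptive field implied by the stated dilation schedule:

```python
def causal_dilated_conv1d(x, kernel, dilation):
    # Output at time t depends only on x[t], x[t-d], x[t-2d] for kernel
    # size 3: left-padding with zeros makes the convolution causal.
    pad = (len(kernel) - 1) * dilation
    xp = [0.0] * pad + list(x)
    return [sum(kernel[j] * xp[t + pad - j * dilation] for j in range(len(kernel)))
            for t in range(len(x))]

def receptive_field(kernel_size=3, dilations=(1, 2, 4, 8), blocks=5):
    # Past context (in samples) covered by stacking `blocks` blocks of
    # causal dilated layers with the given dilation schedule.
    return 1 + blocks * sum((kernel_size - 1) * d for d in dilations)
```

With 5 blocks of 4 layers, the receptive field is 1 + 5·2·(1+2+4+8) = 151 input samples, which is what allows a small stack of layers to see a long history at low latency.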
It should be pointed out that the video lip movement information of the target user includes multiple frames of images containing the lip movement information of the target user. When the target-speech-related data is the video lip movement information of the target user, the registered speech signal of the target user in Manner 1 is replaced with the video lip movement information of the target user, the feature vector of the video lip movement information is extracted through the first encoding network, and subsequent processing is then performed in Manner 1 as described above.
By registering the speech signal of the target user in advance, in subsequent voice interaction the speech signal of the target user can be enhanced while interfering speech and noise are suppressed, ensuring that only the speech signal of the target user is input during voice wake-up and voice interaction, thereby improving the effect and accuracy of voice wake-up and speech recognition. In addition, building the speech noise reduction model with a TCN causal dilated convolution network enables the model to output the speech signal with low latency.
Manner 2 and Manner 3 are described in detail below, taking as an example the case where the target-speech-related data is the VPU signal of the target user.
Manner 2: As shown in FIG. 7, performing noise reduction processing on the VPU signal of the target user and the first noisy speech signal using the speech noise reduction model to obtain the noise-reduced speech signal of the target user specifically includes the following steps:
Perform time-frequency transformation on the VPU signal of the target user and on the first noisy speech signal through the preprocessing module to obtain the frequency-domain signal of the VPU signal of the target user and the frequency-domain signal of the first noisy speech signal; fuse the frequency-domain signal of the VPU signal of the target user with the frequency-domain signal of the first noisy speech signal to obtain a first fused frequency-domain signal; process the first fused frequency-domain signal successively through the third encoding network, the GRU, and the second decoding network to obtain the mask of the frequency-domain signal of the speech signal of the target user; through the post-processing module, post-process the frequency-domain signal of the first noisy speech signal according to that mask, for example with a dot product, to obtain the frequency-domain signal of the speech signal of the target user; and perform frequency-time transformation on the frequency-domain signal of the speech signal of the target user to obtain the noise-reduced speech signal of the target user. From the above, the speech noise reduction model of Manner 2 includes the preprocessing module, the third encoding network, the GRU, the second decoding network, and the post-processing module.
Specifically, the preprocessing module performs a fast Fourier transform (FFT) on the VPU signal of the target user and on the first noisy speech signal, respectively, to obtain the frequency-domain signal of the VPU signal of the target user and the frequency-domain signal of the first noisy speech signal. The preprocessing module then fuses the two: it splices the VPU frequency-domain signal of the target user and the frequency-domain signal of the noisy speech in the frequency domain, or superimposes the spectrum of the frequency-domain signal of the VPU signal of the target user onto the spectrum of the frequency-domain signal of the first noisy speech signal, or performs a dot product operation on the two frequency-domain signals, thereby obtaining the first fused frequency-domain signal. For example, the 0–1.5 kHz band is extracted from the frequency-domain signal of the VPU signal of the target user, and the 1.5–8 kHz band is extracted from the frequency-domain signal of the first noisy speech signal; the two extracted groups of frequency-domain signals are directly spliced in the frequency domain to obtain the first fused frequency-domain signal, whose frequency range is then 0–8 kHz. As shown in FIG. 8, the first fused frequency-domain signal is input into the third encoding network for feature extraction to obtain the feature vector of the first fused frequency-domain signal; this feature vector is then input into the GRU for processing to obtain a third feature vector; the third feature vector is input into the second decoding network for processing to obtain the mask of the frequency-domain signal of the speech signal of the target user. As shown in FIG. 8, the third encoding network and the second decoding network each include 2 convolutional layers and 1 FTB, where the size of the convolution kernels of the convolutional layers is 3×3. The post-processing module performs a dot product of the mask of the frequency-domain signal of the speech signal of the target user with the frequency-domain signal of the first noisy speech signal to obtain the frequency-domain signal of the speech signal of the target user; an inverse fast Fourier transform (IFFT) is then performed on this frequency-domain signal to obtain the noise-reduced speech signal of the target user. The above description is only an illustrative explanation, not a limitation of the present application.
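The band splicing and mask-based post-processing described above can be sketched as follows on one-sided FFT spectra. The bin arithmetic (16 kHz sampling, 1.5 kHz split point) follows the text; the FFT size of 256 used in the test is an assumption for illustration:

```python
def splice_bands(vpu_spec, noisy_spec, sample_rate=16000, split_hz=1500):
    # Keep the 0-1.5 kHz bins from the VPU spectrum and the 1.5-8 kHz bins
    # from the noisy-speech spectrum, giving a fused 0-8 kHz spectrum.
    assert len(vpu_spec) == len(noisy_spec)
    hz_per_bin = (sample_rate / 2) / (len(vpu_spec) - 1)
    split = round(split_hz / hz_per_bin)
    return vpu_spec[:split] + noisy_spec[split:]

def apply_mask(mask, noisy_spec):
    # Point-wise product of the predicted mask with the noisy spectrum,
    # yielding the frequency-domain signal of the target user's speech.
    return [m * s for m, s in zip(mask, noisy_spec)]
```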
The VPU signal of the target user serves as auxiliary information for extracting the speech features of the target user in real time; these features are fused with the first noisy speech signal collected by the microphone to guide the enhancement of the target user's speech and the suppression of interference such as the speech of non-target users. This embodiment also proposes a new speech noise reduction model based on the FTB and the GRU for enhancing the speech of the target user and suppressing such interference. It can be seen that, with the solution of this embodiment, the user does not need to register speech feature information in advance; the real-time VPU signal can serve as auxiliary information to obtain the enhanced speech of the target user and suppress the interference of non-target speech.
Manner 3: Perform time-frequency transformation on the first noisy speech signal and on the in-ear sound signal of the target user, respectively, to obtain the frequency-domain signal of the first noisy speech signal and the frequency-domain signal of the in-ear sound signal of the target user; according to the VPU signal of the target user, and based on the frequency-domain signal of the first noisy speech signal and the frequency-domain signal of the in-ear sound signal of the target user, obtain the covariance matrix of the first noisy speech signal and the in-ear sound signal of the target user; based on this covariance matrix, obtain a first MVDR weight; based on the first MVDR weight, the frequency-domain signal of the first noisy speech signal, and the frequency-domain signal of the in-ear sound signal of the target user, obtain the frequency-domain signal of a first speech signal and the frequency-domain signal of a second speech signal, where the frequency-domain signal of the first speech signal is related to the first noisy speech signal and the frequency-domain signal of the second speech signal is related to the in-ear sound signal of the target user; obtain the frequency-domain signal of the noise-reduced speech signal of the target user from the frequency-domain signal of the first speech signal and the frequency-domain signal of the second speech signal; and perform frequency-time transformation on the frequency-domain signal of the noise-reduced speech signal of the target user to obtain the noise-reduced speech signal of the target user.
Specifically, an earphone device with a bone voiceprint sensor includes the bone voiceprint sensor, an in-ear microphone, and an out-of-ear microphone. The VPU sensor in the bone voiceprint sensor picks up the speaker's bone-conducted sound signal; the in-ear microphone picks up the in-ear sound signal; the out-of-ear microphone picks up the out-of-ear sound signal, which is the first noisy speech signal in this application.
As shown in Fig. 9, the target user's VPU signal is processed by a voice activity detection (VAD) algorithm to obtain a processing result, and whether the target user is speaking is determined from this result. If the target user is judged to be speaking, a first flag is set to a first value (e.g., 1 or true); if the target user is judged not to be speaking, the first flag is set to a second value (e.g., 0 or false).
When the value of the first flag is the second value, the covariance matrix is updated. Specifically, time-frequency transforms, e.g., an FFT, are performed on the first noisy speech signal and the target user's in-ear sound signal respectively, yielding their frequency-domain signals; the covariance matrix of the target user's in-ear sound signal and the first noisy speech signal is then computed from these two frequency-domain signals. The covariance matrix can be expressed as R_n(f) = X(f)X^H(f), where X(f) is the two-channel frequency-domain signal formed by the target user's in-ear sound signal and the first noisy speech signal, X^H(f) is the Hermitian transform, i.e., the conjugate transpose, of X(f), and f is the frequency bin. The MVDR weight is then obtained from the covariance matrix and can be expressed as:
w_n(f, θ_s) = R_n^{-1}(f) a(f, θ_s) / ( a^H(f, θ_s) R_n^{-1}(f) a(f, θ_s) )

where a(f, θ_s) = [a_1(f, θ_s) a_2(f, θ_s) … a_M(f, θ_s)]^T is the steering vector for the signal direction θ_s at frequency bin f; f is the frequency bin; θ_s is the target direction, a preset value such as 90 degrees in the vertical direction (the earphone wearing posture is relatively fixed with respect to the mouth position); M is the number of microphones; a^H(f, θ_s) is the Hermitian transform of a(f, θ_s); and R_n^{-1}(f) is the inverse matrix of R_n(f).
Based on the first MVDR weight, the frequency-domain signal of the first noisy speech signal, and the frequency-domain signal of the target user's in-ear sound signal, the frequency-domain signals of the first speech signal and the second speech signal are obtained, where the frequency-domain signal of the first speech signal is related to the first noisy speech signal and the frequency-domain signal of the second speech signal is related to the target user's in-ear sound signal. These frequency-domain signals can be expressed as Y_n(f) = w_n(f, θ_s) X_n(f). It should be noted that w_n(f, θ_s) contains two vectors, corresponding respectively to the frequency-domain signals of the first and second speech signals; the frequency-domain signal of the first noisy speech signal and the frequency-domain signal of the target user's in-ear sound signal are each multiplied element-wise (dot product) with the two vectors to obtain the frequency-domain signals of the first and second speech signals. The frequency-domain signal of the target user's noise-reduced speech signal is then obtained from the frequency-domain signals of the first and second speech signals. Specifically, they are added frequency bin by frequency bin: the first bin of the first speech signal's frequency-domain signal is added to the first bin of the second speech signal's frequency-domain signal, the second bin to the second bin, and so on, until all corresponding bins of the two frequency-domain signals have been added, yielding the frequency-domain signal of the target user's noise-reduced speech signal. An IFFT is applied to this frequency-domain signal to obtain the target user's noise-reduced speech signal.
When the value of the first flag is the first value, the covariance matrix is locked and not updated; that is, the historical covariance matrix is used when computing the first MVDR weight.
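The VAD-gated covariance update and MVDR weight computation described above can be sketched as follows for a single frequency bin. This is a minimal numpy illustration, not the patent's implementation: the function name, the use of a pseudo-inverse, and the absence of recursive smoothing of R_n(f) (which a real system would likely apply across frames) are assumptions.

```python
import numpy as np

def mvdr_update(X, R_n, steering, speaking):
    """One frequency bin of the mode-3 pipeline (illustrative sketch).

    X        : (2,) complex bin values [out-of-ear mic, in-ear mic]
    R_n      : (2, 2) complex noise covariance for this bin
    steering : (2,) complex steering vector a(f, theta_s)
    speaking : bool, VAD decision from the VPU signal (the first flag)
    """
    if not speaking:
        # Target silent: update the noise covariance R_n(f) = X(f) X^H(f).
        R_n = np.outer(X, X.conj())
    # else: covariance is locked; the historical R_n is reused.

    # MVDR weight: w = R_n^{-1} a / (a^H R_n^{-1} a)
    # (pinv used so a rank-deficient single-frame covariance still inverts)
    Rinv = np.linalg.pinv(R_n)
    num = Rinv @ steering
    w = num / (steering.conj() @ num)

    # Per-channel beamformed bins, then summed over the two channels
    # (the bin-by-bin addition of the first and second speech signals).
    Y = w.conj() * X
    y_bin = Y.sum()
    return R_n, w, y_bin
```

With the flag at the first value (`speaking=True`), the covariance passed in is returned unchanged, matching the "lock" behavior.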
With mode 3, the user does not need to register voice feature information in advance; the real-time VPU signal serves as auxiliary information to obtain an enhanced speech signal while suppressing interference noise.
In a feasible embodiment, to further enhance the target user's noise-reduced speech signal, the target user's speech enhancement coefficient is obtained, and the noise-reduced speech signal is enhanced based on this coefficient to obtain the target user's enhanced speech signal, where the ratio of the amplitude of the target user's enhanced speech signal to the amplitude of the target user's noise-reduced speech signal is the target user's speech enhancement coefficient.
Since outputting the user's speech signal alone would degrade the user experience, an interference noise signal is added on top of the target user's speech signal to improve the user experience. In a feasible embodiment, for the speech noise-reduction models of modes 1 and 2, the decoding networks (including the first decoding network and the second decoding network) can be trained so that the model outputs not only the target user's enhanced speech signal but also the interference noise signal. For mode 3, after the target user's noise-reduced speech signal is obtained, the interference noise signal can be obtained by subtracting the target user's noise-reduced speech signal from the first noisy speech signal.
For mode 2, the second decoding network of the speech noise-reduction model also outputs a mask of the frequency-domain signal of the first noisy speech signal. The post-processing module applies this mask to the frequency-domain signal of the first noisy speech signal, e.g., by element-wise multiplication, to obtain the frequency-domain signal of the interference noise, and then applies a frequency-time transform, e.g., an IFFT, to obtain the interference noise signal.
Optionally, after the target user's noise-reduced speech signal is obtained, the first noisy speech signal is processed according to it to obtain the interference noise signal. Specifically, the interference noise signal is obtained by subtracting the target user's noise-reduced speech signal from the first noisy speech signal.
Optionally, for mode 1, mode 2, or mode 3, after the interference noise signal is obtained, it is fused with the target user's enhanced speech signal to obtain an output signal; the output signal is the mixture of the target user's enhanced speech signal and the interference noise signal.
Alternatively, as shown in Fig. 10, an interference noise suppression coefficient is obtained, and the interference noise signal is suppressed based on it to obtain an interference noise suppression signal, where the ratio of the amplitude of the interference noise suppression signal to the amplitude of the interference noise is the interference noise suppression coefficient. The interference noise suppression signal is then fused with the target user's enhanced speech signal to obtain an output signal, which is the mixture of the target user's enhanced speech signal and the interference noise suppression signal.
Alternatively, the interference noise suppression coefficient is obtained and the interference noise signal is suppressed based on it to obtain an interference noise suppression signal; the interference noise suppression signal is then fused with the target user's noise-reduced speech signal to obtain an output signal, which is the mixture of the target user's noise-reduced speech signal and the interference noise suppression signal.
The interference noise suppression coefficient α and the target speech enhancement coefficient β may be preset by the system, e.g., α = 0 and β = 1, or may be set by the user, e.g., through the UI of the terminal device.
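The fusion described above reduces to scaling and summing the two components. The sketch below follows the amplitude-ratio definitions given for α and β; the function name is illustrative, and the interference noise is assumed to have already been obtained (e.g., by subtracting the noise-reduced speech from the first noisy speech signal).

```python
import numpy as np

def fuse_output(speech_dn, noise, alpha=0.0, beta=1.0):
    """Illustrative fusion step.

    speech_dn : target user's noise-reduced speech signal
    noise     : interference noise signal
    beta      : target speech enhancement coefficient (enhanced amplitude /
                noise-reduced amplitude)
    alpha     : interference noise suppression coefficient (suppressed
                amplitude / interference amplitude)
    """
    speech_enh = beta * np.asarray(speech_dn, dtype=float)   # enhanced speech
    noise_sup = alpha * np.asarray(noise, dtype=float)       # suppressed noise
    return speech_enh + noise_sup                            # output signal
```

With the example defaults α = 0, β = 1, the output is simply the noise-reduced speech with the interference fully removed.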
In conference and video-call scenarios with multiple participants, there may be more than one target user whose speech needs enhancement; for multi-user speech enhancement, mode 4, mode 5, and mode 6 can be used.
There are M target users; the target-speech-related data includes the related data of the M target users, the target users' noise-reduced speech signals include the noise-reduced speech signals of the M target users, and the target users' speech enhancement coefficients include the speech enhancement coefficients of the M target users. The first noisy speech signal contains the speech signals of the M target users and the interference noise signal.
Mode 4: As shown in Fig. 11, the speech-related data of the 1st of the M target users and the first noisy speech signal are input into the speech noise-reduction model for noise reduction, yielding the 1st target user's noise-reduced speech signal and a first noisy speech signal that no longer contains the 1st target user's speech. The 2nd target user's speech-related data and this residual signal are then input into the model, yielding the 2nd target user's noise-reduced speech signal and a first noisy speech signal that no longer contains the speech of the 1st and 2nd target users. These steps are repeated until the Mth target user's speech-related data and the first noisy speech signal without the speech of target users 1 through M-1 are input into the model, yielding the Mth target user's noise-reduced speech signal and the interference noise signal, i.e., the first noisy speech signal without the speech signals of any of the M target users. The noise-reduced speech signals of the M target users are then enhanced based on their respective speech enhancement coefficients to obtain the M target users' enhanced speech signals; for any target user O among the M target users, the ratio of the amplitude of target user O's enhanced speech signal to the amplitude of O's noise-reduced speech signal is O's speech enhancement coefficient. The interference noise signal is suppressed based on the interference noise suppression coefficient to obtain an interference noise suppression signal, where the ratio of the amplitude of the interference noise suppression signal to the amplitude of the interference noise signal is the interference noise suppression coefficient. The enhanced speech signals of the M target users are fused with the interference noise suppression signal to obtain the output signal, which is the mixture of the M target users' enhanced speech signals and the interference noise suppression signal.
For the speech noise-reduction model in mode 4: when the speech-related data of the M target users are registered speech signals or video lip-movement information, the model can have the structure described in mode 1; when the speech-related data of the M target users are VPU signals, the model can have the structure described in mode 2, or it can implement the function described in mode 3.
In one example, after the noise-reduced speech signals and the interference noise signal of the M target users are obtained according to mode 4, they are directly fused to obtain the output signal, which is the mixture of the M target users' noise-reduced speech signals and the interference noise signal.
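The sequential "peeling" loop of mode 4 can be sketched as follows. This is a hedged illustration, not the patent's API: `model` stands for the speech noise-reduction model and is assumed to return a (noise-reduced speech, residual) pair, where the residual no longer contains that user's speech.

```python
import numpy as np

def sequential_denoise(noisy, user_data, model, betas, alpha):
    """Mode-4 sketch: remove each target user's speech from the mix in turn.

    noisy     : first noisy speech signal (contains M users + interference)
    user_data : per-user speech-related data, in processing order
    model     : callable(data, signal) -> (speech_dn, residual)  [assumed]
    betas     : per-user speech enhancement coefficients
    alpha     : interference noise suppression coefficient
    """
    residual = np.asarray(noisy, dtype=float)
    mix = np.zeros_like(residual)
    for data, beta in zip(user_data, betas):
        speech_dn, residual = model(data, residual)
        mix += beta * speech_dn          # per-user enhancement, then fusion
    # After the loop, residual is the interference noise signal.
    return mix + alpha * residual        # suppress interference and mix
```

Setting all betas to 1 and alpha to 1 reproduces the direct fusion of the example above, since the peeled components sum back to the original mixture.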
Mode 5: There are M target users. As shown in Fig. 12, the speech-related data of the 1st target user and the first noisy speech signal are input into the speech noise-reduction model for noise reduction, yielding the 1st target user's noise-reduced speech signal; the 2nd target user's speech-related data and the first noisy speech signal are input into the model, yielding the 2nd target user's noise-reduced speech signal; and so on, until the Mth target user's speech-related data and the first noisy speech signal are input into the model, yielding the Mth target user's noise-reduced speech signal. The noise-reduced speech signals of the M target users are enhanced based on their respective speech enhancement coefficients to obtain the M target users' enhanced speech signals; for any target user O among the M target users, the ratio of the amplitude of O's enhanced speech signal to the amplitude of O's noise-reduced speech signal is O's speech enhancement coefficient. The enhanced speech signals of the M target users are fused to obtain the output signal, which is the mixture of the M target users' enhanced speech signals.
It should be understood that the speech-related data of the M target users and the first noisy speech signal are input into the speech noise-reduction model in parallel, so the above operations can be performed in parallel.
For the speech noise-reduction model in mode 5: when the speech-related data of the M target users are registered speech signals or video lip-movement information, the model can have the structure described in mode 1; when the speech-related data of the M target users are VPU signals, the model can have the structure described in mode 2, or it can implement the function described in mode 3.
In one example, after the enhanced speech signals of the M target users are obtained through the speech noise-reduction model, they can be fused directly to obtain the above output signal, which is the mixture of the M target users' enhanced speech signals.
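In contrast to the sequential loop of mode 4, each mode-5 call receives the same first noisy speech signal, so the per-user calls are independent. A minimal sketch, with `model` again an assumed callable returning a user's noise-reduced speech:

```python
import numpy as np

def parallel_denoise(noisy, user_data, model, betas):
    """Mode-5 sketch: each user's data is paired with the SAME noisy signal.

    The list comprehension below could be dispatched concurrently, since no
    call depends on another's output; the enhanced signals are then fused
    by summation.
    """
    noisy = np.asarray(noisy, dtype=float)
    enhanced = [beta * model(data, noisy)
                for data, beta in zip(user_data, betas)]
    return np.sum(enhanced, axis=0)      # fuse the M enhanced signals
```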
Mode 6: As shown in Fig. 13, the speech-related data of the M target users and the first noisy speech signal are input into the speech noise-reduction model for noise reduction, yielding the M target users' noise-reduced speech signals. These are enhanced based on the M target users' respective speech enhancement coefficients to obtain the M target users' enhanced speech signals; for any target user O among the M target users, the ratio of the amplitude of O's enhanced speech signal to the amplitude of O's noise-reduced speech signal is O's speech enhancement coefficient. The interference noise signal is suppressed based on the interference noise suppression coefficient to obtain an interference noise suppression signal, where the ratio of the amplitude of the interference noise suppression signal to the amplitude of the interference noise signal is the interference noise suppression coefficient. The enhanced speech signals of the M target users are fused with the interference noise suppression signal to obtain the output signal, which is the mixture of the M target users' enhanced speech signals and the interference noise suppression signal.
Further, the speech noise-reduction model in mode 6 is shown in Fig. 14. It includes M first encoding networks, a second encoding network, a TCN, and a first decoding network. The M first encoding networks perform feature extraction on the registered speech signals of the M target users respectively, yielding the feature vectors of the M registered speech signals; the second encoding network performs feature extraction on the first noisy speech signal, yielding its feature vector. A mathematical operation, such as element-wise multiplication, is performed on the feature vectors of the M registered speech signals and the feature vector of the first noisy speech signal to obtain a first feature vector. The TCN processes the first feature vector to obtain a second feature vector, which the first decoding network processes to obtain the target users' noise-reduced speech signals and the interference noise signal.
It should be noted that during a multi-party teleconference or call, multiple people may be present at one end, each wearing an earphone; through these earphones, each person's VPU signal can be collected, and noise reduction can then be performed according to the VPU-signal-based noise-reduction scheme described above.
In a feasible embodiment, the interference noise suppression coefficient can be a default value or can be set by the target user according to their own needs. For example, as shown in the left part of Fig. 15, after the PNR function is enabled on the terminal device, the device enters the PNR mode and its display shows the stepless slider control shown in the right part of Fig. 15. The target user adjusts the interference noise suppression coefficient by moving the gray knob on the slider, where the coefficient's value range is [0, 1]: when the knob is slid to the far left, the coefficient is 0, indicating that the PNR mode is not entered and interference noise is not suppressed; when the knob is slid to the far right, the coefficient is 1, indicating that interference noise is completely suppressed; when the knob is in between, interference noise is partially suppressed.
The strength of noise reduction is adjusted by adjusting the value of the interference noise suppression coefficient.
Optionally, the stepless slider control can be disk-shaped as shown in Fig. 15, bar-shaped, or of another shape, which is not limited here.
It should be noted here that the speech enhancement coefficient can also be adjusted in the above manner.
In a feasible embodiment, whether noise reduction uses a traditional noise-reduction algorithm or the noise-reduction method disclosed in this application can be determined as follows; the method of this application further includes:
obtaining a first noise segment and a second noise segment of the environment in which the terminal device is located, where the first and second noise segments are consecutive in time; obtaining the SNR and SPL of the first noise segment; if the SNR of the first noise segment is greater than a first threshold and the SPL of the first noise segment is greater than a second threshold, extracting a first temporary feature vector of the first noise segment; performing noise reduction on the second noise segment based on the first temporary feature vector to obtain a second noise-reduced segment; performing damage assessment based on the second noise-reduced segment and the first noise segment to obtain a first damage score; and, if the first damage score is not greater than a third threshold, entering the PNR mode, determining the first noisy speech signal from the noise signal generated after the first noise segment, and using the first temporary feature vector as the feature vector of the registered speech signal.
Further, if the first damage score is not greater than the third threshold, a first prompt message is sent to the target user through the terminal device, prompting the target user whether to have the terminal device enter the PNR mode; the PNR mode is entered only after an operation instruction in which the target user agrees to enter the PNR mode is detected.
Specifically, when the user uses the terminal device for the first time, the device's default microphone collects the speech signal and processes it with a traditional noise-reduction algorithm to obtain the user's noise-reduced speech signal. At a preset period (e.g., every 10 minutes), the terminal device also obtains a first noise segment of its environment (e.g., the 6 s of signal currently collected by the microphone) and a second noise segment (e.g., the subsequent 10 s of signal), and obtains the SNR and SPL of the first noise segment. It then judges whether the SNR of the first noise segment is greater than 20 dB and its SPL greater than 40 dB. If the SNR of the first noise segment is greater than the first threshold (e.g., 20 dB) and its SPL is greater than the second threshold (e.g., 40 dB), the first temporary feature vector of the first noise segment is extracted, and the second noise segment is noise-reduced using the first temporary feature vector to obtain the second noise-reduced segment. Damage assessment is performed based on the second noise-reduced segment and the second noise segment to obtain the first damage score, which characterizes the degree of damage to the signal collected by the terminal device's microphone: the larger the first damage score, the higher the degree of damage. If the first damage score is not greater than the third threshold, indicating that the speech signal collected by the microphone is undamaged, the terminal device issues the first prompt message asking the user whether to have the device enter the PNR mode; the prompt can be voice information, text displayed on the device's screen, or other forms of information, which are not limited here. The user's instruction in response to the prompt is detected; it can be a voice, touch, or gesture instruction, etc. If the instruction indicates that the user does not agree to enter the PNR mode, the traditional noise-reduction algorithm continues to be used; if it indicates agreement, the device waits for the user to finish the current sentence, then enters the PNR mode and determines the first noisy speech signal from the noise signal generated after the first noise segment, i.e., obtains it from the second noise segment or from noise signals collected after the second noise segment, and stores the first temporary feature vector as the feature vector of the registered speech signal. If the first damage score is greater than the third threshold, the first and second noise segments are re-obtained after the preset period and the above steps are repeated.
Here, determining the first noisy speech signal from the noise signal generated after the first noise segment can be understood as the first noisy speech signal being part or all of the noise signal generated after the first noise segment.
Optionally, the damage score may be a signal-to-distortion ratio (SDR) value or a perceptual evaluation of speech quality (PESQ) value.
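As an illustration only, a minimal sketch of an SDR-style score is given below. The patent does not specify the exact formula, the sign convention (the text says a greater damage score means more damage, while a higher SDR conventionally means less distortion), or the function name `sdr_damage_score`, so all of these are assumptions:

```python
import math

def sdr_damage_score(reference, processed):
    """Standard SDR in dB between a reference segment and a processed
    (noise-reduced) segment; by convention a lower SDR implies more
    distortion. Both inputs are equal-length lists of float samples."""
    signal_energy = sum(s * s for s in reference)
    distortion_energy = sum((s - p) * (s - p)
                            for s, p in zip(reference, processed))
    if distortion_energy == 0.0:
        return float("inf")  # identical signals: no distortion at all
    return 10.0 * math.log10(signal_energy / distortion_energy)
```

How the raw SDR value is mapped onto the patent's "greater means more damaged" score is left open here, since the text does not say.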
In a feasible embodiment, the method of the present application further includes:
If the SNR of the first noise segment is not greater than the first threshold, or the SPL of the first noise segment is not greater than the second threshold, and the terminal device has stored a reference temporary voiceprint feature vector, acquiring a third noise segment; performing noise reduction processing on the third noise segment according to the reference temporary voiceprint feature vector to obtain a third noise-reduced noise segment; performing damage assessment based on the third noise segment and the third noise-reduced noise segment to obtain a third damage score; if the third damage score is greater than the sixth threshold and the SNR of the third noise segment is less than the seventh threshold, or the third damage score is greater than the eighth threshold and the SNR of the third noise segment is not less than the seventh threshold, sending third prompt information through the terminal device, the third prompt information prompting the current user that the terminal device can enter the PNR mode; after an operation instruction of the current user agreeing to enter the PNR mode is detected, making the terminal device enter the PNR mode to perform noise reduction processing on a fourth noisy speech signal; after an operation instruction of the current user not agreeing to enter the PNR mode is detected, using the non-PNR mode to perform noise reduction processing on the fourth noisy speech signal; the fourth noisy speech signal is determined from the noise signal generated after the third noise segment.
Specifically, if the SNR of the first noise segment is not greater than the first threshold, or the SPL of the first noise segment is not greater than the second threshold, i.e. in a scenario where the target speech features cannot be extracted during the current call: if the terminal device has already stored voiceprint information of a historical user (such as a voiceprint feature vector), and the terminal device detects continuous speech in the input signal (i.e. vad=1) for more than 2 seconds, the terminal device collects that speech signal to obtain the third noise segment, and performs noise reduction processing on the third noise segment based on the stored voiceprint feature vector of the historical user to obtain the third noise-reduced noise segment; damage assessment is performed based on the third noise segment and the third noise-reduced noise segment to obtain the third damage score. When the third damage score is greater than the sixth threshold (for example 8 dB) and the SNR of the third noise segment is less than the seventh threshold (for example 10 dB), or when the third damage score is greater than the eighth threshold (for example 12 dB) and the SNR of the third noise segment is not less than the seventh threshold, the voiceprint features of the current user match the stored voice features, and third prompt information is sent to the user through the terminal device, the third prompt information prompting the current user whether to make the terminal device enter the PNR mode. The third prompt information may be voice information, text information displayed on the display screen of the terminal device, or of course information in other forms, which is not limited here. The user's instruction in response to the prompt information is detected; the instruction may be a voice instruction, a touch instruction, a gesture instruction, or the like. If an operation instruction of the current user agreeing to turn on the PNR function of the terminal device is detected, the terminal device enters the PNR mode and performs noise reduction processing on the fourth noisy speech signal, which is acquired after the third noise segment; if an operation instruction of the current user not agreeing to turn on the PNR function of the terminal device is detected, the traditional noise reduction algorithm continues to be used to perform noise reduction processing on the fourth noisy speech signal.
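The two-branch trigger condition described above can be sketched as follows; the function name and the numeric defaults (taken from the 8 dB / 10 dB / 12 dB examples in the text) are illustrative, not part of the claimed method:

```python
def should_send_third_prompt(third_damage_score, third_segment_snr,
                             sixth_threshold=8.0, seventh_threshold=10.0,
                             eighth_threshold=12.0):
    """Returns True when the stored voiceprint matches the current user
    closely enough to offer PNR mode: either a high damage score at low
    SNR, or an even higher damage score at higher SNR."""
    low_snr_match = (third_damage_score > sixth_threshold
                     and third_segment_snr < seventh_threshold)
    high_snr_match = (third_damage_score > eighth_threshold
                      and third_segment_snr >= seventh_threshold)
    return low_snr_match or high_snr_match
```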
In a feasible embodiment, the method of the present application further includes:
When it is detected that the terminal device is used again, acquiring a second noisy speech signal, and using the traditional noise reduction algorithm, i.e. the non-PNR mode, to perform noise reduction processing on the second noisy speech signal to obtain the current user's noise-reduced speech signal; at the same time, judging whether the SNR of the second noisy speech signal is lower than the fourth threshold; when the SNR of the second noisy speech signal is lower than the fourth threshold, performing speech noise reduction processing on the second noisy speech signal according to the first temporary feature vector to obtain the current user's noise-reduced speech signal; performing damage assessment based on the current user's noise-reduced speech signal and the second noisy speech signal to obtain a second damage score; when the second damage score is not greater than the fifth threshold, sending second prompt information to the current user through the terminal device, the second prompt information prompting the current user that the terminal device can enter the PNR mode; after an operation instruction of the current user agreeing to make the terminal device enter the PNR mode is detected, entering the PNR mode to perform noise reduction processing on a third noisy speech signal, which is acquired after the second noisy speech signal; after an operation instruction of the current user not agreeing to enter the PNR mode is detected, continuing to use the traditional noise reduction algorithm to perform noise reduction processing on the third noisy speech signal.
Specifically, when it is detected that the terminal device is used again for a call, the default microphone of the terminal device collects the second noisy speech signal, and the traditional noise reduction algorithm is used to process the second noisy speech signal and output the current user's noise-reduced speech signal. At the same time, it is judged whether the current environment is noisy, specifically whether the SNR of the second noisy speech signal is less than the fourth threshold; when the SNR of the second noisy speech signal is less than the fourth threshold (for example, SNR less than 10 dB), the current environment is noisy. According to the noise reduction algorithm of the present application, the previously stored speech features (i.e. the above first temporary feature vector) are then used to perform noise reduction processing on the second noisy speech signal to obtain the current user's noise-reduced speech signal; damage assessment is performed based on the current user's noise-reduced speech signal and the second noisy speech signal to obtain the second damage score (for the specific process, refer to the above method, which is not repeated here). If the second damage score is lower than the fifth threshold, the current user matches the speech features represented by the stored first temporary feature vector, and second prompt information is sent to the current user through the terminal device, the second prompt information prompting the current user that the PNR call function of the terminal device can be turned on. If an operation instruction of the current user agreeing to turn on the PNR function of the terminal device is detected, the terminal device enters the PNR mode and performs noise reduction processing on the third noisy speech signal, which is acquired after the second noisy speech signal; if an operation instruction of the current user not agreeing to turn on the PNR function of the terminal device is detected, the traditional noise reduction algorithm continues to be used to perform noise reduction processing on the third noisy speech signal.
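A minimal sketch of the gating logic in this re-use flow; the patent names only a "fourth threshold" and "fifth threshold", so the numeric defaults (the 10 dB SNR value follows the example in the text, the fifth-threshold value is invented) and the function name are illustrative:

```python
def should_send_second_prompt(second_snr_db, second_damage_score,
                              fourth_threshold_db=10.0, fifth_threshold=8.0):
    """Only in a noisy environment (SNR below the fourth threshold) is the
    stored feature vector tried; the second prompt is then sent when the
    damage score does not exceed the fifth threshold."""
    if second_snr_db >= fourth_threshold_db:
        return False  # environment not noisy enough to bother the user
    return second_damage_score <= fifth_threshold
```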
In a feasible embodiment, whether to perform noise reduction with the traditional noise reduction algorithm or with the noise reduction method disclosed in the present application can be determined in the following way; the method of the present application further includes:
Acquiring a first noise segment and a second noise segment of the environment in which the terminal device is located, the first noise segment and the second noise segment being temporally continuous noise segments; acquiring signals collected by a microphone array of an auxiliary device of the terminal device for the environment in which the terminal device is located, and calculating the DOA and SPL of the first noise segment from the collected signals; if the DOA of the first noise segment is greater than the ninth threshold and less than the tenth threshold, and the SPL of the first noise segment is greater than the eleventh threshold, extracting a second temporary feature vector of the first noise segment; performing noise reduction processing on the second noise segment based on the second temporary feature vector to obtain a fourth noise-reduced noise segment; performing damage assessment based on the fourth noise-reduced noise segment and the second noise segment to obtain a fourth damage score; if the fourth damage score is not greater than the twelfth threshold, entering the PNR mode.
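The DOA/SPL gate described above can be sketched as follows; the patent gives no numeric values for the ninth, tenth and eleventh thresholds, so the defaults here are purely illustrative assumptions:

```python
def passes_doa_spl_gate(doa_degrees, spl_db,
                        doa_low=60.0, doa_high=120.0, spl_min=60.0):
    """Gate from the text: the first noise segment qualifies for temporary
    voiceprint extraction when its DOA lies strictly between the ninth and
    tenth thresholds and its SPL exceeds the eleventh threshold."""
    return doa_low < doa_degrees < doa_high and spl_db > spl_min
```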
Acquiring the first noisy speech signal includes:
Determining the first noisy speech signal from the noise signal generated after the first noise segment; the feature vector of the registered speech signal includes the second temporary feature vector.
Calculating the DOA and SPL of the first noise segment from the collected signals may specifically include:
Performing time-frequency transformation on the signals collected by the microphone array to obtain a nineteenth frequency-domain signal, and calculating the DOA and SPL of the first noise segment based on the nineteenth frequency-domain signal.
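The patent does not give a formula for SPL; the sketch below uses a common RMS-based level estimate relative to an assumed full-scale reference, which may differ from the actual implementation:

```python
import math

def sound_pressure_level_db(frame, reference=1.0):
    """Approximate level of one audio frame in dB relative to `reference`
    full-scale amplitude (the reference the patent uses is unspecified).
    `frame` is a list of float samples, nominally in [-1, 1]."""
    if not frame:
        raise ValueError("empty frame")
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    if rms == 0.0:
        return -float("inf")  # digital silence
    return 20.0 * math.log10(rms / reference)
```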
Further, if the fourth damage score is not greater than the twelfth threshold, the method of the present application further includes:
Sending fourth prompt information through the terminal device, the fourth prompt information prompting whether to make the terminal device enter the PNR mode; the PNR mode is entered only after an operation instruction of the target user agreeing to enter the PNR mode is detected.
In a specific scenario, the terminal device is connected to a computer (one kind of auxiliary device), either in a wired manner or in a wireless manner; the microphone array of the computer collects signals of the environment in which the terminal device is located, and the terminal device then acquires the signals collected by the microphone array and processes them in the manner described above, which is not repeated here.
It should be noted here that, after the first temporary feature vector or the second temporary feature vector is extracted, the terminal device stores it and retrieves it directly when it is needed later, which avoids the situation in which the current user's speech features cannot be obtained in a subsequent high-noise scenario, making damage assessment impossible.
Various noise reduction manners are disclosed in the present application. For different scenarios, whether to enter the PNR mode can be judged based on scenario information, the target user or object automatically identified, and the corresponding noise reduction manner selected:
When it is detected that the terminal device is in the handheld call state, the PNR mode is not entered.
When it is detected that the terminal device is in the hands-free call state, the PNR mode is entered, and the owner whose voiceprint features have been registered is taken as the target user; t seconds of the current user's speech signal during the call are acquired for voiceprint recognition, and the recognition result is compared with the registered voiceprint features. If it is determined that the current user is not the owner, the acquired t seconds of the current user's speech signal during the call are used as that user's registered speech signal, the current user is taken as the target user, and noise reduction is performed in the manner described in Manner 1; the above t may be 3 or another value.
When it is detected that the terminal device is in a video call state, the PNR mode is entered, and while the terminal device is in the video call, face recognition is performed on the image collected by the camera to determine the identity of the current user in the image; if the image contains multiple people, the person closest to the camera is taken as the current user, and the distance between a person in the image and the camera can be determined by a sensor such as a depth sensor on the terminal device. After the current user is determined, the terminal device checks whether the current user's registered voice or the current user's voice features are already stored; if so, the current user is determined to be the target user, and the current user's registered voice or voice features are used as the current user's voice-related data. If the terminal device has not stored the current user's registered voice or voice features, the terminal device detects whether the current user is speaking by a lip-shape detection method, and when it detects that the current user is speaking, it cuts the current user's speech signal out of the speech signal collected by the microphone as the current user's registered voice; the current user's registered voice may be obtained by concatenating multiple signal segments, with a total duration of not less than 6 s. The first noisy speech signal is acquired through the microphone of the terminal device, and noise reduction processing is performed in the manner described in Manner 1 or Manner 4.
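A minimal sketch of assembling the registered voice by concatenating detected speech segments until the 6 s minimum mentioned above is reached; the segment representation (lists of samples), the sample rate, and the function name are assumptions:

```python
def build_registration_voice(segments, min_total_seconds=6.0,
                             sample_rate=16000):
    """Concatenate speech segments in order until their total duration
    reaches `min_total_seconds`; returns the concatenated samples, or
    None if the available speech is still too short."""
    needed = int(min_total_seconds * sample_rate)
    collected = []
    for seg in segments:
        collected.extend(seg)
        if len(collected) >= needed:
            return collected
    return None
```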
When it is detected that the terminal device is connected to an earphone and the terminal device is in a call state, the PNR mode is entered, and the terminal device detects whether the earphone has a bone voiceprint sensor. If it does, the target user's VPU signal is collected through the bone voiceprint sensor of the earphone, and noise reduction processing is performed in the manners described in Manner 2, Manner 3 and Manner 4. If the earphone does not have a bone voiceprint sensor, the user whose speech signal has been registered in the earphone is taken as the target user by default, and that user's registered voice and the first noisy speech signal collected by the earphone are sent to the terminal device, which performs noise reduction in the manners described in Manner 1 and Manner 4. If no one's speech signal is registered in the earphone, the call speech of the user currently wearing the earphone is acquired through the microphone of the earphone, some segments of that speech are used as the user's registered voice, and the registered voice and the first noisy speech signal collected by the earphone are sent to the terminal device, which performs noise reduction in the manners described in Manner 1 and Manner 4.
When it is detected that the terminal device is connected to a smart device (such as a smart large-screen device, a smart watch, or an in-vehicle Bluetooth device) and is in a video call state, the PNR mode is entered, and it is judged whether the current user's registered speech signal is already stored in the terminal device. If it is, the first noisy speech signal is collected through the smart device and sent to the terminal device, which performs noise reduction in the manners described in Manner 1 and Manner 4.
In a feasible embodiment, since PNR is mainly used in environments with relatively strong noise, and the user is not necessarily always in such an environment, an interface may be provided, during the use of a specific function or the execution of an application, for the user to set the PNR function of that function or application. The application may be any application that needs a specific speech enhancement function, such as calls, the voice assistant, Changlian, or the recorder; the specific function may be any function that needs to record local speech, such as answering a call, video recording, or using the voice assistant. As shown in the left diagram of Figure 16, the display interface of the terminal device displays 3 function labels and 3 PNR control buttons corresponding to those function labels; through these 3 PNR control buttons, the user can separately turn the PNR function of each of the 3 functions on or off. As shown in the left diagram of Figure 16, the PNR functions corresponding to calls and the voice assistant are on, and the PNR function for video recording is off. As shown in the right diagram of Figure 16, the display interface of the terminal device displays 5 application labels and 5 PNR control buttons corresponding to those application labels; through these 5 PNR control buttons, the user can separately turn the PNR function of each of the 5 applications on or off. As shown in the right diagram of Figure 16, the PNR functions of Changba, the recorder and Changlian are on, and the PNR functions of calls and WeChat are off. It should be pointed out that if, for example, the PNR function for calls is on, then when the user uses the terminal device to make a call, the terminal device directly enters the PNR mode. In this way, for the different speech functions of the terminal device, the user can flexibly set whether to enable the PNR function.
Figure 17 shows the display interface of the terminal device, taking the "Call" application / "answer a call" function as an example; this interface provides a switch for enabling the PNR function, such as the "Enable PNR" function button in Figure 17. The left diagram in Figure 17 is a schematic diagram of the display interface of the terminal device when a call comes in; this display interface shows the caller's information, the "Enable PNR" function button, the "Hang up" function button and the "Answer" function button. The right diagram in Figure 17 is a schematic diagram of the display interface of the terminal device while a call is being answered; this display interface shows the caller's information, the "Enable PNR" function button and the "Hang up" function button.
It should be pointed out here that some specific functions of the terminal device in the present application are essentially functions of applications installed on the terminal device. For example, the call function of the terminal device is implemented through the "Phone" application.
Optionally, after the target user's operation on the "Enable PNR" function button on the call interface (the interface shown in Figure 17) is detected, the display interface of the terminal device jumps to the interface shown in the left diagram of Figure 15; the target user can adjust the magnitude of the interference noise suppression coefficient by controlling the gray knob in Figure 15, thereby adjusting the strength of the noise reduction.
Through the UI interface shown in Figure 16, the target user can flexibly turn the PNR function of a specific function or application on or off as needed.
In a feasible embodiment, in order to reduce the user's operations, the present application further includes: judging whether the decibel value of the current ambient sound exceeds a preset decibel value (for example 50 dB), or detecting whether the current ambient sound contains the voice of a non-target user; if the decibel value of the current ambient sound exceeds the preset decibel value, or the voice of a non-target user is detected in the current ambient sound, the PNR function is enabled. When the target user then uses the terminal device and needs noise reduction, the PNR mode is entered directly; in other words, for any specific function or application of the terminal device, the corresponding PNR function can be enabled in the above manner.
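The smart trigger described above amounts to a simple disjunction; a sketch, with the 50 dB example from the text as the default and a hypothetical function name:

```python
def should_auto_enable_pnr(ambient_spl_db, contains_non_target_voice,
                           spl_threshold_db=50.0):
    """Enable PNR when the ambient level exceeds the preset decibel value
    (e.g. 50 dB) or a non-target user's voice is detected."""
    return ambient_spl_db > spl_threshold_db or contains_non_target_voice
```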
Further, when the target user taps PNR as shown in diagram a of Figure 18 and enters the PNR settings interface, the target user can enable the "smart enable" function of PNR through the "smart enable" switch key shown in diagram b of Figure 18. After the PNR smart-enable function is turned on, the PNR function can be enabled in the above manner for specific functions or applications of the terminal device. When the "smart enable" function of PNR is turned off, the display interface of the terminal device shows the content in diagram c of Figure 18; the target user can then turn the PNR function of a specific function or application on or off as needed through the PNR function key corresponding to that function or application.
Enabling the smart PNR function as described above makes the terminal device more intelligent, reduces the user's operations, and gives a better user experience.
In a feasible embodiment, in a call scenario, after the terminal device (also the local device) enables the PNR function, only the peer user knows the effect of the call with the PNR function enabled; it is difficult for the target user to judge whether enabling the PNR function, or the configured noise reduction strength, lets the peer user hear clearly. Whether the PNR function of the terminal device is enabled, and the noise reduction strength, can therefore be set by the peer device.
After the peer device (that is, another terminal device) detects an operation of the peer device's user to enable the PNR function of the terminal device, the peer device sends a speech enhancement request to the terminal device, the speech enhancement request requesting that the PNR function of the call function of the terminal device be enabled. After the terminal device receives the speech enhancement request, in response to it, a reminder label, i.e. the third prompt information, is displayed on the display interface of the terminal device; the reminder label reminds the target user that the peer device requests to enable the PNR function of the call function of the local device, and asks whether to enable it. The reminder label also includes a confirmation function button. When the terminal device detects the target user's operation on the confirmation function button, the terminal device enables the PNR function of the call function, enters the PNR mode, and sends a response message to the peer device; the response message responds to the above speech enhancement request and informs the peer device that the PNR function of the terminal device has been enabled. After receiving the response message, the peer device displays a prompt label on its display interface; the prompt label informs the user of the peer device that the target user's speech has been enhanced.
Optionally, after the terminal device (also the local device) enables the PNR function for calls, the peer device sends an interference noise suppression coefficient to the terminal device to adjust the noise reduction strength of the terminal device; or the speech enhancement request sent by the peer device to the terminal device carries the interference noise suppression coefficient. Optionally, when the peer device sends the interference noise suppression coefficient to the terminal device, the peer device also sends the target user's speech enhancement coefficient to the terminal device.
Taking a call between user A and user B as an example, as shown in Figure 19, user A's terminal device (the peer device's counterpart, i.e. the above terminal device and local device) and user B's terminal device (the peer device) transmit voice data through a base station, implementing the call between user A and user B. The environment in which user A is located is very noisy, and user B cannot hear clearly what user A is saying. User B taps the "Enhance the other party's voice" function button displayed on the display interface of user B's terminal device to enhance user A's voice. After detecting user B's operation on the "Enhance the other party's voice" function button, as shown in diagram a of Figure 20, user B's terminal device sends a speech enhancement request to user A's terminal device, the speech enhancement request requesting that user A's terminal device enable the PNR function of the call function. After user A's terminal device receives the speech enhancement request, a reminder label is displayed on the display interface of user A's terminal device, as shown in diagram b of Figure 20; the reminder label shows "The other party requests to enhance your voice. Accept?" to remind user A that user B requests to enhance user A's voice. If user A agrees, user A taps the "Accept" function button displayed on the display interface of user A's terminal device. After user A's terminal device detects user A's operation on the "Accept" function button, user A's terminal device enables the PNR function of the call function and sends a response message to user B's terminal device through the base station; the response message informs user B's terminal device that the PNR function of the call function of user A's terminal device has been enabled. After user B's terminal device receives the above response message fed back via the base station, it displays the prompt label "Enhancing the other party's voice" on its display interface to inform user B that user A's voice has been enhanced, as shown in diagram c of Figure 20.
It should be understood that the terminal device (the local device) may likewise control the peer device to enable the PNR function for calls in the manner described above.
It should be pointed out here that the data transmitted between the terminal device and the peer device (including the speech enhancement request, the response message, and the like) is transmitted over a communication link established based on the phone number of the terminal device and the phone number of the peer device.
During a call, the user of the peer device may decide, according to the quality of the target user's voice that he or she hears, whether to control the local device to enable the PNR function for calls; likewise, the target user may decide, according to the quality of the voice of the peer device's user that he or she hears, whether to control the peer device to enable the PNR function for calls, thereby improving the efficiency of the call for both parties.
In a feasible embodiment, in a video recording scenario — for example, when a parent records a video of a child — the child is relatively far from the terminal device (for example, the recording terminal) while the parent is relatively close, so that in the recorded video the child's voice is quiet and the parent's voice is loud, whereas what is actually wanted is a video in which the child's voice is loud and the parent's voice is attenuated or even absent. To address this problem, this application provides the following solutions.
When recording a video or during a video call, the display interface of the terminal device includes a first area and a second area. The first area is used to display the video recording result or the content of the video call in real time, and the second area is used to display controls, with corresponding labels, for adjusting the speech enhancement coefficients of multiple objects (i.e., target users). After the noise-reduced speech signals of the multiple objects are obtained according to the fourth, fifth, or sixth manner described above, the speech enhancement coefficients of the multiple objects are obtained based on operation instructions performed by the user of the terminal device on the controls for adjusting the speech enhancement coefficients of the multiple objects; enhancement processing is then performed on the noise-reduced speech signals of the multiple objects according to their respective speech enhancement coefficients to obtain the enhanced speech signals of the multiple objects, and an output signal is then obtained based on the enhanced speech signals of the multiple objects. The output signal is obtained by mixing the enhanced speech signals of the multiple objects.
Optionally, after the noise-reduced speech signals and the interference noise signal of the multiple objects are obtained according to the fourth or sixth manner, the speech enhancement coefficients of the multiple objects are obtained in the manner described above; enhancement processing is then performed on the noise-reduced speech signals of the multiple objects according to their respective speech enhancement coefficients to obtain the enhanced speech signals of the multiple objects, and an output signal is then obtained based on the enhanced speech signals of the multiple objects and the interference noise signal. The output signal is obtained by mixing the enhanced speech signals of the multiple objects with the interference noise signal.
Optionally, after the noise-reduced speech signals and the interference noise signal of the multiple objects are obtained according to the fourth or sixth manner, the second area is further used to display a control for adjusting the interference noise suppression coefficient. The speech enhancement coefficients of the multiple objects and the interference noise suppression coefficient are obtained based on operation instructions performed by the user of the terminal device on the controls for adjusting the speech enhancement coefficients of the multiple objects and on the control for adjusting the interference noise suppression coefficient. Enhancement processing is then performed on the noise-reduced speech signals of the multiple objects according to their respective speech enhancement coefficients to obtain the enhanced speech signals of the multiple objects; suppression processing is performed on the interference noise signal according to the interference noise suppression coefficient to obtain an interference noise suppression signal; and an output signal is then obtained based on the enhanced speech signals of the multiple objects and the interference noise suppression signal. The output signal is obtained by mixing the enhanced speech signals of the multiple objects with the interference noise suppression signal.
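The mixing variants above share the same arithmetic: scale each object's noise-reduced signal by its speech enhancement coefficient, optionally scale the interference noise by the suppression coefficient, and sum. A minimal sketch in Python — the per-sample scaling and summation are assumptions for illustration only; the patent does not fix a concrete mixing rule:

```python
# Hypothetical sketch of mixing per-object enhanced speech with suppressed
# interference noise. All signal values and coefficients are made up.

def enhance(denoised, coef):
    """Scale one object's noise-reduced signal by its speech enhancement coefficient."""
    return [coef * s for s in denoised]

def suppress(noise, coef):
    """Scale the interference noise signal by the interference noise suppression coefficient."""
    return [coef * s for s in noise]

def mix(signals):
    """Sample-wise sum of equal-length signals (assumed fusion rule)."""
    return [sum(samples) for samples in zip(*signals)]

# Two objects' noise-reduced signals plus the separated interference noise.
obj1 = [0.2, 0.4, 0.2]
obj2 = [0.1, 0.1, 0.3]
noise = [0.5, 0.5, 0.5]

enhanced = [enhance(obj1, 2.0), enhance(obj2, 1.0)]  # per-object coefficients
residual_noise = suppress(noise, 0.1)                # strong suppression
output = mix(enhanced + [residual_noise])
```

Dragging an object's slider in the second area would simply change the coefficient passed to `enhance` before remixing.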
It should be pointed out here that the voice samples of the multiple objects have all been registered.
Take the case where object 2 records a video of object 1 as an example. As shown in Figure 21, the display interface of the terminal device includes an area for displaying the video recording result for object 1, and controls for adjusting the speech enhancement coefficient of object 1 and the speech enhancement coefficient of object 2; each control includes a bar-shaped slider and a slider button. Object 2 can adjust the magnitude of object 1's speech enhancement coefficient by dragging object 1's slider button along the slider, and can adjust the magnitude of object 2's speech enhancement coefficient by dragging object 2's slider button along the slider, thereby adjusting the loudness of object 1 and object 2 in the recorded video.
It should be pointed out that object 2 is the person shooting the video and is therefore not shown in Figure 21.
In a video call scenario, for example a video call between family members, as shown in Figure 22, the terminal device is held by the daughter (object 1), the mother (object 2) is cooking at some distance behind the daughter, and the father is at the far end; the father wants to hear the mother speak but cannot hear her clearly. Object 1 can drag object 2's slider button along the slider to increase object 2's speech enhancement coefficient, thereby amplifying object 2's voice, that is, the mother's voice.
Optionally, as shown in the left part of Figure 23, the controls for adjusting the speech enhancement coefficients of object 1 and object 2 are not displayed when no adjustment is needed. When the terminal device detects an operation by object 1 indicating that the speech enhancement coefficient of object 1 or object 2 needs to be adjusted, the control for adjusting that coefficient is displayed on the display interface of the terminal device. As shown in the right part of Figure 23, when object 1 needs to adjust object 2's speech enhancement coefficient, object 1 long-presses or taps the display area of object 2 on the display interface (other operations are of course also possible); after detecting object 1's operation, the terminal device displays the control for adjusting object 2's speech enhancement coefficient on the display interface, and object 1 then adjusts the coefficient by sliding this control. If, within a certain period of time, the terminal device detects no operation on the control for adjusting object 2's speech enhancement coefficient, it hides that control.
It should be pointed out that after the terminal device detects an operation on the area in which object 2 is displayed, the terminal device determines the speech signal feature of object 2 from a database that stores the speech signal features corresponding to each object, and then performs noise reduction in the noise reduction manner of this application.
It should be understood that the operations on the area in which object 2 is displayed include but are not limited to a long press and a tap; other forms of operation are of course also possible.
When the terminal device detects a tap, a long press, or another operation on the display interface, the terminal device first needs to identify the object displayed in the operated area, then determines, based on the pre-recorded association between objects and speech signals, which speech signal needs to be enhanced, and then sets the corresponding speech enhancement coefficient.
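This lookup chain — operated region to displayed object, object to its pre-recorded speech signal feature, then a coefficient update — can be sketched minimally as follows; all region names, object identifiers, and voiceprint labels are hypothetical placeholders, not values from the patent:

```python
# Illustrative (assumed) association tables linking screen regions, objects,
# and stored speech signal features.
region_to_object = {"upper_left": "object1", "lower_right": "object2"}
object_to_voiceprint = {"object1": "voiceprint_obj1", "object2": "voiceprint_obj2"}
enhancement_coefs = {"object1": 1.0, "object2": 1.0}

def on_tap(region, new_coef):
    """Resolve the tapped region to an object, identify which speech signal
    to enhance via the pre-recorded association, and set its coefficient."""
    obj = region_to_object[region]
    voiceprint = object_to_voiceprint[obj]
    enhancement_coefs[obj] = new_coef
    return obj, voiceprint

obj, vp = on_tap("lower_right", 1.8)
print(obj, vp, enhancement_coefs["object2"])  # object2 voiceprint_obj2 1.8
```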
In a feasible embodiment, when the terminal device is an intelligent interactive device, the target-speech-related data includes a speech signal containing a wake-up word, and the noisy speech signal includes an audio signal containing a command word.
The above intelligent interactive device is a device capable of voice interaction with a user, for example a robot vacuum cleaner, a smart speaker, or a smart refrigerator.
For smart speakers and intelligent robots, user identity often cannot be strictly restricted. For example, a smart speaker used in a home must not only allow every family member to control it by voice, but must also allow visiting guests to interact with it by voice. Family members' voices can be collected and registered in advance, but this is impossible for guests who visit temporarily. An intelligent robot providing public services must respond to every possible user, and likewise cannot require all possible users to register their voices in advance. In use, however, these devices often face complex situations with noisy backgrounds and many speakers, so there is an even stronger need to enhance the target user's speech and suppress other interference. To address this need, this application provides the following solution.
Take the voice commands of a smart speaker as an example. A microphone collects audio signals, and a voice wake-up module analyzes the collected audio signals to determine whether to wake up the device. The voice wake-up module first detects the collected signals and segments out the speech segments, and then performs wake-up word recognition on the speech segments to determine whether they contain the set wake-up word. For example, when controlling a smart speaker by voice command, the user generally needs to speak the wake-up word first, such as "Little A, Little A".
The audio signal containing the wake-up word obtained by the voice wake-up module is used as the registered speech signal of the target user, and the microphone then collects an audio signal containing the user's voice command. In general, after waking up the device the user speaks a specific command, such as "What's the weather like tomorrow?" or "Please play Where Is Spring".
The user who spoke the wake-up word is taken as the target user, and the audio signal containing the voice command is taken as the noisy speech signal; noise reduction processing is performed in the first manner to obtain the target user's enhanced speech signal or an output signal, in which the speech signal of the target user who spoke the wake-up word is enhanced while other interfering speakers and background noise are effectively suppressed.
It is determined whether a new wake-up word utterance appears; if so, the new speech signal containing the wake-up word is used as the registered speech signal of a new target user, and the user who spoke the new wake-up word utterance becomes the target user.
For example, user C speaks the wake-up word "Little A, Little A", after which user C can continue to control the smart speaker by voice; at this time user B cannot control the smart speaker by voice. Only after user B speaks the wake-up word "Little A, Little A" does user B take over control of the speaker; at that point user C's voice commands are no longer responded to by the speaker, and user C can take over control again only by saying "Little A, Little A" once more.
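The takeover behavior in this example amounts to a small state machine: whoever most recently spoke the wake-up word becomes the target user, and only that user's commands are acted on. A sketch, with class and method names chosen for illustration rather than taken from the patent:

```python
# Assumed wake-word takeover logic: the latest wake-word speaker is
# (re-)registered as the target user; other speakers are treated as
# interference and ignored.
WAKE_WORD = "Little A Little A"

class SmartSpeaker:
    def __init__(self):
        self.target_user = None  # registered from the latest wake-word utterance

    def hear(self, speaker, utterance):
        if utterance == WAKE_WORD:
            self.target_user = speaker          # re-register: new target user
            return "awake"
        if speaker == self.target_user:
            return f"executing command from {speaker}"
        return "ignored"                        # suppressed as interfering speech

box = SmartSpeaker()
assert box.hear("userC", WAKE_WORD) == "awake"
assert box.hear("userC", "what's the weather tomorrow?") == "executing command from userC"
assert box.hear("userB", "play music") == "ignored"
assert box.hear("userB", WAKE_WORD) == "awake"   # user B takes over control
assert box.hear("userC", "play music") == "ignored"
```

In the actual scheme, "registration" means using the wake-word audio itself as the enrollment signal for the noise reduction model, not merely storing a name as above.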
It can be seen that this embodiment provides a solution that enhances the target person's speech and suppresses other background noise and interfering speech without prior voice registration and without relying on images or other sensor information, and is suitable for multi-user devices with transient users, such as smart speakers and intelligent robots.
It can be seen that in the solution of this application, noise reduction processing is performed on the noisy speech signal by means of the target-speech-related data and the speech noise reduction model to obtain the target user's noise-reduced speech signal, thereby enhancing the target user's speech and suppressing interference noise. By introducing the speech enhancement coefficient and the interference noise suppression coefficient, users can adjust the noise reduction strength as needed. A speech noise reduction model based on a TCN or an FTB+GRU structure is used for noise reduction, giving low latency in voice or video calls and good subjective listening quality. The noise reduction manner of this application can also be applied in multi-person scenarios, meeting the need for noise reduction for multiple people in multi-user scenarios. In a video call scenario, targeted noise reduction can be performed based on the video scene captured by the camera: the target user can be identified automatically, and the voiceprint information corresponding to the target user can be retrieved from a database for noise reduction, improving the user experience. In a call or video call scenario, enabling the PNR function based on the peer user's noise reduction needs can improve the call quality for both parties, and automatically enabling the PNR function by the method of this application can improve usability.
Referring to Figure 24, Figure 24 is a schematic structural diagram of a terminal device provided by an embodiment of this application. As shown in Figure 24, the terminal device 2400 includes:
an acquiring unit 2401, configured to acquire a noisy speech signal and target-speech-related data after the terminal device enters the PNR mode, where the noisy speech signal contains an interference noise signal and the speech signal of a target user, and the target-speech-related data is used to indicate the speech features of the target user; and
a noise reduction unit 2402, configured to perform noise reduction processing on the first noisy speech signal through a trained speech noise reduction model according to the target-speech-related data to obtain the noise-reduced speech signal of the target user, where the speech noise reduction model is implemented based on a neural network.
In a feasible embodiment, the acquiring unit 2401 is further configured to acquire the speech enhancement coefficient of the target user; and
the noise reduction unit 2402 is further configured to perform enhancement processing on the target user's noise-reduced speech signal based on the target user's speech enhancement coefficient to obtain the target user's enhanced speech signal, where the ratio of the amplitude of the target user's enhanced speech signal to the amplitude of the target user's noise-reduced speech signal is the target user's speech enhancement coefficient.
Further, the acquiring unit 2401 is further configured to acquire an interference noise suppression coefficient after the interference noise signal is also obtained through the noise reduction processing; and
the noise reduction unit 2402 is further configured to perform noise reduction processing on the interference noise signal based on the interference noise suppression coefficient to obtain an interference noise suppression signal, where the ratio of the amplitude of the interference noise suppression signal to the amplitude of the interference noise signal is the interference noise suppression coefficient, and to fuse the interference noise suppression signal with the target user's enhanced speech signal to obtain an output signal.
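Both coefficients above are defined as amplitude ratios: the enhanced signal is the noise-reduced signal scaled by the speech enhancement coefficient, and the suppression signal is the interference noise scaled by the suppression coefficient. A minimal sketch, assuming sample-wise scaling and additive fusion (the patent does not specify a concrete fusion rule):

```python
# Assumed per-sample implementation of the amplitude-ratio definitions.
def apply_gain(signal, coef):
    return [coef * s for s in signal]

denoised = [0.3, -0.6, 0.3]   # target user's noise-reduced speech (made-up values)
noise = [0.2, 0.2, -0.4]      # separated interference noise (made-up values)

alpha, beta = 1.5, 0.2        # speech enhancement / noise suppression coefficients
enhanced = apply_gain(denoised, alpha)
suppressed = apply_gain(noise, beta)

# The coefficient is exactly the ratio of output amplitude to input amplitude.
assert abs(max(map(abs, enhanced)) / max(map(abs, denoised)) - alpha) < 1e-9
assert abs(max(map(abs, suppressed)) / max(map(abs, noise)) - beta) < 1e-9

output = [e + n for e, n in zip(enhanced, suppressed)]  # assumed fusion: addition
```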
In a feasible embodiment,
the acquiring unit 2401 is further configured to acquire an interference noise suppression coefficient after the interference noise signal is also obtained through the noise reduction processing; and
the noise reduction unit 2402 is further configured to perform suppression processing on the interference noise signal based on the interference noise suppression coefficient to obtain an interference noise suppression signal, where the ratio of the amplitude of the interference noise suppression signal to the amplitude of the interference noise signal is the interference noise suppression coefficient, and to fuse the interference noise suppression signal with the target user's noise-reduced speech signal to obtain an output signal.
In a feasible embodiment, there are M target users, the target-speech-related data includes the speech-related data of the M target users, the noise-reduced speech signal of the target user includes the noise-reduced speech signals of the M target users, the speech enhancement coefficient of the target user includes the speech enhancement coefficients of the M target users, and M is an integer greater than 1. In the aspect of performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the target-speech-related data to obtain the noise-reduced speech signal of the target user, the noise reduction unit 2402 is specifically configured to:
for any target user A among the M target users, perform noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the speech-related data of target user A to obtain the noise-reduced speech signal of target user A.
In the aspect of performing enhancement processing on the target user's noise-reduced speech signal based on the target user's speech enhancement coefficient to obtain the target user's enhanced speech signal, the noise reduction unit 2402 is specifically configured to:
perform enhancement processing on target user A's noise-reduced speech signal based on target user A's speech enhancement coefficient to obtain target user A's enhanced speech signal, where the ratio of the amplitude of target user A's enhanced speech signal to the amplitude of target user A's noise-reduced speech signal is target user A's speech enhancement coefficient. By processing the noise-reduced speech signal of each of the M target users in this manner, the enhanced speech signals of the M target users can be obtained.
The noise reduction unit 2402 is further configured to obtain an output signal based on the enhanced speech signals of the M target users.
In a feasible embodiment, there are M target users, the target-speech-related data includes the speech-related data of the M target users, the noise-reduced speech signal of the target user includes the noise-reduced speech signals of the M target users, and M is an integer greater than 1. In the aspect of performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the target-speech-related data to obtain the noise-reduced speech signal of the target user and the interference noise signal, the noise reduction unit 2402 is specifically configured to:
perform noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the speech-related data of the first of the M target users, to obtain the noise-reduced speech signal of the first target user and a first noisy speech signal that does not contain the speech signal of the first target user; perform noise reduction processing on the first noisy speech signal that does not contain the speech signal of the first target user through the speech noise reduction model according to the speech-related data of the second of the M target users, to obtain the noise-reduced speech signal of the second target user and a first noisy speech signal that contains neither the speech signal of the first target user nor the speech signal of the second target user; and repeat this process until noise reduction processing is performed, according to the speech-related data of the M-th target user, on the first noisy speech signal that does not contain the speech signals of the 1st to (M-1)-th target users through the speech noise reduction model, obtaining the noise-reduced speech signal of the M-th target user and the interference noise signal. At this point, the noise-reduced speech signals of the M target users and the interference noise signal have been obtained.
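The sequential scheme above can be sketched as a loop that repeatedly splits its input into one user's speech and a residual; what remains after the M-th user is the interference noise signal. The toy "model" below tags each sample with its source purely for illustration, whereas the real speech noise reduction model operates on learned features:

```python
# Toy stand-in for the sequential per-user noise reduction described above.
def denoise_step(noisy, registered):
    """Split the input into the registered user's speech and a residual that
    no longer contains that user's speech. The per-sample source tag is an
    illustrative assumption, not how a neural model works."""
    user_speech = [s if who == registered else 0.0 for s, who in noisy]
    residual = [(s, who) for s, who in noisy if who != registered]
    return user_speech, residual

# Each (sample, source) pair is hypothetical.
noisy = [(0.3, "user1"), (0.2, "user2"), (0.5, "noise"), (0.4, "user1")]

residual = noisy
denoised = {}
for user in ["user1", "user2"]:          # M = 2 target users, processed in order
    denoised[user], residual = denoise_step(residual, user)

interference_noise = [s for s, _ in residual]  # residual after the last user
```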
In a feasible embodiment, there are M target users, the target-speech-related data includes the speech-related data of the M target users, the noise-reduced speech signal of the target user includes the noise-reduced speech signals of the M target users, and M is an integer greater than 1. In the aspect of performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the target-speech-related data to obtain the noise-reduced speech signal of the target user and the interference noise signal, the noise reduction unit 2402 is specifically configured to:
perform noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the speech-related data of the M target users, to obtain the noise-reduced speech signals of the M target users and the interference noise signal.
In a feasible embodiment, there are M target users, the target user's related data includes the target user's registered speech signal, the registered speech signal being a speech signal of the target user collected in an environment whose noise decibel value is lower than a preset value, and the speech noise reduction model includes a first encoding network, a second encoding network, a TCN, and a first decoding network.
In the aspect of performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the target-speech-related data to obtain the noise-reduced speech signal of the target user, the noise reduction unit 2402 is specifically configured to:
use the first encoding network and the second encoding network to perform feature extraction on the target user's registered speech signal and the first noisy speech signal respectively, to obtain the feature vector of the target user's registered speech signal and the feature vector of the first noisy speech signal; obtain a first feature vector according to the feature vector of the target user's registered speech signal and the feature vector of the noisy speech signal; obtain a second feature vector according to the TCN and the first feature vector; and obtain the target user's noise-reduced speech signal according to the first decoding network and the second feature vector.
Further, the noise reduction unit 2402 is further configured to:
obtain the interference noise signal as well according to the first decoding network and the second feature vector.
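The data flow just described — two encoders, a combined first feature vector, a TCN, and a first decoding network with an optional noise output — can be sketched structurally as follows. Every "network" here is a toy placeholder function (a mean, a concatenation, an identity, a fixed gain), not the patent's trained neural networks:

```python
# Structural sketch of the encoder / TCN / decoder data flow; all internals
# are illustrative assumptions.
def encode(signal):
    # placeholder encoder: the signal mean as a 1-D "feature vector"
    return [sum(signal) / len(signal)]

def combine(reg_feat, noisy_feat):
    # first feature vector: concatenation of the two encoder outputs
    return reg_feat + noisy_feat

def tcn(features):
    # placeholder temporal convolutional network: identity mapping
    return list(features)

def decode(features, want_noise=False):
    # placeholder first decoding network with an optional noise branch
    speech = [f * 0.9 for f in features]
    return (speech, [f * 0.1 for f in features]) if want_noise else speech

registered = [0.2, 0.4]       # target user's enrollment speech (made up)
noisy = [0.5, 0.1, 0.6]       # first noisy speech signal (made up)

first_vec = combine(encode(registered), encode(noisy))
second_vec = tcn(first_vec)
denoised, noise = decode(second_vec, want_noise=True)
```

The point of the sketch is the wiring order (encode both inputs, combine, TCN, decode), which matches the description above; the arithmetic inside each step is not.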
In a feasible embodiment, target user A's related data includes target user A's registered speech signal, the registered speech signal being a speech signal of target user A collected in an environment whose noise decibel value is lower than a preset value, and the speech noise reduction model includes a first encoding network, a second encoding network, a TCN, and a first decoding network. In the aspect of performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to target user A's speech-related data to obtain target user A's noise-reduced speech signal, the noise reduction unit 2402 is specifically configured to:
use the first encoding network and the second encoding network to perform feature extraction on target user A's registered speech signal and the first noisy speech signal respectively, to obtain the feature vector of target user A's registered speech signal and the feature vector of the first noisy speech signal; obtain a first feature vector according to the feature vector of target user A's registered speech signal and the feature vector of the first noisy speech signal; obtain a second feature vector according to the TCN and the first feature vector; and obtain target user A's noise-reduced speech signal according to the first decoding network and the second feature vector.
In a feasible embodiment, the related data of the i-th target user among the M target users includes the registered speech signal of the i-th target user, i is an integer greater than 0 and less than or equal to M, and the speech noise reduction model includes a first encoding network, a second encoding network, a TCN, and a first decoding network. The noise reduction unit 2402 is specifically configured to:
use the first encoding network and the second encoding network to perform feature extraction on the i-th target user's registered speech signal and a first noise signal respectively, to obtain the feature vector of the i-th target user's registered speech signal and the feature vector of the first noise signal, where the first noise signal is the first noisy speech signal that does not contain the speech signals of the 1st to (i-1)-th target users; obtain a first feature vector according to the feature vector of the i-th target user's registered speech signal and the feature vector of the first noise signal; obtain a second feature vector according to the TCN and the first feature vector; and obtain the i-th target user's noise-reduced speech signal and a second noise signal according to the first decoding network and the second feature vector, where the second noise signal is the first noisy speech signal that does not contain the speech signals of the 1st to i-th target users.
In a feasible embodiment, for the speech-related data of the M target users, the related data of each target user includes a registered speech signal of that target user, the registered speech signal of target user A being a speech signal of target user A collected in an environment whose noise decibel value is lower than a preset value; the speech noise reduction model includes M first encoding networks, a second encoding network, a TCN, a first decoding network and M third decoding networks; and in terms of performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the target speech-related data to obtain the noise-reduced speech signal of the target user and the interference noise signal, the noise reduction unit 2402 is specifically configured to:
perform feature extraction on the registered speech signals of the M target users by the M first encoding networks respectively, to obtain feature vectors of the registered speech signals of the M target users; perform feature extraction on the first noisy speech signal by the second encoding network, to obtain a feature vector of the first noisy speech signal; obtain a first feature vector from the feature vectors of the registered speech signals of the M target users and the feature vector of the first noisy speech signal; obtain a second feature vector from the TCN and the first feature vector; obtain the noise-reduced speech signals of the M target users from each of the M third decoding networks, the second feature vector and the feature vector output by the first encoding network corresponding to that third decoding network; and obtain the interference noise signal from the first decoding network, the second feature vector and the feature vector of the first noisy speech signal.
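A data-flow sketch of the M-user variant above, with one first encoding network and one third decoding network per target user. All networks are placeholder random linear maps, an assumption for illustration; only the wiring (per-user enrolment features fused with noisy-speech features, a shared TCN, per-user decoders plus a first decoder for the interference noise) follows the description.

```python
import numpy as np

rng = np.random.default_rng(1)
M, frame, feat = 2, 160, 32  # two target users, toy dimensions (placeholders)

def net(x, w):
    """Placeholder for every encoding network in this sketch."""
    return np.maximum(w @ x, 0.0)

w_enc1 = [rng.standard_normal((feat, frame)) * 0.1 for _ in range(M)]  # M first encoding networks
w_enc2 = rng.standard_normal((feat, frame)) * 0.1                      # second encoding network
w_tcn = rng.standard_normal((feat, (M + 1) * feat)) * 0.1              # TCN placeholder
w_dec3 = [rng.standard_normal((frame, 2 * feat)) * 0.1 for _ in range(M)]  # M third decoding networks
w_dec1 = rng.standard_normal((frame, 2 * feat)) * 0.1                  # first decoding network

registered = [rng.standard_normal(frame) for _ in range(M)]  # registered speech per user
noisy = rng.standard_normal(frame)                           # first noisy speech signal

f_reg = [net(registered[i], w_enc1[i]) for i in range(M)]  # enrolment feature vectors
f_noisy = net(noisy, w_enc2)                               # noisy-speech feature vector
f1 = np.concatenate(f_reg + [f_noisy])                     # first feature vector
f2 = np.tanh(w_tcn @ f1)                                   # second feature vector

# each third decoding network sees f2 plus the output of its own first encoding network
speech = [w_dec3[i] @ np.concatenate([f2, f_reg[i]]) for i in range(M)]
# the first decoding network sees f2 plus the noisy-speech feature vector
noise = w_dec1 @ np.concatenate([f2, f_noisy])

print(len(speech), speech[0].shape, noise.shape)
```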
In a feasible embodiment, the related data of the target user includes a VPU signal of the target user, and the speech noise reduction model includes a preprocessing module, a third encoding network, a GRU, a second decoding network and a post-processing module.
In terms of performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the target speech-related data to obtain the noise-reduced speech signal of the target user, the noise reduction unit 2402 is specifically configured to:
perform time-frequency transformation on the first noisy speech signal and the VPU signal of the target user respectively through the preprocessing module, to obtain a first frequency-domain signal of the first noisy speech signal and a second frequency-domain signal of the VPU signal; fuse the first frequency-domain signal and the second frequency-domain signal to obtain a first fused frequency-domain signal; process the first fused frequency-domain signal successively through the third encoding network, the GRU and the second decoding network, to obtain a mask of a third frequency-domain signal of the speech signal of the target user; post-process the first frequency-domain signal through the post-processing module according to the mask of the third frequency-domain signal, to obtain the third frequency-domain signal; and perform frequency-time transformation on the third frequency-domain signal, to obtain the noise-reduced speech signal of the target user; where both the third encoding network and the second decoding network are implemented based on convolutional layers and frequency transformation blocks (FTBs).
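The mask-based pipeline above can be sketched in a few lines. Here a single-frame FFT stands in for the preprocessing module, and a fixed sigmoid of the fused magnitudes stands in for the third encoding network, the GRU, and the second decoding network; both substitutions are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
frame = 256
noisy = rng.standard_normal(frame)  # first noisy speech signal (one frame)
vpu = rng.standard_normal(frame)    # VPU (bone-conduction) signal of the target user

# preprocessing module: time-frequency transformation of both signals
X1 = np.fft.rfft(noisy)             # first frequency-domain signal
X2 = np.fft.rfft(vpu)               # second frequency-domain signal
fused = np.stack([X1, X2])          # first fused frequency-domain signal

# stand-in for encoder -> GRU -> decoder: a real-valued mask in [0, 1]
mask = 1.0 / (1.0 + np.exp(-np.abs(fused).mean(axis=0)))  # mask of the third frequency-domain signal

# post-processing module: apply the mask to the first frequency-domain signal
X3 = mask * X1                          # third frequency-domain signal
denoised = np.fft.irfft(X3, n=frame)    # noise-reduced speech signal of the target user

print(denoised.shape)
```

In the disclosed model the mask is predicted per time-frequency bin by the trained network; the sketch only shows that the mask is applied to the noisy spectrum before the inverse transform.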
In a feasible embodiment, the noise reduction unit 2402 is specifically configured to:
process the first fused frequency-domain signal successively through the third encoding network, the GRU and the second decoding network, to further obtain a mask of the first frequency-domain signal; post-process the first frequency-domain signal through the post-processing module according to the mask of the first frequency-domain signal, to obtain a fourth frequency-domain signal of the interference noise signal; and perform frequency-time transformation on the fourth frequency-domain signal, to obtain the interference noise signal.
In a feasible embodiment, the related data of target user A includes a VPU signal of target user A, and the speech noise reduction model includes a preprocessing module, a third encoding network, a GRU, a second decoding network and a post-processing module; in terms of performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the speech-related data of target user A to obtain the noise-reduced speech signal of target user A, the noise reduction unit 2402 is specifically configured to:
perform time-frequency transformation on the first noisy speech signal and the VPU signal of target user A respectively through the preprocessing module, to obtain the first frequency-domain signal of the first noisy speech signal and a ninth frequency-domain signal of the VPU signal of target user A; fuse the first frequency-domain signal and the ninth frequency-domain signal to obtain a second fused frequency-domain signal; process the second fused frequency-domain signal successively through the third encoding network, the GRU and the second decoding network, to obtain a mask of a tenth frequency-domain signal of the speech signal of target user A; post-process the first frequency-domain signal through the post-processing module according to the mask of the tenth frequency-domain signal, to obtain the tenth frequency-domain signal; and perform frequency-time transformation on the tenth frequency-domain signal, to obtain the noise-reduced speech signal of target user A;
where both the third encoding network and the second decoding network are implemented based on convolutional layers and FTBs.
In a feasible embodiment, the related data of the i-th target user among the M target users includes a VPU signal of the i-th target user, i is an integer greater than 0 and less than or equal to M, and the noise reduction unit 2402 is specifically configured to:
perform time-frequency transformation on the first noise signal and the VPU signal of the i-th target user through the preprocessing module, to obtain an eleventh frequency-domain signal of the first noise signal and a twelfth frequency-domain signal of the VPU signal of the i-th target user; fuse the eleventh frequency-domain signal and the twelfth frequency-domain signal to obtain a third fused frequency-domain signal, where the first noise signal is the noisy speech signal that does not contain the speech signals of the 1st to (i-1)-th target users; process the third fused frequency-domain signal successively through the third encoding network, the GRU and the second decoding network, to obtain a mask of a thirteenth frequency-domain signal of the speech signal of the i-th target user and a mask of the eleventh frequency-domain signal; post-process the eleventh frequency-domain signal through the post-processing module according to the mask of the thirteenth frequency-domain signal and the mask of the eleventh frequency-domain signal, to obtain the thirteenth frequency-domain signal and a fourteenth frequency-domain signal of the second noise signal; and perform frequency-time transformation on the thirteenth frequency-domain signal and the fourteenth frequency-domain signal, to obtain the noise-reduced speech signal of the i-th target user and the second noise signal, where the second noise signal is the first noisy speech signal that does not contain the speech signals of the 1st to i-th target users; and where both the third encoding network and the second decoding network are implemented based on convolutional layers and FTBs.
In a feasible embodiment, in terms of performing enhancement processing on the noise-reduced speech signal of the target user based on the speech enhancement coefficient of the target user to obtain the enhanced speech signal of the target user, the noise reduction unit 2402 is specifically configured to:
for any target user A among the M target users, perform enhancement processing on the noise-reduced speech signal of target user A based on the speech enhancement coefficient of target user A, to obtain an enhanced speech signal of target user A, where the ratio of the amplitude of the enhanced speech signal of target user A to the amplitude of the noise-reduced speech signal of target user A is the speech enhancement coefficient of target user A;
in terms of fusing the interference noise suppression signal with the enhanced speech signal of the target user to obtain the output signal, the noise reduction unit 2402 is specifically configured to:
fuse the enhanced speech signals of the M target users with the interference noise suppression signal, to obtain the output signal.
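A minimal sketch of the enhancement and fusion steps above: each noise-reduced signal is scaled by its user's speech enhancement coefficient (so the amplitude ratio equals that coefficient), the interference noise is scaled by a suppression coefficient, and the results are summed. The coefficient values below are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)
M, n = 2, 160
denoised = [rng.standard_normal(n) for _ in range(M)]  # per-user noise-reduced speech signals
interference = rng.standard_normal(n)                  # interference noise signal

gains = [2.0, 1.5]   # speech enhancement coefficients, one per target user (placeholders)
suppress = 0.2       # interference noise suppression coefficient (placeholder)

enhanced = [g * s for g, s in zip(gains, denoised)]    # enhanced speech signal per user
output = sum(enhanced) + suppress * interference       # fused output signal

# the amplitude ratio of enhanced to noise-reduced speech equals the enhancement coefficient
ratio = np.abs(enhanced[0]).max() / np.abs(denoised[0]).max()
print(round(ratio, 6))
```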
In a feasible embodiment, the related data of the target user includes the VPU signal of the target user, and the acquiring unit 2401 is further configured to acquire an in-ear sound signal of the target user;
in terms of performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the target speech-related data to obtain the noise-reduced speech signal of the target user, the noise reduction unit 2402 is specifically configured to:
perform time-frequency transformation on the first noisy speech signal and the in-ear sound signal respectively, to obtain the first frequency-domain signal of the first noisy speech signal and a fifth frequency-domain signal of the in-ear sound signal; obtain a covariance matrix of the first noisy speech signal and the in-ear sound signal from the VPU signal of the target user, the first frequency-domain signal and the fifth frequency-domain signal; obtain a first minimum variance distortionless response (MVDR) weight based on the covariance matrix; obtain a sixth frequency-domain signal of the first noisy speech signal and a seventh frequency-domain signal of the in-ear sound signal based on the first MVDR weight, the first frequency-domain signal and the fifth frequency-domain signal; obtain an eighth frequency-domain signal of the noise-reduced speech signal from the sixth frequency-domain signal and the seventh frequency-domain signal; and perform frequency-time transformation on the eighth frequency-domain signal, to obtain the noise-reduced speech signal of the target user.
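The MVDR step above computes a weight from a covariance matrix and applies it across the two channels. A minimal two-channel, single-frequency-bin sketch is given below; the steering vector `d` toward the target user is assumed known here, whereas in the embodiment the corresponding statistics would be derived from the VPU signal and the covariance matrix of the microphone and in-ear signals.

```python
import numpy as np

rng = np.random.default_rng(4)
C, T = 2, 500  # two channels (external mic, in-ear mic), T frames of one frequency bin

d = np.array([1.0 + 0.0j, 0.6 + 0.2j])  # assumed steering vector toward the target user
noise = rng.standard_normal((C, T)) + 1j * rng.standard_normal((C, T))
speech = d[:, None] * (rng.standard_normal(T) + 1j * rng.standard_normal(T))
X = speech + noise  # frequency-domain observations of both channels

Rn = (noise @ noise.conj().T) / T         # noise covariance matrix
Rn_inv = np.linalg.inv(Rn)
w = Rn_inv @ d / (d.conj() @ Rn_inv @ d)  # MVDR weight: minimize noise power, keep target undistorted

y = w.conj() @ X  # beamformed (noise-reduced) frequency-domain signal for this bin
print(y.shape)
```

The distortionless constraint of MVDR means the weight satisfies w^H d = 1, so the target component passes through unchanged while the noise power is minimized; a full implementation repeats this per frequency bin and then applies the inverse transform.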
Further, the noise reduction unit 2402 is further configured to:
obtain the interference noise signal from the first noisy speech signal according to the noise-reduced speech signal of the target user.
In a feasible embodiment, the related data of target user A includes the VPU signal of target user A, and the acquiring unit 2401 is further configured to acquire an in-ear sound signal of target user A;
in terms of performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the speech-related data of target user A to obtain the noise-reduced speech signal of target user A, the noise reduction unit 2402 is specifically configured to:
perform time-frequency transformation on the first noisy speech signal and the in-ear sound signal of target user A respectively, to obtain the first frequency-domain signal of the first noisy speech signal and a fifteenth frequency-domain signal of the in-ear sound signal of target user A; obtain a covariance matrix of the first noisy speech signal and the in-ear sound signal of target user A from the VPU signal of target user A, the first frequency-domain signal and the fifteenth frequency-domain signal; obtain a second MVDR weight based on the covariance matrix; obtain a sixteenth frequency-domain signal of the first noisy speech signal and a seventeenth frequency-domain signal of the in-ear sound signal of target user A based on the second MVDR weight, the first frequency-domain signal and the fifteenth frequency-domain signal; obtain an eighteenth frequency-domain signal of the noise-reduced speech signal of target user A from the sixteenth frequency-domain signal and the seventeenth frequency-domain signal; and perform frequency-time transformation on the eighteenth frequency-domain signal, to obtain the noise-reduced speech signal of target user A.
In a feasible embodiment, the acquiring unit 2401 is further configured to:
acquire a first noise segment and a second noise segment of the environment in which the terminal device is located, the first noise segment and the second noise segment being temporally consecutive noise segments; and acquire a signal-to-noise ratio (SNR) and a sound pressure level (SPL) of the first noise segment.
The terminal device 2400 further includes:
a determining unit 2403, configured to: if the SNR of the first noise segment is greater than a first threshold and the SPL of the first noise segment is greater than a second threshold, extract a first temporary feature vector of the first noise segment; perform noise reduction processing on the second noise segment based on the first temporary feature vector, to obtain a second noise-reduced segment; perform damage assessment based on the second noise-reduced segment and the second noise segment, to obtain a first damage score; and if the first damage score is not greater than a third threshold, enter PNR mode.
In terms of acquiring the first noisy speech signal, the acquiring unit 2401 is specifically configured to:
determine the first noisy speech signal from a noise signal generated after the first noise segment, where the feature vector of the registered speech signal includes the first temporary feature vector.
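The gating logic of the determining unit 2403 described above can be summarized as a small decision function; the threshold values below are illustrative placeholders, not the first, second, or third thresholds disclosed here.

```python
def should_enter_pnr(snr_db, spl_db, damage_score,
                     snr_thresh=10.0, spl_thresh=40.0, damage_thresh=0.5):
    """Enter PNR mode only if the first noise segment is clean and loud enough
    to yield a usable temporary feature vector, and denoising the second
    segment with that vector causes at most the allowed damage score."""
    if snr_db <= snr_thresh or spl_db <= spl_thresh:
        return False  # cannot extract a reliable first temporary feature vector
    return damage_score <= damage_thresh

print(should_enter_pnr(15.0, 50.0, 0.3))  # clean, loud, low damage
print(should_enter_pnr(5.0, 50.0, 0.3))   # SNR too low
```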
In a feasible embodiment, if the first damage score is not greater than the third threshold, the determining unit 2403 is further configured to:
issue first prompt information through the terminal device, the first prompt information being used to prompt whether to make the terminal device enter PNR mode; and enter PNR mode only after an operation instruction of the target user agreeing to enter PNR mode is detected.
In a feasible embodiment, the acquiring unit 2401 is further configured to acquire a second noisy speech signal when it is detected that the terminal device is being used again;
the noise reduction unit 2402 is further configured to: when the SNR of the second noisy speech signal is lower than a fourth threshold, perform noise reduction processing on the second noisy speech signal according to the first temporary feature vector, to obtain a noise-reduced speech signal of the current user;
the determining unit 2403 is further configured to: perform damage assessment based on the noise-reduced speech signal of the current user and the second noisy speech signal, to obtain a second damage score; when the second damage score is not greater than a fifth threshold, issue second prompt information through the terminal device, the second prompt information being used to prompt the current user that the terminal device can enter PNR mode; after an operation instruction of the current user agreeing to enter PNR mode is detected, make the terminal device enter PNR mode to perform noise reduction processing on a third noisy speech signal, the third noisy speech signal being acquired after the second noisy speech signal; and after an operation instruction of the current user refusing to enter PNR mode is detected, perform noise reduction processing on the third noisy speech signal in a non-PNR mode.
In a feasible embodiment, the acquiring unit 2401 is further configured to: if the SNR of the first noise segment is not greater than the first threshold or the SPL of the first noise segment is not greater than the second threshold, and the terminal device has stored a reference temporary voiceprint feature vector, acquire a third noise segment;
the noise reduction unit 2402 is further configured to perform noise reduction processing on the third noise segment according to the reference temporary voiceprint feature vector, to obtain a third noise-reduced segment;
the determining unit 2403 is further configured to: perform damage assessment based on the third noise segment and the third noise-reduced segment, to obtain a third damage score; if the third damage score is greater than a sixth threshold and the SNR of the third noise segment is less than a seventh threshold, or the third damage score is greater than an eighth threshold and the SNR of the third noise segment is not less than the seventh threshold, issue third prompt information through the terminal device, the third prompt information being used to prompt the current user that the terminal device can enter PNR mode; after an operation instruction of the current user agreeing to enter PNR mode is detected, make the terminal device enter PNR mode to perform noise reduction processing on a fourth noisy speech signal; and after an operation instruction of the current user refusing to enter PNR mode is detected, perform noise reduction processing on the fourth noisy speech signal in a non-PNR mode, where the fourth noisy speech signal is determined from a noise signal generated after the third noise segment.
In a feasible embodiment, the acquiring unit 2401 is further configured to: acquire a first noise segment and a second noise segment of the environment in which the terminal device 2400 is located, the first noise segment and the second noise segment being temporally consecutive noise segments; and acquire a signal collected by a microphone array of an auxiliary device of the terminal device 2400 for the environment in which the terminal device 2400 is located.
The terminal device 2400 further includes:
a determining unit 2403, configured to: calculate a direction of arrival (DOA) and an SPL of the first noise segment using the collected signal; if the DOA of the first noise segment is greater than a ninth threshold and less than a tenth threshold, and the SPL of the first noise segment is greater than an eleventh threshold, extract a second temporary feature vector of the first noise segment, and perform noise reduction processing on the second noise segment based on the second temporary feature vector, to obtain a third noise-reduced segment; perform damage assessment based on the third noise-reduced segment and the second noise segment, to obtain a fourth damage score; and if the fourth damage score is greater than a twelfth threshold, enter PNR mode.
In terms of acquiring the first noisy speech signal, the acquiring unit 2401 is specifically configured to:
determine the first noisy speech signal from a noise signal generated after the first noise segment, where the feature vector of the registered speech signal includes the second temporary feature vector.
In a feasible embodiment, if the fourth damage score is not greater than the twelfth threshold, the determining unit 2403 is further configured to:
issue fourth prompt information through the terminal device 2400, the fourth prompt information being used to prompt whether to make the terminal device 2400 enter PNR mode; and enter PNR mode only after an operation instruction of the target user agreeing to enter PNR mode is detected.
In a feasible embodiment, the terminal device 2400 further includes:
a detection unit 2404, configured to: when it is detected that the terminal device is in a handheld call state, not enter PNR mode;
when it is detected that the terminal device is in a hands-free call state, enter PNR mode, where the target user is the owner of the terminal device or the user who is using the terminal device;
when it is detected that the terminal device is in a video call, enter PNR mode, where the target user is the owner of the terminal device or the user closest to the terminal device;
when it is detected that the terminal device is connected to a headset for a call, enter PNR mode, where the target user is the user wearing the headset, and the first noisy speech signal and the target speech-related data are collected through the headset; or
when it is detected that the terminal device is connected to a smart large-screen device, a smart watch or an in-vehicle device, enter PNR mode, where the target user is the owner of the terminal device or the user who is using the terminal device, and the first noisy speech signal and the target speech-related data are collected by the audio collection hardware of the smart large-screen device, the smart watch or the in-vehicle device.
In a feasible embodiment, the acquiring unit 2401 is further configured to acquire a decibel value of an audio signal of the current environment.
The terminal device 2400 further includes:
a control unit 2405, configured to: if the decibel value of the audio signal of the current environment exceeds a preset decibel value, determine whether the PNR function corresponding to the function or application started by the terminal device is enabled; and if it is not enabled, enable the PNR function corresponding to the application started by the terminal device and enter PNR mode.
In a feasible embodiment, the terminal device 2400 includes a display screen 2408, and the display screen 2408 includes a plurality of display areas,
where each of the plurality of display areas displays a label and a corresponding function key, the function key being used to turn on and off the PNR function of the application indicated by its corresponding label.
In a feasible embodiment, when voice data is transmitted between the terminal device and another terminal device, the terminal device 2400 further includes:
a receiving unit 2406, configured to receive a speech enhancement request sent by the other terminal device, the speech enhancement request being used to instruct the terminal device to enable the PNR function of the call function;
a control unit 2405, configured to: in response to the speech enhancement request, issue third prompt information through the terminal device, the third prompt information being used to prompt whether to make the terminal device enable the PNR function of the call function; and when the target user's confirmation to enable the PNR function of the call function on the terminal device is detected, enable the PNR function of the call function and enter PNR mode;
a sending unit 2407, configured to send a speech enhancement response message to the other terminal device, the speech enhancement response message being used to indicate that the terminal device has enabled the PNR function of the call function.
In a feasible embodiment, when the terminal device starts a video call or video recording function, the display interface of the terminal device includes a first area and a second area; the first area is used to display the video call content or the video recording content, and the second area is used to display M controls and M corresponding labels, the M controls corresponding one-to-one to the M target users; each of the M controls includes a slider button and a slider bar, and the slider button is slid along the slider bar to adjust the speech enhancement coefficient of the target user indicated by the label corresponding to that control.
In a feasible embodiment, when the terminal device starts a video call or video recording function, the display interface of the terminal device includes a first area, the first area being used to display the video call content or the video recording content; the terminal device 2400 further includes:
a control unit 2405, configured to: when an operation on any object in the video call content or the video recording content is detected, display a control corresponding to that object in the first area, the control including a slider button and a slider bar, the slider button being slid along the slider bar to adjust the speech enhancement coefficient of that object.
In a feasible embodiment, when the terminal device is an intelligent interactive device, the target speech-related data is a speech signal of the target user containing a wake-up word, and the noisy speech signal is an audio signal of the target user containing a command word.
It should be noted that the above units (the acquiring unit 2401, the noise reduction unit 2402, the determining unit 2403, the detection unit 2404, the control unit 2405, the receiving unit 2406, the sending unit 2407 and the display screen 2408) are configured to perform the relevant steps of the above method.
In this embodiment, the terminal device 2400 is presented in the form of units. A "unit" here may refer to an application-specific integrated circuit (ASIC), a processor and a memory executing one or more software or firmware programs, an integrated logic circuit, and/or other devices that can provide the above functions. In addition, the above acquiring unit 2401, noise reduction unit 2402, determining unit 2403, detection unit 2404 and control unit 2405 may be implemented by the processor 2601 of the terminal device shown in FIG. 26.
Referring to FIG. 25, FIG. 25 is a schematic structural diagram of another terminal device provided by an embodiment of this application. As shown in FIG. 25, the terminal device 2500 includes:
a sensor collection unit 2501, configured to collect the noisy speech signal as well as the registered speech signal of the target user, the VPU signal, video images, depth images and other information that can be used to identify the target user;
a storage unit 2502, configured to store noise reduction parameters (including the speech enhancement coefficient and the interference noise suppression coefficient of the target user) as well as registered target users and their speech feature information;
UI交互单元2504,用于接收用户的交互信息并传送给降噪控制单元2506,将降噪控制单元2506反馈的信息反馈给本端用户。The UI interaction unit 2504 is configured to receive user interaction information and send it to the noise reduction control unit 2506, and feed back the information fed back by the noise reduction control unit 2506 to the local user.
通信单元2505,用于发送和接收与对端用户的交互信息,可选地,也可以传输对端带噪语音信号及对端用户的语音注册信息。The communication unit 2505 is configured to send and receive interaction information with the peer user, and optionally, transmit a noisy voice signal of the peer and voice registration information of the peer user.
处理单元2503包括降噪控制单元2506和PNR处理单元2507,其中,The processing unit 2503 includes a noise reduction control unit 2506 and a PNR processing unit 2507, wherein,
降噪控制单元2506,用于根据本端和对端接收到的交互信息及存储单元存储的信息,对PNR降噪参数进行配置,包括但不限于确定进行语音增强的用户或目标用户,语音增强系数和干扰噪声抑制系数,是否开启降噪功能以及降噪方式。The noise reduction control unit 2506 is configured to configure the PNR noise reduction parameters according to the interaction information received by the local end and the peer end and the information stored in the storage unit, including but not limited to determining the user or target user for voice enhancement, voice enhancement coefficient and interference noise suppression coefficient, whether to enable the noise reduction function and the noise reduction method.
PNR处理单元2507,用于根据配置好的降噪参数对传感器采集单元采集到的带噪语音信号进行处理,获得增强音频信号,也就是目标用户的增强语音信号。The PNR processing unit 2507 is configured to process the noisy speech signal collected by the sensor collection unit according to the configured noise reduction parameters to obtain an enhanced audio signal, that is, an enhanced speech signal of the target user.
在此需要指出的是,PNR处理单元2507的具体功能可以参见降噪单元2402的功能的相关描述。It should be noted here that, for the specific functions of the PNR processing unit 2507, reference may be made to the relevant description of the functions of the noise reduction unit 2402.
如图26所示终端设备2600可以以图26中的结构来实现,该终端设备2600包括至少一个处理器2601,至少一个存储器2602、至少一个显示屏2604以及至少一个通信接口2603。所述处理器2601、所述存储器2602、显示屏2604和所述通信接口2603通过所述通信总线连接并完成相互间的通信。As shown in FIG. 26 , the terminal device 2600 can be implemented with the structure in FIG. 26 , and the terminal device 2600 includes at least one processor 2601 , at least one memory 2602 , at least one display screen 2604 and at least one communication interface 2603 . The processor 2601 , the memory 2602 , the display screen 2604 and the communication interface 2603 are connected through the communication bus and complete mutual communication.
处理器2601可以是通用中央处理器(CPU),微处理器,特定应用集成电路 (application-specific integrated circuit,ASIC),或一个或多个用于控制以上方案程序执行的集成电路。The processor 2601 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the programs of the above solutions.
通信接口2603,用于与其他设备或通信网络通信,如以太网,无线接入网(RAN),无线局域网(Wireless Local Area Networks,WLAN)等。The communication interface 2603 is used to communicate with other devices or communication networks, such as Ethernet, radio access network (RAN), wireless local area network (Wireless Local Area Networks, WLAN), etc.
存储器2602可以是只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)或者可存储信息和指令的其他类型的动态存储设备,也可以是电可擦可编程只读存储器(Electrically Erasable Programmable Read-Only Memory,EEPROM)、只读光盘(Compact Disc Read-Only Memory,CD-ROM)或其他光盘存储、光碟存储(包括压缩光碟、激光碟、光碟、数字通用光碟、蓝光光碟等)、磁盘存储介质或者其他磁存储设备、或者能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其他介质,但不限于此。存储器可以是独立存在,通过总线与处理器相连接。存储器也可以和处理器集成在一起。The memory 2602 may be a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, or a random access memory (RAM) or another type of dynamic storage device capable of storing information and instructions; it may also be an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory may exist independently and be connected to the processor through a bus, or the memory may be integrated with the processor.
显示屏2604可以是LCD显示屏、LED显示屏、OLED显示屏、3D显示屏或者其他显示屏。The display screen 2604 may be an LCD display screen, an LED display screen, an OLED display screen, a 3D display screen or other display screens.
其中,所述存储器2602用于存储执行以上方案的应用程序代码,并由处理器2601来控制执行,在显示屏上显示上述方法实施例所述的功能按键、标签等。所述处理器2601用于执行所述存储器2602中存储的应用程序代码。Wherein, the memory 2602 is used to store the application program codes for executing the above solutions, and the execution is controlled by the processor 2601, and the function buttons, labels, etc. described in the above method embodiments are displayed on the display screen. The processor 2601 is configured to execute application program codes stored in the memory 2602 .
存储器2602存储的代码可执行以上提供的任一种语音增强方法,比如:在终端设备进入PNR模式后,获取带噪语音信号和目标语音相关数据,其中,带噪语音信号包含干扰噪声信号与目标用户的语音信号;目标语音相关数据用于指示目标用户的语音特征;根据目标语音相关数据通过已训练好的语音降噪模型对第一带噪语音信号进行降噪处理,得到目标用户的降噪语音信号;其中,语音降噪模型是基于神经网络实现的。The code stored in the memory 2602 can execute any of the speech enhancement methods provided above, for example: after the terminal device enters the PNR mode, acquiring a noisy speech signal and target-speech-related data, where the noisy speech signal contains an interference noise signal and the speech signal of the target user, and the target-speech-related data is used to indicate the speech characteristics of the target user; and performing noise reduction processing on the first noisy speech signal through a trained speech noise reduction model according to the target-speech-related data, to obtain a noise-reduced speech signal of the target user, where the speech noise reduction model is implemented based on a neural network.
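The flow described above (acquire the noisy signal and the target-speech-related data, then denoise through the model) can be sketched as follows. This is a minimal illustration only: the function names, the band-index form of the "target data", and the `toy_model` stand-in for the trained neural speech noise reduction model are all assumptions, not the actual implementation.

```python
import numpy as np

def speech_enhance(noisy, target_voice_data, denoise_model):
    """PNR flow sketched from the description: the noisy signal contains
    interference noise plus the target user's speech; the model uses the
    target-speech-related data to return the noise-reduced speech."""
    return denoise_model(noisy, target_voice_data)

def toy_model(noisy, band):
    """Toy stand-in model: keep only the frequency bins indicated by the
    target data. Purely illustrative, not the neural network of the patent."""
    spec = np.fft.rfft(noisy)
    mask = np.zeros_like(spec)
    mask[band] = 1.0
    return np.fft.irfft(mask * spec, n=len(noisy))

t = np.arange(256) / 256.0
target = np.sin(2 * np.pi * 8 * t)        # "target speech" at bin 8
noise = 0.3 * np.sin(2 * np.pi * 40 * t)  # "interference noise" at bin 40
denoised = speech_enhance(target + noise, [8], toy_model)
```

In this toy setup the interference occupies a disjoint frequency bin, so the masked output recovers the "target speech" exactly; a real model would instead estimate the separation from the registered voiceprint.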
本申请实施例还提供一种计算机存储介质,其中,该计算机存储介质可存储有程序,该程序执行时包括上述方法实施例中记载的任何一种语音增强方法的部分或全部步骤。The embodiment of the present application also provides a computer storage medium, wherein the computer storage medium can store a program, and the program includes some or all steps of any speech enhancement method described in the above method embodiments when executed.
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请并不受所描述的动作顺序的限制,因为依据本申请,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作和模块并不一定是本申请所必须的。It should be noted that, for the foregoing method embodiments, for ease of description, they are all expressed as a series of action combinations, but those skilled in the art should know that the present application is not limited by the described action sequence, because, according to the present application, certain steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present application.
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。In the foregoing embodiments, the descriptions of each embodiment have their own emphases, and for parts not described in detail in a certain embodiment, reference may be made to relevant descriptions of other embodiments.
在本申请所提供的几个实施例中,应该理解到,所揭露的装置,可通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性或其它的形式。In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are only illustrative; the division of the units is only a division of logical functions, and there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be in electrical or other forms.
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储器中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储器中,包括若干指令用以使得一台计算机设备(可为个人计算机、服务器或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储器包括:U盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
本领域普通技术人员可以理解上述实施例的各种方法中的全部或部分步骤是可以通过程序来指令相关的硬件来完成,该程序可以存储于一计算机可读存储器中,存储器可以包括:闪存盘、只读存储器(英文:Read-Only Memory,简称:ROM)、随机存取器(英文:Random Access Memory,简称:RAM)、磁盘或光盘等。Those of ordinary skill in the art can understand that all or part of the steps in the various methods of the above-mentioned embodiments can be completed by instructing related hardware through a program, and the program can be stored in a computer-readable memory, and the memory can include: a flash disk , Read-only memory (English: Read-Only Memory, referred to as: ROM), random access device (English: Random Access Memory, referred to as: RAM), magnetic disk or optical disc, etc.
以上对本申请实施例进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的一般技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本申请的限制。The embodiments of the present application have been described in detail above, and specific examples have been used herein to illustrate the principles and implementations of the present application. The descriptions of the above embodiments are only intended to help understand the method and core idea of the present application. Meanwhile, a person of ordinary skill in the art may make changes to the specific implementations and the application scope based on the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (69)

  1. 一种语音增强方法,所述方法应用于终端设备,其特征在于,包括:A voice enhancement method, the method is applied to a terminal device, characterized in that it comprises:
    在所述终端设备进入特定人降噪PNR模式后,获取第一带噪语音信号和目标语音相关数据,其中,所述第一带噪语音信号包含干扰噪声信号与目标用户的语音信号;所述目标语音相关数据用于指示所述目标用户的语音特征;After the terminal device enters the specific person noise reduction PNR mode, the first noisy speech signal and target speech related data are acquired, wherein the first noisy speech signal includes an interference noise signal and a target user's speech signal; The target voice-related data is used to indicate the voice characteristics of the target user;
    根据所述目标语音相关数据通过语音降噪模型对所述第一带噪语音信号进行降噪处理,以得到所述目标用户的降噪语音信号;其中,所述语音降噪模型是基于神经网络实现的。performing noise reduction processing on the first noisy speech signal through a speech noise reduction model according to the target-speech-related data, to obtain a noise-reduced speech signal of the target user, wherein the speech noise reduction model is implemented based on a neural network.
  2. 根据权利要求1所述的方法,其特征在于,所述方法还包括:The method according to claim 1, further comprising:
    获取所述目标用户的语音增强系数;Acquiring the speech enhancement coefficient of the target user;
    基于所述目标用户的语音增强系数对所述目标用户的降噪语音信号进行增强处理,以得到所述目标用户的增强语音信号,其中,所述目标用户的增强语音信号的幅度与所述目标用户的降噪语音信号的幅度的比值为所述语音增强系数。performing enhancement processing on the noise-reduced speech signal of the target user based on the speech enhancement coefficient of the target user, to obtain an enhanced speech signal of the target user, wherein a ratio of the amplitude of the enhanced speech signal of the target user to the amplitude of the noise-reduced speech signal of the target user is the speech enhancement coefficient.
  3. 根据权利要求2所述的方法,其特征在于,通过所述降噪处理还得到所述干扰噪声信号;所述方法还包括:The method according to claim 2, wherein the interference noise signal is also obtained through the noise reduction processing; the method further comprises:
    获取干扰噪声抑制系数;Obtain the interference noise suppression coefficient;
    基于所述干扰噪声抑制系数对所述干扰噪声信号进行抑制处理,以得到干扰噪声抑制信号,其中,所述干扰噪声抑制信号的幅度与所述干扰噪声信号的幅度的比值为所述干扰噪声抑制系数;The interference noise signal is suppressed based on the interference noise suppression coefficient to obtain an interference noise suppression signal, wherein the ratio of the amplitude of the interference noise suppression signal to the amplitude of the interference noise signal is the interference noise suppression coefficient;
    将所述干扰噪声抑制信号与所述目标用户的增强语音信号进行融合,以得到输出信号。The interference noise suppression signal is fused with the enhanced speech signal of the target user to obtain an output signal.
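Claims 2 and 3 define the two coefficients as amplitude ratios and the output as a fusion of the scaled components. A minimal numeric sketch (the function and variable names are assumptions, and the fusion is modeled here as a simple sum):

```python
import numpy as np

def apply_coefficients(denoised_speech, interference_noise, enh_coef, sup_coef):
    """Enhance the target speech and suppress the interference, then fuse.
    enh_coef = ratio of enhanced to noise-reduced speech amplitude (claim 2);
    sup_coef = ratio of suppressed to original noise amplitude (claim 3)."""
    enhanced = enh_coef * denoised_speech
    suppressed = sup_coef * interference_noise
    return enhanced + suppressed  # fused output signal

speech = np.array([0.2, -0.4, 0.6])
noise = np.array([0.1, 0.1, -0.1])
out = apply_coefficients(speech, noise, enh_coef=2.0, sup_coef=0.1)
```

With `enh_coef > 1` and `sup_coef < 1`, the target user's speech dominates the output while a controlled residue of background noise is kept, which is the behavior the claims describe.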
  4. 根据权利要求1所述的方法,其特征在于,通过所述降噪处理还得到所述干扰噪声信号;所述方法还包括:The method according to claim 1, wherein the interference noise signal is also obtained through the noise reduction processing; the method further comprises:
    获取干扰噪声抑制系数;Obtain the interference noise suppression coefficient;
    基于所述干扰噪声抑制系数对所述干扰噪声信号进行抑制处理,以得到干扰噪声抑制信号,其中,所述干扰噪声抑制信号的幅度与所述干扰噪声信号的幅度的比值为所述干扰噪声抑制系数;The interference noise signal is suppressed based on the interference noise suppression coefficient to obtain an interference noise suppression signal, wherein the ratio of the amplitude of the interference noise suppression signal to the amplitude of the interference noise signal is the interference noise suppression coefficient;
    将所述干扰噪声抑制信号与所述目标用户的降噪语音信号进行融合,以得到输出信号。The interference noise suppressed signal is fused with the noise-reduced speech signal of the target user to obtain an output signal.
  5. 根据权利要求2所述的方法,其特征在于,所述目标用户包括M个,所述目标语音相关数据包括所述M个目标用户的语音相关数据,所述目标用户的降噪语音信号包括所述M个目标用户的降噪语音信号,所述目标用户的语音增强系数包括所述M个目标用户的语音增强系数,所述M为大于1的整数;The method according to claim 2, wherein there are M target users, the target-speech-related data comprises speech-related data of the M target users, the noise-reduced speech signal of the target user comprises noise-reduced speech signals of the M target users, the speech enhancement coefficient of the target user comprises speech enhancement coefficients of the M target users, and M is an integer greater than 1;
    所述根据所述目标语音相关数据通过语音降噪模型对所述第一带噪语音信号进行降噪处理,以得到目标用户的降噪语音信号,包括:The step of performing noise reduction processing on the first noisy speech signal through a speech noise reduction model according to the target speech-related data to obtain a noise-reduced speech signal of the target user includes:
    对于所述M个目标用户中任一目标用户A,根据所述目标用户A的语音相关数据通过所述语音降噪模型对所述第一带噪语音信号进行降噪处理,以得到所述目标用户A的降噪语音信号;for any target user A among the M target users, performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the speech-related data of the target user A, to obtain a noise-reduced speech signal of the target user A;
    所述基于所述目标用户的语音增强系数对所述目标用户的降噪语音信号进行增强处理,以得到所述目标用户的增强语音信号,包括:The step of enhancing the noise-reduced speech signal of the target user based on the speech enhancement coefficient of the target user to obtain the enhanced speech signal of the target user includes:
    基于所述目标用户A的语音增强系数对所述目标用户A的降噪语音信号进行增强处理,以得到所述目标用户A的增强语音信号;所述目标用户A的增强语音信号的幅度与所述目标用户A的降噪语音信号的幅度的比值为所述目标用户A的语音增强系数;performing enhancement processing on the noise-reduced speech signal of the target user A based on the speech enhancement coefficient of the target user A, to obtain an enhanced speech signal of the target user A, wherein a ratio of the amplitude of the enhanced speech signal of the target user A to the amplitude of the noise-reduced speech signal of the target user A is the speech enhancement coefficient of the target user A;
    所述方法还包括:The method also includes:
    基于所述M个目标用户的增强语音信号得到输出信号。An output signal is obtained based on the enhanced speech signals of the M target users.
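The per-user enhancement of claim 5 can be sketched as a loop over the M target users, each with its own coefficient, with the M enhanced signals combined into the output (summation is an assumed fusion rule, and all names here are illustrative):

```python
import numpy as np

def multi_user_output(denoised_signals, enh_coefs, suppressed_noise=None):
    """Claim-5 sketch: enhance each target user's noise-reduced speech with
    that user's own speech enhancement coefficient, then obtain the output
    from the M enhanced signals (optionally fused with suppressed noise)."""
    out = sum(c * s for c, s in zip(enh_coefs, denoised_signals))
    if suppressed_noise is not None:
        out = out + suppressed_noise
    return out

s1 = np.array([1.0, 0.0])  # noise-reduced speech of target user 1
s2 = np.array([0.0, 1.0])  # noise-reduced speech of target user 2
out = multi_user_output([s1, s2], [2.0, 3.0])
```

Giving each user an independent coefficient is what lets, for example, a nearer talker be amplified less than a distant one in the same call.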
  6. 根据权利要求3所述的方法,其特征在于,所述目标用户包括M个,所述目标语音相关数据包括所述M个目标用户的语音相关数据,所述目标用户的降噪语音信号包括所述M个目标用户的降噪语音信号,所述M为大于1的整数;The method according to claim 3, wherein there are M target users, the target-speech-related data comprises speech-related data of the M target users, the noise-reduced speech signal of the target user comprises noise-reduced speech signals of the M target users, and M is an integer greater than 1;
    根据所述目标语音相关数据通过语音降噪模型对所述第一带噪语音信号进行降噪处理,得到目标用户的降噪语音信号和所述干扰噪声信号,包括:Perform noise reduction processing on the first noisy speech signal through a speech noise reduction model according to the target speech related data, to obtain the target user's noise-reduced speech signal and the interference noise signal, including:
    根据所述M个目标用户中第1个目标用户的语音相关数据通过所述语音降噪模型对所述第一带噪语音信号进行降噪处理,以得到所述第1个目标用户的降噪语音信号和不包含所述第1个目标用户的语音信号的第一带噪语音信号;performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the speech-related data of the 1st target user among the M target users, to obtain a noise-reduced speech signal of the 1st target user and a first noisy speech signal that does not contain the speech signal of the 1st target user;
    根据所述M个目标用户中第2个目标用户的语音相关数据通过所述语音降噪模型对所述不包含所述第1个目标用户的语音信号的第一带噪语音信号进行降噪处理,得到所述第2个目标用户的降噪语音信号和不包含所述第1个目标用户的语音信号和第2个目标用户的语音信号的第一带噪语音信号;performing noise reduction processing on the first noisy speech signal that does not contain the speech signal of the 1st target user through the speech noise reduction model according to the speech-related data of the 2nd target user among the M target users, to obtain a noise-reduced speech signal of the 2nd target user and a first noisy speech signal that contains neither the speech signal of the 1st target user nor the speech signal of the 2nd target user;
    重复上述过程,直至根据第M个目标用户的语音相关数据通过所述语音降噪模型对不包含所述第1至M-1个目标用户的语音信号的第一带噪语音信号进行降噪处理,得到所述第M个目标用户的降噪语音信号和所述干扰噪声信号。repeating the above process until noise reduction processing is performed, through the speech noise reduction model according to the speech-related data of the Mth target user, on the first noisy speech signal that does not contain the speech signals of the 1st to (M-1)th target users, to obtain a noise-reduced speech signal of the Mth target user and the interference noise signal.
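The iterative procedure of claim 6 amounts to repeatedly feeding the residual (the noisy signal minus the users extracted so far) back into the model; what remains after the Mth pass is the interference noise signal. The subtraction-based toy model below is an assumption used only to make the loop runnable:

```python
import numpy as np

def iterative_pnr(noisy, user_data, model):
    """Claim-6 sketch: extract user 1 from the noisy signal, user 2 from
    the residual, and so on; the final residual is the interference noise."""
    residual = noisy
    denoised = []
    for data in user_data:
        speech, residual = model(residual, data)
        denoised.append(speech)
    return denoised, residual

# Toy model: each user's "speech" is a known component (illustrative only).
components = {"u1": np.array([1.0, 0.0, 0.0]),
              "u2": np.array([0.0, 2.0, 0.0])}
toy = lambda x, key: (components[key], x - components[key])

ambient = np.array([0.0, 0.0, 0.5])
noisy = components["u1"] + components["u2"] + ambient
denoised, interference = iterative_pnr(noisy, ["u1", "u2"], toy)
```

The sequential form trades latency (M model passes) for a single-user model; claim 7 instead describes a joint, single-pass variant.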
  7. 根据权利要求3所述的方法,其特征在于,所述目标用户包括M个,所述目标语音相关数据包括所述M个目标用户的语音相关数据,所述目标用户的降噪语音信号包括所述M个目标用户的降噪语音信号,所述M为大于1的整数;The method according to claim 3, wherein there are M target users, the target-speech-related data comprises speech-related data of the M target users, the noise-reduced speech signal of the target user comprises noise-reduced speech signals of the M target users, and M is an integer greater than 1;
    根据所述目标语音相关数据通过所述语音降噪模型对所述第一带噪语音信号进行降噪处理,以得到所述目标用户的降噪语音信号和所述干扰噪声信号,包括:Performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the target speech related data, so as to obtain the noise-reduced speech signal of the target user and the interference noise signal, including:
    根据所述M个目标用户的语音相关数据通过所述语音降噪模型对所述第一带噪语音信号进行降噪处理,以得到所述M个目标用户的降噪语音信号和所述干扰噪声信号。performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the speech-related data of the M target users, to obtain the noise-reduced speech signals of the M target users and the interference noise signal.
  8. 根据权利要求1-4任一项所述的方法,其特征在于,所述目标用户包括M个,所述目标用户的相关数据包括所述目标用户的注册语音信号,所述语音降噪模型包括第一编码网络、第二编码网络、时间卷积网络TCN和第一解码网络;The method according to any one of claims 1-4, wherein there are M target users, the relevant data of the target user comprises a registered speech signal of the target user, and the speech noise reduction model comprises a first encoding network, a second encoding network, a temporal convolutional network (TCN), and a first decoding network;
    所述根据所述目标语音相关数据通过语音降噪模型对所述第一带噪语音信号进行降噪处理,以得到所述目标用户的降噪语音信号,包括:The step of performing noise reduction processing on the first noisy speech signal through a speech noise reduction model according to the target speech-related data to obtain a noise-reduced speech signal of the target user includes:
    利用所述第一编码网络和所述第二编码网络分别对所述目标用户的注册语音信号和所述第一带噪语音信号进行特征提取,以得到所述目标用户的注册语音信号的特征向量和所述第一带噪语音信号的特征向量;Using the first coding network and the second coding network to perform feature extraction on the registration speech signal of the target user and the first noisy speech signal respectively, so as to obtain a feature vector of the registration speech signal of the target user and the feature vector of the first noisy speech signal;
    根据所述目标用户的注册语音信号的特征向量和所述第一带噪语音信号的特征向量得到第一特征向量;Obtaining a first eigenvector according to the eigenvector of the registration voice signal of the target user and the eigenvector of the first noisy voice signal;
    根据所述TCN和所述第一特征向量得到第二特征向量;obtaining a second eigenvector according to the TCN and the first eigenvector;
    根据所述第一解码网络和所述第二特征向量得到所述目标用户的降噪语音信号。Obtaining the noise-reduced speech signal of the target user according to the first decoding network and the second feature vector.
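A shape-level numpy sketch of the claim-8 pipeline, with random matrices standing in for the trained first/second encoding networks, the TCN, and the first decoding network (all weights, dimensions, and the concatenation-based combination step are assumptions): the registration feature vector and the noisy-signal feature vectors are combined into the first feature vector, the TCN stand-in maps it to the second feature vector, and the decoder stand-in produces the noise-reduced output frames.

```python
import numpy as np

rng = np.random.default_rng(0)
D, T = 16, 32                           # feature dim and frame count (assumed)
enc1 = rng.standard_normal((D, 64))     # first encoding network (registered speech)
enc2 = rng.standard_normal((D, 64))     # second encoding network (noisy speech)
tcn = rng.standard_normal((2 * D, D))   # stand-in for the temporal convolutional network
dec1 = rng.standard_normal((64, D))     # first decoding network

def denoise(registered, noisy):
    f_reg = enc1 @ registered                        # registration feature vector (D,)
    f_noisy = enc2 @ noisy                           # noisy feature vectors (D, T)
    first = np.concatenate(
        [np.tile(f_reg[:, None], (1, T)), f_noisy])  # first feature vector (2D, T)
    second = tcn.T @ first                           # second feature vector (D, T)
    return dec1 @ second                             # noise-reduced frames (64, T)

out = denoise(rng.standard_normal(64), rng.standard_normal((64, T)))
```

Broadcasting the registration vector across all T frames is one simple way to condition every frame on the target user's voiceprint; claim 9 additionally decodes the interference noise from the same second feature vector.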
  9. 根据权利要求8所述的方法,其特征在于,所述方法还包括:The method according to claim 8, characterized in that the method further comprises:
    根据所述第一解码网络和所述第二特征向量还得到所述干扰噪声信号。The interference noise signal is also obtained according to the first decoding network and the second eigenvector.
  10. 根据权利要求5所述的方法,其特征在于,所述目标用户A的相关数据包括所述目标用户A的注册语音信号,所述语音降噪模型包括第一编码网络、第二编码网络、TCN和第一解码网络;The method according to claim 5, wherein the relevant data of the target user A includes the registered voice signal of the target user A, and the voice noise reduction model includes a first coding network, a second coding network, a TCN and the first decoding network;
    所述根据所述目标用户A的语音相关数据通过语音降噪模型对所述第一带噪语音信号进行降噪处理,以得到所述目标用户A的降噪语音信号,包括:According to the voice-related data of the target user A, the noise reduction processing is performed on the first noisy voice signal through a voice noise reduction model, so as to obtain the noise-reduced voice signal of the target user A, including:
    利用所述第一编码网络和所述第二编码网络分别对所述目标用户A的注册语音信号和所述第一带噪语音信号进行特征提取,以得到所述目标用户A的注册语音信号的特征向量和所述第一带噪语音信号的特征向量;performing feature extraction on the registered speech signal of the target user A and the first noisy speech signal by using the first encoding network and the second encoding network respectively, to obtain a feature vector of the registered speech signal of the target user A and a feature vector of the first noisy speech signal;
    根据所述目标用户A的注册语音信号的特征向量和所述第一带噪语音信号的特征向量得到第一特征向量;Obtaining a first feature vector according to the feature vector of the registered voice signal of the target user A and the feature vector of the first noisy voice signal;
    根据所述TCN和所述第一特征向量得到第二特征向量;obtaining a second eigenvector according to the TCN and the first eigenvector;
    根据所述第一解码网络和所述第二特征向量得到所述目标用户A的降噪语音信号。The noise-reduced speech signal of the target user A is obtained according to the first decoding network and the second feature vector.
  11. 根据权利要求6所述的方法,其特征在于,所述M个目标用户中第i个目标用户的相关数据包括所述第i个目标用户的注册语音信号,所述i为大于0且小于或者等于M的整数,所述语音降噪模型包括第一编码网络、第二编码网络、TCN和第一解码网络,The method according to claim 6, wherein the relevant data of an ith target user among the M target users comprises a registered speech signal of the ith target user, i is an integer greater than 0 and less than or equal to M, and the speech noise reduction model comprises a first encoding network, a second encoding network, a TCN, and a first decoding network,
    利用所述第一编码网络和所述第二编码网络分别对所述目标用户的注册语音信号和第一噪声信号进行特征提取,得到所述第i个目标用户的注册语音信号的特征向量和该第一噪声信号的特征向量;其中,所述第一噪声信号为不包含第1至i-1个目标用户的语音信号的第一带噪语音信号;performing feature extraction on the registered speech signal of the target user and a first noise signal by using the first encoding network and the second encoding network respectively, to obtain a feature vector of the registered speech signal of the ith target user and a feature vector of the first noise signal, wherein the first noise signal is the first noisy speech signal that does not contain the speech signals of the 1st to (i-1)th target users;
    根据所述第i个目标用户的注册语音信号的特征向量和所述第一噪声信号的特征向量得到第一特征向量;Obtaining a first feature vector according to the feature vector of the i-th target user's registered voice signal and the feature vector of the first noise signal;
    根据所述TCN和第一特征向量得到第二特征向量;obtaining a second eigenvector according to the TCN and the first eigenvector;
    根据所述第一解码网络和所述第二特征向量得到所述第i个目标用户的降噪语音信号和第二噪声信号,其中,所述第二噪声信号为不包含第1至i个目标用户的语音信号的第一带噪语音信号。obtaining a noise-reduced speech signal of the ith target user and a second noise signal according to the first decoding network and the second feature vector, wherein the second noise signal is the first noisy speech signal that does not contain the speech signals of the 1st to ith target users.
  12. 根据权利要求7所述的方法,其特征在于,对于所述M个目标用户的语音相关数据,每个目标用户的相关数据包括该目标用户的注册语音信号,所述语音降噪模型包括M个第一编码网络、第二编码网络、TCN、第一解码网络和M个第三解码网络;The method according to claim 7, wherein, for the speech-related data of the M target users, the relevant data of each target user comprises a registered speech signal of that target user, and the speech noise reduction model comprises M first encoding networks, a second encoding network, a TCN, a first decoding network, and M third decoding networks;
    所述根据所述M个目标用户的语音相关数据通过所述语音降噪模型对所述带噪语音进行降噪处理,以得到所述M个目标用户的降噪语音信号和所述干扰噪声信号,包括:the performing noise reduction processing on the noisy speech through the speech noise reduction model according to the speech-related data of the M target users, to obtain the noise-reduced speech signals of the M target users and the interference noise signal, comprises:
    利用所述M个第一编码网络分别对所述M个目标用户的注册语音信号进行特征提取,得到M个目标用户的注册语音信号的特征向量;利用所述第二编码网络对所述第一带噪语音信号进行特征提取,得到所述第一带噪语音信号的特征向量;performing feature extraction on the registered speech signals of the M target users by using the M first encoding networks respectively, to obtain feature vectors of the registered speech signals of the M target users; and performing feature extraction on the first noisy speech signal by using the second encoding network, to obtain a feature vector of the first noisy speech signal;
    根据所述M个目标用户的注册语音信号的特征向量和所述第一带噪语音信号的特征向量得到第一特征向量;Obtaining a first eigenvector according to the eigenvectors of the registration voice signals of the M target users and the eigenvectors of the first noisy voice signal;
    根据所述TCN和所述第一特征向量得到第二特征向量;obtaining a second eigenvector according to the TCN and the first eigenvector;
    根据所述M个第三解码网络中的每个第三解码网络、所述第二特征向量和与该第三解码网络对应的第一编码网络输出的特征向量得到M个目标用户的降噪语音信号;obtaining the noise-reduced speech signals of the M target users according to each of the M third decoding networks, the second feature vector, and the feature vector output by the first encoding network corresponding to that third decoding network;
    根据所述第一解码网络、所述第二特征向量与所述第一带噪语音信号的特征向量得到所述干扰噪声信号。The interference noise signal is obtained according to the first decoding network, the second feature vector and the feature vector of the first noisy speech signal.
  13. 根据权利要求1-4任一项所述的方法,其特征在于,所述目标用户的相关数据包括所述目标用户的语音拾取VPU信号,所述语音降噪模型包括预处理模块、第三编码网络、门控循环单元GRU、第二解码网络和后处理模块;The method according to any one of claims 1-4, wherein the relevant data of the target user comprises a voice pickup (VPU) signal of the target user, and the speech noise reduction model comprises a preprocessing module, a third encoding network, a gated recurrent unit (GRU), a second decoding network, and a post-processing module;
    所述根据所述目标语音相关数据通过语音降噪模型对所述第一带噪语音信号进行降噪处理,以得到所述目标用户的降噪语音信号,包括:The step of performing noise reduction processing on the first noisy speech signal through a speech noise reduction model according to the target speech-related data to obtain a noise-reduced speech signal of the target user includes:
    通过所述预处理模块分别对所述第一带噪语音信号和所述目标用户的VPU信号进行时频变换,以得到所述第一带噪语音信号的第一频域信号和所述VPU信号的第二频域信号;performing time-frequency transformation on the first noisy speech signal and the VPU signal of the target user respectively by the preprocessing module, to obtain a first frequency domain signal of the first noisy speech signal and a second frequency domain signal of the VPU signal;
    对所述第一频域信号和所述第二频域信号进行融合,以得到第一融合频域信号;merging the first frequency domain signal and the second frequency domain signal to obtain a first fused frequency domain signal;
    将所述第一融合频域信号先后经过所述第三编码网络、所述GRU和所述第二解码网络处理,以得到所述目标用户的语音信号的第三频域信号的掩膜;sequentially processing the first fused frequency domain signal through the third encoding network, the GRU, and the second decoding network to obtain a mask of a third frequency domain signal of the speech signal of the target user;
    通过所述后处理模块根据所述第三频域信号的掩膜对所述第一频域信号进行后处理,以得到所述第三频域信号;performing post-processing on the first frequency-domain signal by the post-processing module according to the mask of the third frequency-domain signal, to obtain the third frequency-domain signal;
    对所述第三频域信号进行频时变换,以得到所述目标用户的降噪语音信号;performing frequency-time transformation on the third frequency domain signal to obtain a noise-reduced speech signal of the target user;
    其中,所述第三编码模块和所述第二解码模块均是基于卷积层和频域变换模块FTB实现的。Wherein, both the third encoding module and the second decoding module are implemented based on a convolutional layer and a frequency domain transformation module FTB.
  14. 根据权利要求13所述的方法,其特征在于,The method according to claim 13, characterized in that,
    将所述第一融合频域信号先后经过所述第三编码网络、所述GRU和所述第二解码网络处理还得到所述第一频域信号的掩膜;Processing the first fused frequency domain signal successively through the third encoding network, the GRU, and the second decoding network to obtain a mask of the first frequency domain signal;
    通过所述后处理模块根据所述第一频域信号的掩膜对所述第一频域信号进行后处理,得到所述干扰噪声信号的第四频域信号;performing post-processing on the first frequency-domain signal by the post-processing module according to the mask of the first frequency-domain signal, to obtain a fourth frequency-domain signal of the interference noise signal;
    对所述第四频域信号进行频时变换,以得到所述干扰噪声信号。Perform frequency-time transformation on the fourth frequency domain signal to obtain the interference noise signal.
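Claims 13 and 14 describe a mask-based frequency-domain path: the mask for the target speech is applied to the noisy spectrum and transformed back to the time domain, and (per claim 14) a mask for the noisy signal itself yields the interference noise. The sketch below uses a single FFT frame and a hand-made ideal mask in place of the encoder/GRU/decoder stack; the mask values, signal construction, and the use of the complementary mask are assumptions made to keep the example self-contained:

```python
import numpy as np

def mask_denoise(noisy, speech_mask):
    """Apply a spectral mask to the first frequency domain signal (claim 13)
    and its complement to recover the interference noise (claim 14)."""
    spec = np.fft.rfft(noisy)  # time-frequency transform of the noisy signal
    speech = np.fft.irfft(speech_mask * spec, n=len(noisy))         # frequency-time transform
    noise = np.fft.irfft((1.0 - speech_mask) * spec, n=len(noisy))  # complementary mask
    return speech, noise

t = np.arange(128) / 128.0
target = np.sin(2 * np.pi * 4 * t)        # "target speech" at bin 4
interf = 0.5 * np.sin(2 * np.pi * 30 * t) # "interference" at bin 30
mask = np.zeros(65)                        # rfft of 128 samples -> 65 bins
mask[4] = 1.0                              # ideal mask; normally predicted by the GRU stack
speech, noise = mask_denoise(target + interf, mask)
```

In the real system the mask is estimated per time-frequency bin from the fused noisy/VPU spectra, so it takes values between 0 and 1 rather than the hard 0/1 values used here.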
  15. 根据权利要求5所述的方法,其特征在于,所述目标用户A的相关数据包括所述目标用户A的VPU信号,所述语音降噪模型包括预处理模块、第三编码网络、GRU、第二解码网络和后处理模块,所述根据所述目标用户A的语音相关数据通过语音降噪模型对所述第一带噪语音信号进行降噪处理,以得到所述目标用户A的降噪语音信号,包括:The method according to claim 5, wherein the relevant data of the target user A comprises a VPU signal of the target user A, and the speech noise reduction model comprises a preprocessing module, a third encoding network, a GRU, a second decoding network, and a post-processing module; and the performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the speech-related data of the target user A, to obtain the noise-reduced speech signal of the target user A, comprises:
    通过所述预处理模块分别对所述第一带噪语音信号和所述目标用户A的VPU信号进行时频变换,以得到所述第一带噪语音信号的第一频域信号和所述目标用户A的VPU信号的第九频域信号;Time-frequency transformation is performed on the first noisy speech signal and the VPU signal of the target user A through the preprocessing module to obtain the first frequency domain signal of the first noisy speech signal and the target A ninth frequency domain signal of the VPU signal of user A;
    对所述第一频域信号和所述第九频域信号进行融合,得到第二融合频域信号;merging the first frequency domain signal and the ninth frequency domain signal to obtain a second fused frequency domain signal;
    将所述第二融合频域信号先后经过所述第三编码网络、所述GRU和所述第二解码网络处理,以得到所述目标用户A的语音信号的第十频域信号的掩膜;sequentially processing the second fused frequency domain signal through the third encoding network, the GRU, and the second decoding network to obtain a mask of the tenth frequency domain signal of the voice signal of the target user A;
    通过所述后处理模块根据所述第十频域信号的掩膜对所述第一频域信号进行后处理,得到所述第十频域信号;performing post-processing on the first frequency domain signal by the post-processing module according to the mask of the tenth frequency domain signal, to obtain the tenth frequency domain signal;
    对所述第十频域信号进行频时变换,以得到所述目标用户A的降噪语音信号;performing frequency-time transformation on the tenth frequency domain signal to obtain the noise-reduced speech signal of the target user A;
    其中,所述第三编码模块和所述第二解码模块均是基于卷积层和FTB实现的。Wherein, both the third encoding module and the second decoding module are implemented based on convolutional layers and FTB.
  16. The method according to claim 6, wherein the relevant data of an i-th target user among the M target users comprises a VPU signal of the i-th target user, i being an integer greater than 0 and less than or equal to M, and the method comprises:
    performing, by the preprocessing module, time-frequency transformation on both a first noise signal and the VPU signal of the i-th target user, to obtain an eleventh frequency-domain signal of the first noise signal and a twelfth frequency-domain signal of the VPU signal of the i-th target user;
    fusing the eleventh frequency-domain signal and the twelfth frequency-domain signal, to obtain a third fused frequency-domain signal, wherein the first noise signal is the first noisy speech signal that does not contain the speech signals of the 1st to (i-1)-th target users;
    processing the third fused frequency-domain signal successively through the third encoding network, the GRU, and the second decoding network, to obtain a mask of a thirteenth frequency-domain signal of the speech signal of the i-th target user and a mask of the eleventh frequency-domain signal;
    performing, by the post-processing module, post-processing on the eleventh frequency-domain signal according to the mask of the thirteenth frequency-domain signal and the mask of the eleventh frequency-domain signal, to obtain the thirteenth frequency-domain signal and a fourteenth frequency-domain signal of a second noise signal; and
    performing frequency-time transformation on the thirteenth frequency-domain signal and the fourteenth frequency-domain signal, to obtain the noise-reduced speech signal of the i-th target user and the second noise signal, the second noise signal being the first noisy speech signal that does not contain the speech signals of the 1st to i-th target users;
    wherein both the third encoding module and the second decoding module are implemented based on a convolutional layer and an FTB.
  17. The method according to any one of claims 6, 7, 11, 12, and 16, wherein the performing enhancement processing on the noise-reduced speech signal of the target user based on the speech enhancement coefficient of the target user, to obtain the enhanced speech signal of the target user, comprises:
    for the target user A among the M target users, performing enhancement processing on the noise-reduced speech signal of the target user A based on the speech enhancement coefficient of the target user A, to obtain the enhanced speech signal of the target user A, wherein the ratio of the amplitude of the enhanced speech signal of the target user A to the amplitude of the noise-reduced speech signal of the target user A is the speech enhancement coefficient of the target user A; and
    the fusing the interference noise suppression signal with the enhanced speech signal of the target user, to obtain an output signal, comprises:
    fusing the enhanced speech signals of the M target users with the interference noise suppression signal, to obtain the output signal.
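The amplitude-ratio definition in claim 17 pins down the enhancement step exactly: each user's noise-reduced signal is scaled by that user's enhancement coefficient, and the scaled signals are fused (summed) with the suppressed interference noise. The sketch below illustrates this arithmetic only; the function and variable names are illustrative, not taken from the patent.

```python
import numpy as np

def enhance_and_fuse(denoised_signals, coefficients, noise_suppressed):
    # Scale each target user's noise-reduced signal by its enhancement
    # coefficient: by definition the coefficient equals the ratio of the
    # enhanced amplitude to the noise-reduced amplitude.
    enhanced = [c * s for c, s in zip(coefficients, denoised_signals)]
    # Fuse the M enhanced signals with the interference noise suppression
    # signal to form the output signal.
    output = noise_suppressed + np.sum(enhanced, axis=0)
    return enhanced, output

# Toy example: two target users and a residual noise signal.
s1 = np.array([0.1, -0.2, 0.3])
s2 = np.array([0.05, 0.05, -0.05])
noise = np.array([0.01, 0.01, 0.01])
enhanced, out = enhance_and_fuse([s1, s2], [2.0, 1.5], noise)
```

Here simple addition stands in for "fusing"; the claims do not specify the fusion operation beyond combining the signals.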
  18. The method according to any one of claims 1-4, wherein the relevant data of the target user comprises a VPU signal of the target user, and the method further comprises: acquiring an in-ear sound signal of the target user;
    wherein the performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the target-speech-related data, to obtain the noise-reduced speech signal of the target user, comprises:
    performing time-frequency transformation on the first noisy speech signal and the in-ear sound signal respectively, to obtain the first frequency-domain signal of the first noisy speech signal and a fifth frequency-domain signal of the in-ear sound signal;
    obtaining a covariance matrix of the first noisy speech signal and the in-ear sound signal according to the VPU signal of the target user, the first frequency-domain signal, and the fifth frequency-domain signal;
    obtaining a first minimum variance distortionless response (MVDR) weight based on the covariance matrix;
    obtaining a sixth frequency-domain signal of the first noisy speech signal and a seventh frequency-domain signal of the in-ear sound signal based on the first MVDR weight, the first frequency-domain signal, and the fifth frequency-domain signal;
    obtaining an eighth frequency-domain signal of the noise-reduced speech signal according to the sixth frequency-domain signal and the seventh frequency-domain signal; and
    performing frequency-time transformation on the eighth frequency-domain signal, to obtain the noise-reduced speech signal.
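Claims 18 and 20 name an MVDR weight derived from a covariance matrix but do not spell out its form. The standard MVDR solution, shown below as a sketch, is w = R⁻¹d / (dᴴ R⁻¹ d), where R is the per-frequency covariance matrix across the channels and d is a steering vector; treating the VPU-informed target direction as the steering vector is an assumption here, not a detail taken from the claims.

```python
import numpy as np

def mvdr_weights(R, d):
    # Standard MVDR beamformer weight: minimize output power subject to
    # the distortionless constraint w^H d = 1.
    R_inv = np.linalg.inv(R)
    numerator = R_inv @ d
    return numerator / (d.conj().T @ numerator)

# Toy 2-channel example at one frequency bin: identity covariance and a
# unit steering vector (both hypothetical values for illustration).
R = np.eye(2, dtype=complex)
d = np.array([1.0 + 0j, 1.0 + 0j])
w = mvdr_weights(R, d)
# Applying the weight to a multichannel bin would be: y = w.conj() @ x
```

The distortionless constraint wᴴd = 1 is what makes the target speech pass through unchanged while interference is minimized.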
  19. The method according to claim 18, further comprising:
    obtaining the interference noise signal according to the noise-reduced speech signal and the first noisy speech signal.
  20. The method according to claim 5, wherein the relevant data of the target user A comprises the VPU signal of the target user A, and the method further comprises: acquiring an in-ear sound signal of the target user A;
    wherein the performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the speech-related data of the target user A, to obtain the noise-reduced speech signal of the target user A, comprises:
    performing time-frequency transformation on the first noisy speech signal and the in-ear sound signal of the target user A respectively, to obtain the first frequency-domain signal of the first noisy speech signal and a fifteenth frequency-domain signal of the in-ear sound signal of the target user A;
    obtaining a covariance matrix of the first noisy speech signal and the in-ear sound signal of the target user A according to the VPU signal of the target user A, the first frequency-domain signal, and the fifteenth frequency-domain signal;
    obtaining a second MVDR weight based on the covariance matrix;
    obtaining a sixteenth frequency-domain signal of the first noisy speech signal and a seventeenth frequency-domain signal of the in-ear sound signal of the target user A based on the second MVDR weight, the first frequency-domain signal, and the fifteenth frequency-domain signal; obtaining an eighteenth frequency-domain signal of the noise-reduced speech signal of the target user A according to the sixteenth frequency-domain signal and the seventeenth frequency-domain signal; and
    performing frequency-time transformation on the eighteenth frequency-domain signal, to obtain the noise-reduced speech signal of the target user A.
  21. The method according to any one of claims 8-12, further comprising:
    acquiring a first noise segment and a second noise segment of the environment in which the terminal device is located, the first noise segment and the second noise segment being temporally consecutive noise segments;
    acquiring a signal-to-noise ratio (SNR) and a sound pressure level (SPL) of the first noise segment;
    if the SNR of the first noise segment is greater than a first threshold and the SPL of the first noise segment is greater than a second threshold, extracting a first temporary feature vector of the first noise segment;
    performing noise reduction processing on the second noise segment based on the first temporary feature vector, to obtain a second noise-reduced segment;
    performing impairment assessment based on the second noise-reduced segment and the second noise segment, to obtain a first impairment score; and
    if the first impairment score is not greater than a third threshold, entering the PNR mode;
    wherein the acquiring a first noisy speech signal comprises:
    determining the first noisy speech signal from a noise signal generated after the first noise segment; and
    the feature vector of the enrollment speech signal comprises the first temporary feature vector.
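The decision flow of claim 21 can be summarized as a two-stage gate: the first noise segment must be clean and loud enough (SNR and SPL above thresholds) for a temporary feature vector to be extracted, and the resulting impairment score on the second segment must stay low for the device to enter PNR mode. The sketch below captures only this control flow; all threshold names and numeric values are illustrative, since the claim leaves them unspecified.

```python
def should_enter_pnr(snr, spl, impairment_score,
                     snr_thresh, spl_thresh, impairment_thresh):
    # Stage 1: only a segment with sufficient SNR and SPL qualifies for
    # temporary feature-vector extraction.
    if snr > snr_thresh and spl > spl_thresh:
        # Stage 2 (extraction and denoising of the second segment would
        # happen here): enter PNR mode only if the impairment score from
        # the denoised second segment is not greater than the threshold.
        return impairment_score <= impairment_thresh
    return False

# Illustrative values: clean, loud segment with a low impairment score.
entered = should_enter_pnr(snr=20.0, spl=70.0, impairment_score=0.2,
                           snr_thresh=10.0, spl_thresh=60.0,
                           impairment_thresh=0.5)
```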
  22. The method according to claim 21, wherein if the first impairment score is not greater than the third threshold, the method further comprises:
    sending first prompt information through the terminal device, the first prompt information being used to prompt whether to cause the terminal device to enter the PNR mode; and
    entering the PNR mode only after an operation instruction of the target user agreeing to enter the PNR mode is detected.
  23. The method according to claim 21 or 22, further comprising:
    acquiring a second noisy speech signal when it is detected that the terminal device is used again;
    when the SNR of the second noisy speech signal is lower than a fourth threshold, performing noise reduction processing on the second noisy speech signal according to the first temporary feature vector, to obtain a noise-reduced speech signal of the current user;
    performing impairment assessment based on the noise-reduced speech signal of the current user and the second noisy speech signal, to obtain a second impairment score;
    when the second impairment score is not greater than a fifth threshold, sending second prompt information through the terminal device, the second prompt information being used to prompt the current user that the terminal device is able to enter the PNR mode;
    after an operation instruction of the current user agreeing to enter the PNR mode is detected, causing the terminal device to enter the PNR mode to perform noise reduction processing on a third noisy speech signal, the third noisy speech signal being acquired after the second noisy speech signal; and
    after an operation instruction of the current user refusing to enter the PNR mode is detected, performing noise reduction processing on the third noisy speech signal in a non-PNR mode.
  24. The method according to claim 21 or 22, further comprising:
    if the SNR of the first noise segment is not greater than the first threshold or the SPL of the first noise segment is not greater than the second threshold, and the terminal device has stored a reference temporary voiceprint feature vector, acquiring a third noise segment;
    performing noise reduction processing on the third noise segment according to the reference temporary voiceprint feature vector, to obtain a third noise-reduced segment;
    performing impairment assessment according to the third noise segment and the third noise-reduced segment, to obtain a third impairment score;
    if the third impairment score is greater than a sixth threshold and the SNR of the third noise segment is less than a seventh threshold, or the third impairment score is greater than an eighth threshold and the SNR of the third noise segment is not less than the seventh threshold, sending third prompt information through the terminal device, the third prompt information being used to prompt the current user that the terminal device is able to enter the PNR mode; and
    after an operation instruction of the current user agreeing to enter the PNR mode is detected, causing the terminal device to enter the PNR mode to perform noise reduction processing on a fourth noisy speech signal; after an operation instruction of the current user refusing to enter the PNR mode is detected, performing noise reduction processing on the fourth noisy speech signal in a non-PNR mode;
    wherein the fourth noisy speech signal is determined from a noise signal generated after the third noise segment.
  25. The method according to any one of claims 8-12, further comprising:
    acquiring a first noise segment and a second noise segment of the environment in which the terminal device is located, the first noise segment and the second noise segment being temporally consecutive noise segments;
    acquiring a signal collected, for the environment in which the terminal device is located, by a microphone array of an auxiliary device of the terminal device; calculating a direction of arrival (DOA) and an SPL of the first noise segment using the collected signal; if the DOA of the first noise segment is greater than a ninth threshold and less than a tenth threshold, and the SPL of the first noise segment is greater than an eleventh threshold, extracting a second temporary feature vector of the first noise segment; performing noise reduction processing on the second noise segment based on the second temporary feature vector, to obtain a third noise-reduced segment; performing impairment assessment based on the third noise-reduced segment and the second noise segment, to obtain a fourth impairment score; and if the fourth impairment score is greater than a twelfth threshold, entering the PNR mode;
    wherein the acquiring a first noisy speech signal comprises:
    determining the first noisy speech signal from a noise signal generated after the first noise segment; and
    the feature vector of the enrollment speech signal comprises the second temporary feature vector.
  26. The method according to claim 25, wherein if the fourth impairment score is not greater than the twelfth threshold, the method further comprises:
    sending fourth prompt information through the terminal device, the fourth prompt information being used to prompt whether to cause the terminal device to enter the PNR mode; and
    entering the PNR mode only after an operation instruction of the target user agreeing to enter the PNR mode is detected.
  27. The method according to any one of claims 1-20, further comprising:
    when it is detected that the terminal device is in a handheld call state, not entering the PNR mode;
    when it is detected that the terminal device is in a hands-free call state, entering the PNR mode, wherein the target user is the owner of the terminal device or the user who is using the terminal device;
    when it is detected that the terminal device is in a video call state, entering the PNR mode, wherein the target user is the owner of the terminal device or the user closest to the terminal device;
    when it is detected that the terminal device is connected to an earphone for a call, entering the PNR mode, wherein the target user is the user wearing the earphone, and the first noisy speech signal and the target-speech-related data are collected through the earphone; or
    when it is detected that the terminal device is connected to a smart large-screen device, a smart watch, or a vehicle-mounted device, entering the PNR mode, wherein the target user is the owner of the terminal device or the user who is using the terminal device, and the first noisy speech signal and the target-speech-related data are collected by audio collection hardware of the smart large-screen device, the smart watch, or the vehicle-mounted device.
  28. The method according to any one of claims 1-20, further comprising:
    acquiring a decibel value of an audio signal of the current environment; and
    if the decibel value of the audio signal of the current environment exceeds a preset decibel value and a PNR function corresponding to an application started by the terminal device is not enabled, enabling the PNR function corresponding to the application started by the terminal device, and entering the PNR mode.
  29. The method according to any one of claims 1-20, wherein the terminal device comprises a display screen, and the display screen comprises a plurality of display areas,
    wherein each of the plurality of display areas displays a label and a corresponding function key, and the function key is used to turn on or off the PNR function of the function or application indicated by the corresponding label.
  30. The method according to any one of claims 1-20, wherein when voice data is transmitted between the terminal device and another terminal device, the method further comprises:
    receiving a speech enhancement request sent by the other terminal device, the speech enhancement request being used to instruct the terminal device to enable the PNR function of the call function;
    in response to the speech enhancement request, sending third prompt information through the terminal device, the third prompt information being used to prompt whether to cause the terminal device to enable the PNR function of the call function;
    after an operation instruction confirming enabling of the PNR function of the call function is detected, enabling the PNR function of the call function and entering the PNR mode; and
    sending a speech enhancement response message to the other terminal device, the speech enhancement response message being used to indicate that the terminal device has enabled the PNR function of the call function.
  31. The method according to any one of claims 5-7, 10-12, and 17, wherein when the terminal device starts a video call or video recording function, a display interface of the terminal device comprises a first area and a second area, the first area being used to display video call content or video recording content, the second area being used to display M controls and M corresponding labels, the M controls corresponding one-to-one to the M target users, wherein each of the M controls comprises a sliding button and a sliding bar, and the sliding button is controlled to slide on the sliding bar to adjust the speech enhancement coefficient of the target user indicated by the label corresponding to that control.
  32. The method according to any one of claims 5-7, 10-12, and 17, wherein when the terminal device starts a video call or video recording function, a display interface of the terminal device comprises a first area, the first area being used to display video call content or video recording content; and
    when an operation on any object in the video call content or the video recording content is detected, a control corresponding to the object is displayed in the first area, the control comprising a sliding button and a sliding bar, and the sliding button is controlled to slide on the sliding bar to adjust the speech enhancement coefficient of the object.
  33. The method according to any one of claims 1-4 and 8, wherein when the terminal device is an intelligent interactive device, the target-speech-related data comprises a speech signal containing a wake-up word, and the first noisy speech signal comprises an audio signal containing a command word.
  34. A terminal device, comprising:
    an acquisition unit, configured to acquire a first noisy speech signal and target-speech-related data after the terminal device enters a person-specific noise reduction (PNR) mode, wherein the first noisy speech signal contains an interference noise signal and a speech signal of a target user, and the target-speech-related data is used to indicate speech features of the target user; and
    a noise reduction unit, configured to perform noise reduction processing on the first noisy speech signal according to the target-speech-related data and a speech noise reduction model, to obtain a noise-reduced speech signal of the target user, wherein the speech noise reduction model is implemented based on a neural network.
  35. The terminal device according to claim 34, wherein
    the acquisition unit is further configured to acquire a speech enhancement coefficient of the target user; and
    the noise reduction unit is further configured to perform enhancement processing on the noise-reduced speech signal of the target user based on the speech enhancement coefficient of the target user, to obtain an enhanced speech signal of the target user, wherein the ratio of the amplitude of the enhanced speech signal of the target user to the amplitude of the noise-reduced speech signal of the target user is the speech enhancement coefficient of the target user.
  36. The terminal device according to claim 35, wherein
    the acquisition unit is further configured to acquire an interference noise suppression coefficient after the interference noise signal is also obtained through the noise reduction processing; and
    the noise reduction unit is further configured to suppress the interference noise signal based on the interference noise suppression coefficient, to obtain an interference noise suppression signal, wherein the ratio of the amplitude of the interference noise suppression signal to the amplitude of the interference noise signal is the interference noise suppression coefficient; and to fuse the interference noise suppression signal with the enhanced speech signal of the target user, to obtain an output signal.
  37. The terminal device according to claim 34, wherein
    the acquisition unit is further configured to acquire an interference noise suppression coefficient after the interference noise signal is also obtained through the noise reduction processing; and
    the noise reduction unit is further configured to suppress the interference noise signal based on the interference noise suppression coefficient, to obtain an interference noise suppression signal, wherein the ratio of the amplitude of the interference noise suppression signal to the amplitude of the interference noise signal is the interference noise suppression coefficient; and to fuse the interference noise suppression signal with the noise-reduced speech signal of the target user, to obtain an output signal.
  38. The terminal device according to claim 35, wherein there are M target users, the target-speech-related data comprises speech-related data of the M target users, the noise-reduced speech signal of the target user comprises noise-reduced speech signals of the M target users, the speech enhancement coefficient of the target user comprises speech enhancement coefficients of the M target users, and M is an integer greater than 1;
    in the aspect of performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the target-speech-related data to obtain the noise-reduced speech signal of the target user, the noise reduction unit is specifically configured to:
    for any target user A among the M target users, perform noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the speech-related data of the target user A, to obtain the noise-reduced speech signal of the target user A;
    in the aspect of performing enhancement processing on the noise-reduced speech signal of the target user based on the speech enhancement coefficient of the target user to obtain the enhanced speech signal of the target user, the noise reduction unit is specifically configured to:
    perform enhancement processing on the noise-reduced speech signal of the target user A based on the speech enhancement coefficient of the target user A, to obtain the enhanced speech signal of the target user A, wherein the ratio of the amplitude of the enhanced speech signal of the target user A to the amplitude of the noise-reduced speech signal of the target user A is the speech enhancement coefficient of the target user A; and
    the noise reduction unit is further configured to obtain an output signal based on the enhanced speech signals of the M target users.
  39. The terminal device according to claim 36, wherein there are M target users, the target speech-related data comprises speech-related data of the M target users, the noise-reduced speech signal of the target user comprises noise-reduced speech signals of the M target users, and M is an integer greater than 1;
    in the aspect of performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the target speech-related data to obtain the noise-reduced speech signal of the target user and the interference noise signal, the noise reduction unit is specifically configured to:
    perform noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the speech-related data of the first target user among the M target users, to obtain the noise-reduced speech signal of the first target user and a first noisy speech signal that does not contain the speech signal of the first target user;
    perform noise reduction processing on the first noisy speech signal that does not contain the speech signal of the first target user through the speech noise reduction model according to the speech-related data of the second target user among the M target users, to obtain the noise-reduced speech signal of the second target user and a first noisy speech signal that contains neither the speech signal of the first target user nor the speech signal of the second target user; and
    repeat the above process until noise reduction processing is performed, through the speech noise reduction model according to the speech-related data of the M-th target user, on the first noisy speech signal that does not contain the speech signals of the 1st to (M-1)-th target users, to obtain the noise-reduced speech signal of the M-th target user and the interference noise signal.
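The successive-extraction scheme of claim 39 can be sketched as a loop that peels one user's speech off the mixture per pass. `denoise_step` below is a toy stand-in for the learned speech noise reduction model (it simply subtracts a known user component), so only the control flow is meaningful:

```python
def denoise_step(mixture, user_component):
    # Stand-in for the speech noise reduction model: returns this user's
    # speech and the residual signal with that speech removed.
    residual = [m - u for m, u in zip(mixture, user_component)]
    return user_component, residual

def iterative_denoise(first_noisy, user_components):
    # Users 1..M are extracted in turn; what remains after the M-th pass
    # plays the role of the interference noise signal.
    denoised, residual = [], first_noisy
    for component in user_components:
        speech, residual = denoise_step(residual, component)
        denoised.append(speech)
    return denoised, residual

speeches, noise = iterative_denoise([3.0, 3.0], [[1.0, 1.0], [1.0, 2.0]])
```

The point of the structure is that each pass consumes the previous pass's residual, so pass i never has to re-separate users 1..i-1.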
  40. The terminal device according to claim 36, wherein there are M target users, the target speech-related data comprises speech-related data of the M target users, the noise-reduced speech signal of the target user comprises noise-reduced speech signals of the M target users, and M is an integer greater than 1;
    in the aspect of performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the target speech-related data to obtain the noise-reduced speech signal of the target user and the interference noise signal, the noise reduction unit is specifically configured to:
    perform noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the speech-related data of the M target users, to obtain the noise-reduced speech signals of the M target users and the interference noise signal.
  41. The terminal device according to any one of claims 34-37, wherein there are M target users, the related data of the target user comprises a registered speech signal of the target user, the registered speech signal of the target user is a speech signal of the target user collected in an environment whose noise decibel value is lower than a preset value, and the speech noise reduction model comprises a first encoding network, a second encoding network, a temporal convolutional network (TCN), and a first decoding network;
    in the aspect of performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the target speech-related data to obtain the noise-reduced speech signal of the target user, the noise reduction unit is specifically configured to:
    perform feature extraction on the registered speech signal of the target user and the first noisy speech signal by using the first encoding network and the second encoding network respectively, to obtain a feature vector of the registered speech signal of the target user and a feature vector of the first noisy speech signal;
    obtain a first feature vector according to the feature vector of the registered speech signal of the target user and the feature vector of the first noisy speech signal;
    obtain a second feature vector according to the TCN and the first feature vector; and
    obtain the noise-reduced speech signal of the target user according to the first decoding network and the second feature vector.
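Claim 41's encode → fuse → TCN → decode flow can be expressed as a four-stage pipeline. The callables below are placeholders for the learned networks (real encoders, TCN, and decoder are trained models), so only the data flow is illustrated; the fusion shown here is list concatenation, one plausible choice the claims do not pin down:

```python
def denoise_pipeline(enroll_sig, noisy_sig, enc1, enc2, tcn, dec):
    f_enroll = enc1(enroll_sig)     # feature vector of the registered speech
    f_noisy = enc2(noisy_sig)       # feature vector of the first noisy speech
    first_fv = f_enroll + f_noisy   # first feature vector (here: concatenation)
    second_fv = tcn(first_fv)       # second feature vector via the TCN
    return dec(second_fv)           # noise-reduced speech of the target user

# Identity stand-ins just to exercise the flow:
out = denoise_pipeline([1, 2], [3, 4],
                       enc1=lambda x: x, enc2=lambda x: x,
                       tcn=lambda x: x, dec=lambda x: x)  # -> [1, 2, 3, 4]
```

Claim 42's variant is the same pipeline with a second decoder head on `second_fv` producing the interference noise signal.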
  42. The terminal device according to claim 41, wherein the noise reduction unit is further configured to:
    further obtain the interference noise signal according to the first decoding network and the second feature vector.
  43. The terminal device according to claim 38, wherein the related data of the target user A comprises a registered speech signal of the target user A, the registered speech signal of the target user A is a speech signal of the target user A collected in an environment whose noise decibel value is lower than a preset value, and the speech noise reduction model comprises a first encoding network, a second encoding network, a TCN, and a first decoding network;
    in the aspect of performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the speech-related data of the target user A to obtain the noise-reduced speech signal of the target user A, the noise reduction unit is specifically configured to:
    perform feature extraction on the registered speech signal of the target user A and the first noisy speech signal by using the first encoding network and the second encoding network respectively, to obtain a feature vector of the registered speech signal of the target user A and a feature vector of the first noisy speech signal;
    obtain a first feature vector according to the feature vector of the registered speech signal of the target user A and the feature vector of the first noisy speech signal;
    obtain a second feature vector according to the TCN and the first feature vector; and
    obtain the noise-reduced speech signal of the target user A according to the first decoding network and the second feature vector.
  44. The terminal device according to claim 39, wherein the related data of the i-th target user among the M target users comprises a registered speech signal of the i-th target user, i is an integer greater than 0 and less than or equal to M, the speech noise reduction model comprises a first encoding network, a second encoding network, a TCN, and a first decoding network, and the noise reduction unit is specifically configured to:
    perform feature extraction on the registered speech signal of the i-th target user and a first noise signal by using the first encoding network and the second encoding network respectively, to obtain a feature vector of the registered speech signal of the i-th target user and a feature vector of the first noise signal, wherein the first noise signal is the first noisy speech signal that does not contain the speech signals of the 1st to (i-1)-th target users;
    obtain a first feature vector according to the feature vector of the registered speech signal of the i-th target user and the feature vector of the first noise signal;
    obtain a second feature vector according to the TCN and the first feature vector; and
    obtain, according to the first decoding network and the second feature vector, the noise-reduced speech signal of the i-th target user and a second noise signal, wherein the second noise signal is the first noisy speech signal that does not contain the speech signals of the 1st to i-th target users.
  45. The terminal device according to claim 40, wherein, among the speech-related data of the M target users, the related data of each target user comprises a registered speech signal of that target user, the registered speech signal is a speech signal of that target user collected in an environment whose noise decibel value is lower than a preset value, and the speech noise reduction model comprises M first encoding networks, a second encoding network, a TCN, a first decoding network, and M third decoding networks;
    in the aspect of performing noise reduction processing on the noisy speech through the speech noise reduction model according to the speech-related data of the M target users to obtain the noise-reduced speech signals of the M target users and the interference noise signal, the noise reduction unit is specifically configured to:
    perform feature extraction on the registered speech signals of the M target users by using the M first encoding networks respectively, to obtain feature vectors of the registered speech signals of the M target users, and perform feature extraction on the first noisy speech signal by using the second encoding network, to obtain a feature vector of the first noisy speech signal;
    obtain a first feature vector according to the feature vectors of the registered speech signals of the M target users and the feature vector of the first noisy speech signal;
    obtain a second feature vector according to the TCN and the first feature vector;
    obtain the noise-reduced speech signals of the M target users according to each third decoding network among the M third decoding networks, the second feature vector, and the feature vector output by the first encoding network corresponding to that third decoding network; and
    obtain the interference noise signal according to the first decoding network, the second feature vector, and the feature vector of the first noisy speech signal.
  46. The terminal device according to any one of claims 34-37, wherein the related data of the target user comprises a voice pickup (VPU) signal of the target user, and the speech noise reduction model comprises a preprocessing module, a third encoding network, a gated recurrent unit (GRU), a second decoding network, and a postprocessing module;
    in the aspect of performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the target speech-related data to obtain the noise-reduced speech signal of the target user, the noise reduction unit is specifically configured to:
    perform time-frequency transformation on the first noisy speech signal and the VPU signal of the target user respectively through the preprocessing module, to obtain a first frequency-domain signal of the first noisy speech signal and a second frequency-domain signal of the VPU signal;
    fuse the first frequency-domain signal and the second frequency-domain signal to obtain a first fused frequency-domain signal;
    process the first fused frequency-domain signal successively through the third encoding network, the GRU, and the second decoding network, to obtain a mask of a third frequency-domain signal of the speech signal of the target user;
    perform postprocessing on the first frequency-domain signal through the postprocessing module according to the mask of the third frequency-domain signal, to obtain the third frequency-domain signal; and
    perform frequency-time transformation on the third frequency-domain signal to obtain the noise-reduced speech signal of the target user;
    wherein the third encoding network and the second decoding network are both implemented based on convolutional layers and a frequency transformation block (FTB).
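Claim 46's mask-then-postprocess step amounts to element-wise masking of the noisy frequency-domain signal. A minimal sketch, noting that a real system applies a learned mask to complex STFT frames, whereas the values below are invented magnitudes:

```python
def apply_mask(noisy_spec, mask):
    # Postprocessing: element-wise product of the predicted mask with the
    # noisy frequency-domain signal yields the target frequency-domain signal.
    return [m * x for m, x in zip(mask, noisy_spec)]

# A mask near 1 keeps a bin (speech-dominated); near 0 suppresses it (noise).
target_spec = apply_mask([4.0, 2.0, 0.5], [1.0, 0.5, 0.0])  # -> [4.0, 1.0, 0.0]
```

Claim 47's noise branch is the same operation with the complementary noise mask, followed by the inverse (frequency-time) transform.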
  47. The terminal device according to claim 46, wherein the noise reduction unit is specifically configured to:
    process the first fused frequency-domain signal successively through the third encoding network, the GRU, and the second decoding network, to further obtain a mask of the first frequency-domain signal;
    perform postprocessing on the first frequency-domain signal through the postprocessing module according to the mask of the first frequency-domain signal, to obtain a fourth frequency-domain signal of the interference noise signal; and
    perform frequency-time transformation on the fourth frequency-domain signal to obtain the interference noise signal.
  48. The terminal device according to claim 38, wherein the related data of the target user A comprises a VPU signal of the target user A, and the speech noise reduction model comprises a preprocessing module, a third encoding network, a GRU, a second decoding network, and a postprocessing module; in the aspect of performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the speech-related data of the target user A to obtain the noise-reduced speech signal of the target user A, the noise reduction unit is specifically configured to:
    perform time-frequency transformation on the first noisy speech signal and the VPU signal of the target user A respectively through the preprocessing module, to obtain a first frequency-domain signal of the first noisy speech signal and a ninth frequency-domain signal of the VPU signal of the target user A;
    fuse the first frequency-domain signal and the ninth frequency-domain signal to obtain a second fused frequency-domain signal;
    process the second fused frequency-domain signal successively through the third encoding network, the GRU, and the second decoding network, to obtain a mask of a tenth frequency-domain signal of the speech signal of the target user A;
    perform postprocessing on the first frequency-domain signal through the postprocessing module according to the mask of the tenth frequency-domain signal, to obtain the tenth frequency-domain signal; and
    perform frequency-time transformation on the tenth frequency-domain signal to obtain the noise-reduced speech signal of the target user A;
    wherein the third encoding network and the second decoding network are both implemented based on convolutional layers and the FTB.
  49. The terminal device according to claim 39, wherein the related data of the i-th target user among the M target users comprises a VPU signal of the i-th target user, i is an integer greater than 0 and less than or equal to M, and the noise reduction unit is specifically configured to:
    perform time-frequency transformation on a first noise signal and the VPU signal of the i-th target user through the preprocessing module, to obtain an eleventh frequency-domain signal of the first noise signal and a twelfth frequency-domain signal of the VPU signal of the i-th target user;
    fuse the eleventh frequency-domain signal and the twelfth frequency-domain signal to obtain a third fused frequency-domain signal, wherein the first noise signal is the first noisy speech signal that does not contain the speech signals of the 1st to (i-1)-th target users;
    process the third fused frequency-domain signal successively through the third encoding network, the GRU, and the second decoding network, to obtain a mask of a thirteenth frequency-domain signal of the speech signal of the i-th target user and a mask of the eleventh frequency-domain signal;
    perform postprocessing on the eleventh frequency-domain signal through the postprocessing module according to the mask of the thirteenth frequency-domain signal and the mask of the eleventh frequency-domain signal, to obtain the thirteenth frequency-domain signal and a fourteenth frequency-domain signal of a second noise signal; and
    perform frequency-time transformation on the thirteenth frequency-domain signal and the fourteenth frequency-domain signal, to obtain the noise-reduced speech signal of the i-th target user and the second noise signal, the second noise signal being the first noisy speech signal that does not contain the speech signals of the 1st to i-th target users;
    wherein the third encoding network and the second decoding network are both implemented based on convolutional layers and the FTB.
  50. The terminal device according to any one of claims 39, 40, 44, 45, and 49, wherein, in the aspect of performing enhancement processing on the noise-reduced speech signal of the target user based on the speech enhancement coefficient of the target user to obtain the enhanced speech signal of the target user, the noise reduction unit is specifically configured to:
    for any target user A among the M target users, perform enhancement processing on the noise-reduced speech signal of the target user A based on the speech enhancement coefficient of the target user A, to obtain the enhanced speech signal of the target user A, wherein the ratio of the amplitude of the enhanced speech signal of the target user A to the amplitude of the noise-reduced speech signal of the target user A is the speech enhancement coefficient of the target user A;
    in the aspect of fusing the interference noise suppression signal with the enhanced speech signal of the target user to obtain the output signal, the noise reduction unit is specifically configured to:
    fuse the enhanced speech signals of the M target users with the interference noise suppression signal to obtain the output signal.
  51. The terminal device according to any one of claims 34-37, wherein the related data of the target user comprises the VPU signal of the target user, and the obtaining unit is further configured to obtain an in-ear sound signal of the target user;
    in the aspect of performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the target speech-related data to obtain the noise-reduced speech signal of the target user, the noise reduction unit is specifically configured to:
    perform time-frequency transformation on the first noisy speech signal and the in-ear sound signal respectively, to obtain a first frequency-domain signal of the first noisy speech signal and a fifth frequency-domain signal of the in-ear sound signal;
    obtain a covariance matrix of the first noisy speech signal and the in-ear sound signal according to the VPU signal of the target user, the first frequency-domain signal, and the fifth frequency-domain signal;
    obtain a first minimum variance distortionless response (MVDR) weight based on the covariance matrix;
    obtain a sixth frequency-domain signal of the first noisy speech signal and a seventh frequency-domain signal of the in-ear sound signal based on the first MVDR weight, the first frequency-domain signal, and the fifth frequency-domain signal;
    obtain an eighth frequency-domain signal of the noise-reduced speech signal according to the sixth frequency-domain signal and the seventh frequency-domain signal; and
    perform frequency-time transformation on the eighth frequency-domain signal to obtain the noise-reduced speech signal of the target user.
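The MVDR weight in claim 51 is conventionally computed per frequency bin as w = R⁻¹d / (dᴴR⁻¹d), where R is the covariance matrix and d a steering vector. The claims do not specify how R and d are formed from the VPU and in-ear signals, so the following two-channel sketch (with the 2×2 inverse written out in plain Python complex arithmetic) is illustrative only:

```python
def mvdr_weights(R, d):
    """MVDR: w = R^-1 d / (d^H R^-1 d) for a 2x2 covariance matrix R
    and steering vector d (plain Python complex numbers)."""
    (a, b), (c, e) = R
    det = a * e - b * c                      # 2x2 determinant
    r_inv = [[e / det, -b / det],
             [-c / det, a / det]]            # explicit 2x2 inverse
    rd = [r_inv[0][0] * d[0] + r_inv[0][1] * d[1],
          r_inv[1][0] * d[0] + r_inv[1][1] * d[1]]
    denom = d[0].conjugate() * rd[0] + d[1].conjugate() * rd[1]
    return [x / denom for x in rd]

# With an identity covariance and steering vector [1, 0], the weights
# reduce to the steering vector itself.
w = mvdr_weights([[1 + 0j, 0j], [0j, 1 + 0j]], [1 + 0j, 0j])  # -> [1+0j, 0j]
```

The denominator enforces the distortionless constraint wᴴd = 1, so the target direction passes with unit gain while correlated noise is minimized.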
  52. The terminal device according to claim 51, wherein the noise reduction unit is further configured to:
    obtain the interference noise signal from the first noisy speech signal according to the noise-reduced speech signal of the target user.
  53. The terminal device according to claim 38, wherein the related data of the target user A comprises a VPU signal of the target user A, and the obtaining unit is further configured to obtain an in-ear sound signal of the target user A;
    in the aspect of performing noise reduction processing on the first noisy speech signal through the speech noise reduction model according to the speech-related data of the target user A to obtain the noise-reduced speech signal of the target user A, the noise reduction unit is specifically configured to:
    perform time-frequency transformation on the first noisy speech signal and the in-ear sound signal of the target user A respectively, to obtain a first frequency-domain signal of the first noisy speech signal and a fifteenth frequency-domain signal of the in-ear sound signal of the target user A;
    obtain a covariance matrix of the first noisy speech signal and the in-ear sound signal of the target user A according to the VPU signal of the target user A, the first frequency-domain signal, and the fifteenth frequency-domain signal;
    obtain a second MVDR weight based on the covariance matrix;
    obtain a sixteenth frequency-domain signal of the first noisy speech signal and a seventeenth frequency-domain signal of the in-ear sound signal of the target user A based on the second MVDR weight, the first frequency-domain signal, and the fifteenth frequency-domain signal; obtain an eighteenth frequency-domain signal of the noise-reduced speech signal of the target user A according to the sixteenth frequency-domain signal and the seventeenth frequency-domain signal; and
    perform frequency-time transformation on the eighteenth frequency-domain signal to obtain the noise-reduced speech signal of the target user A.
  54. The terminal device according to any one of claims 41-45, wherein the obtaining unit is further configured to:
    obtain a first noise segment and a second noise segment of the environment where the terminal device is located, the first noise segment and the second noise segment being temporally consecutive noise segments; and obtain a signal-to-noise ratio (SNR) and a sound pressure level (SPL) of the first noise segment;
    the terminal device further comprises:
    a determining unit, configured to: if the SNR of the first noise segment is greater than a first threshold and the SPL of the first noise segment is greater than a second threshold, extract a first temporary feature vector of the first noise segment; perform noise reduction processing on the second noise segment based on the first temporary feature vector, to obtain a second noise-reduced noise segment; perform impairment assessment based on the second noise-reduced noise segment and the second noise segment, to obtain a first impairment score; and if the first impairment score is not greater than a third threshold, enter the PNR mode;
    in the aspect of obtaining the first noisy speech signal, the obtaining unit is specifically configured to:
    determine the first noisy speech signal from a noise signal generated after the first noise segment, wherein the feature vector of the registered speech signal comprises the first temporary feature vector.
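The gating logic of claim 54 can be sketched as a two-stage decision: first check that the noise segment is usable (SNR and SPL above thresholds), then check that denoising with the extracted temporary feature vector does not impair the signal too much. All threshold values below are invented placeholders, not values from the claims:

```python
def should_enter_pnr(snr, spl, impairment_score,
                     snr_th=20.0, spl_th=40.0, impairment_th=0.3):
    # Gate 1: the first noise segment must be clean enough (SNR) and loud
    # enough (SPL) to extract a reliable temporary feature vector.
    if snr <= snr_th or spl <= spl_th:
        return False
    # Gate 2: denoising the second segment with that vector must score at
    # or below the impairment threshold.
    return impairment_score <= impairment_th

should_enter_pnr(25.0, 50.0, 0.1)   # -> True
should_enter_pnr(25.0, 50.0, 0.5)   # -> False (too much impairment)
should_enter_pnr(10.0, 50.0, 0.1)   # -> False (SNR gate fails)
```

Claims 55-57 wrap the same decision in a user prompt: a passing score only makes the device *eligible* for PNR mode, and entry waits for the user's confirmation.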
  55. The terminal device according to claim 54, wherein, if the first impairment score is not greater than the third threshold, the determining unit is further configured to:
    issue first prompt information through the terminal device, the first prompt information being used to prompt whether to make the terminal device enter the PNR mode; and
    enter the PNR mode only after an operation instruction of the target user agreeing to enter the PNR mode is detected.
  56. The terminal device according to claim 54 or 55, wherein
    the obtaining unit is further configured to obtain a second noisy speech signal when it is detected that the terminal device is used again;
    the noise reduction unit is further configured to: when an SNR of the second noisy speech signal is lower than a fourth threshold, perform noise reduction processing on the second noisy speech signal according to the first temporary feature vector, to obtain a noise-reduced speech signal of the current user; and
    the determining unit is further configured to: perform impairment assessment based on the noise-reduced speech signal of the current user and the second noisy speech signal, to obtain a second impairment score; when the second impairment score is not greater than a fifth threshold, issue second prompt information through the terminal device, the second prompt information being used to prompt the current user that the terminal device can enter the PNR mode; after an operation instruction of the current user agreeing to enter the PNR mode is detected, make the terminal device enter the PNR mode to perform noise reduction processing on a third noisy speech signal, the third noisy speech signal being obtained after the second noisy speech signal; and after an operation instruction of the current user disagreeing to enter the PNR mode is detected, perform noise reduction processing on the third noisy speech signal in a non-PNR mode.
  57. The terminal device according to claim 54 or 55, wherein
    the obtaining unit is further configured to: if the SNR of the first noise segment is not greater than the first threshold or the SPL of the first noise segment is not greater than the second threshold, and the terminal device has stored a reference temporary voiceprint feature vector, obtain a third noise segment;
    the noise reduction unit is further configured to perform noise reduction processing on the third noise segment according to the reference temporary voiceprint feature vector, to obtain a third noise-reduced noise segment; and
    the determining unit is further configured to: perform impairment assessment according to the third noise segment and the third noise-reduced noise segment, to obtain a third impairment score; if the third impairment score is greater than a sixth threshold and an SNR of the third noise segment is less than a seventh threshold, or the third impairment score is greater than an eighth threshold and the SNR of the third noise segment is not less than the seventh threshold, issue third prompt information through the terminal device, the third prompt information being used to prompt the current user that the terminal device can enter the PNR mode; after an operation instruction of the current user agreeing to enter the PNR mode is detected, make the terminal device enter the PNR mode to perform noise reduction processing on a fourth noisy speech signal; and after an operation instruction of the current user disagreeing to enter the PNR mode is detected, perform noise reduction processing on the fourth noisy speech signal in a non-PNR mode, wherein the fourth noisy speech signal is determined from a noise signal generated after the third noise segment.
  58. The terminal device according to any one of claims 41-45, wherein:
    the acquiring unit is further configured to acquire a first noise segment and a second noise segment of the environment in which the terminal device is located, where the first noise segment and the second noise segment are temporally consecutive noise segments, and to acquire a signal collected, for the environment in which the terminal device is located, by a microphone array of an auxiliary device of the terminal device;
    the terminal device further includes:
    a determining unit, configured to calculate a signal direction of arrival (DOA) and a sound pressure level (SPL) of the first noise segment from the collected signal; if the DOA of the first noise segment is greater than a ninth threshold and less than a tenth threshold, and the SPL of the first noise segment is greater than an eleventh threshold, extract a second temporary feature vector from the first noise segment, perform noise reduction processing on the second noise segment based on the second temporary feature vector to obtain a third noise-reduced noise segment, and perform damage assessment based on the third noise-reduced noise segment and the second noise segment to obtain a fourth damage score; and, if the fourth damage score is greater than a twelfth threshold, enter the PNR mode;
    in the aspect of acquiring the first noisy speech signal, the acquiring unit is specifically configured to:
    determine the first noisy speech signal from a noise signal generated after the first noise segment; and
    the feature vector of the enrollment speech signal includes the second temporary feature vector.
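The DOA/SPL gating and the damage-score check of claim 58 can be illustrated as a small sketch. This is not part of the claims: the angular window, SPL floor, and score threshold below are hypothetical placeholders standing in for the unspecified ninth through twelfth thresholds.

```python
def maybe_enter_pnr(doa_deg: float, spl_db: float, fourth_damage_score: float,
                    ninth: float = 30.0, tenth: float = 150.0,
                    eleventh: float = 60.0, twelfth: float = 0.5) -> bool:
    """Sketch of the claim-58 gating: the first noise segment qualifies
    only if its direction of arrival falls inside an angular window and
    its sound pressure level is high enough; PNR mode is then entered
    when the damage score obtained from the second (consecutive) noise
    segment exceeds a threshold. All defaults are illustrative."""
    # Gate on the first segment's spatial and level statistics.
    if not (ninth < doa_deg < tenth and spl_db > eleventh):
        return False
    # Gate passed: decide on the damage score from the second segment.
    return fourth_damage_score > twelfth
```

Note the two-stage structure: a cheap spatial/level test decides whether it is even worth extracting the temporary feature vector, and only then does the damage score decide entry into PNR mode.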
  59. The terminal device according to claim 58, wherein, if the fourth damage score is not greater than the twelfth threshold, the determining unit is further configured to:
    send fourth prompt information through the terminal device, where the fourth prompt information is used to prompt whether to cause the terminal device to enter the PNR mode; and
    enter the PNR mode only after detecting an operation instruction by which the target user agrees to enter the PNR mode.
  60. The terminal device according to any one of claims 34-53, wherein the terminal device further includes:
    a detection unit, configured not to enter the PNR mode when it is detected that the terminal device is in a handheld call state;
    to enter the PNR mode when it is detected that the terminal device is in a hands-free call state, where the target user is the owner of the terminal device or the user who is currently using the terminal device;
    to enter the PNR mode when it is detected that the terminal device is in a video call state, where the target user is the owner of the terminal device or the user closest to the terminal device;
    to enter the PNR mode when it is detected that the terminal device is connected to an earphone for a call, where the target user is the user wearing the earphone, and the first noisy speech signal and the target voice-related data are collected through the earphone; or
    to enter the PNR mode when it is detected that the terminal device is connected to a smart large-screen device, a smart watch, or a vehicle-mounted device, where the target user is the owner of the terminal device or the user who is currently using the terminal device, and the first noisy speech signal and the target voice-related data are collected by audio collection hardware of the smart large-screen device, the smart watch, or the vehicle-mounted device.
  61. The terminal device according to any one of claims 34-53, wherein the acquiring unit is further configured to acquire a decibel value of an audio signal of the current environment; and
    the terminal device further includes:
    a control unit, configured to, if the decibel value of the audio signal of the current environment exceeds a preset decibel value and the PNR function corresponding to an application started on the terminal device is not enabled, enable the PNR function corresponding to the application started on the terminal device and enter the PNR mode.
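The auto-enable rule of claim 61 can be illustrated as follows; this is not claim language, and the 70 dB preset and the function name are hypothetical.

```python
def auto_enable_pnr(env_db: float, pnr_enabled: bool,
                    preset_db: float = 70.0) -> bool:
    """Sketch of claim 61: if the ambient audio level exceeds a preset
    decibel value while the started application's PNR function is off,
    enable it. Returns the new enabled state; the 70 dB preset is an
    illustrative placeholder, not a value from the application."""
    if env_db > preset_db and not pnr_enabled:
        return True  # loud environment: switch the PNR function on
    return pnr_enabled  # otherwise leave the current state unchanged
```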
  62. The terminal device according to any one of claims 34-53, wherein the terminal device includes a display screen, and the display screen includes a plurality of display areas,
    where each of the plurality of display areas displays a label and a corresponding function button, and the function button is used to control enabling and disabling of the PNR function of the application indicated by the corresponding label.
  63. The terminal device according to any one of claims 34-53, wherein, when voice data is transmitted between the terminal device and another terminal device, the terminal device further includes:
    a receiving unit, configured to receive a voice enhancement request sent by the other terminal device, where the voice enhancement request is used to instruct the terminal device to enable the PNR function of the call function;
    a control unit, configured to, in response to the voice enhancement request, send third prompt information through the terminal device, where the third prompt information is used to prompt whether to cause the terminal device to enable the PNR function of the call function, and, after detecting that the target user confirms enabling the PNR function of the call function on the terminal device, enable the PNR function of the call function and enter the PNR mode; and
    a sending unit, configured to send a voice enhancement response message to the other terminal device, where the voice enhancement response message is used to indicate that the terminal device has enabled the PNR function of the call function.
  64. The terminal device according to any one of claims 38-40, 43-45, and 50, wherein, when the terminal device starts a video call or video recording function, a display interface of the terminal device includes a first area and a second area, the first area is used to display video call content or video recording content, the second area is used to display M controls and M corresponding labels, and the M controls are in one-to-one correspondence with M target users, where each of the M controls includes a slider button and a slider bar, and the slider button is controlled to slide on the slider bar to adjust the voice enhancement coefficient of the target user indicated by the label corresponding to that control.
  65. The terminal device according to any one of claims 38-40, 43-45, and 50, wherein, when the terminal device starts a video call or video recording function, a display interface of the terminal device includes a first area, and the first area is used to display video call content or video recording content; and the terminal device further includes:
    a control unit, configured to, when an operation on any object in the video call content or video recording content is detected, display a control corresponding to that object in the first area, where the control includes a slider button and a slider bar, and the slider button is controlled to slide on the slider bar to adjust the voice enhancement coefficient of that object.
  66. The terminal device according to any one of claims 34-37 and 41, wherein, when the terminal device is an intelligent interactive device, the target voice-related data includes a speech signal containing a wake-up word, and the first noisy speech signal includes an audio signal containing a command word.
  67. A terminal device, comprising a processor and a memory, wherein the processor is connected to the memory, the memory is configured to store program code, and the processor is configured to call the program code to perform the method according to any one of claims 1-33.
  68. A chip system, wherein the chip system is applied to an electronic device; the chip system includes one or more interface circuits and one or more processors; the interface circuits and the processors are interconnected through lines; the interface circuit is configured to receive a signal from a memory of the electronic device and send the signal to the processor, where the signal includes computer instructions stored in the memory; and, when the processor executes the computer instructions, the electronic device performs the method according to any one of claims 1-33.
  69. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement the method according to any one of claims 1-33.
PCT/CN2022/093969 2021-05-31 2022-05-19 Speech enhancement method and related device WO2022253003A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202280038999.1A CN117480554A (en) 2021-05-31 2022-05-19 Voice enhancement method and related equipment
US18/522,743 US20240096343A1 (en) 2021-05-31 2023-11-29 Voice quality enhancement method and related device

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
CN202110611024 2021-05-31
CN202110611024.0 2021-05-31
CN202110694849.3 2021-06-22
CN202110694849 2021-06-22
CN202111323211.5 2021-11-09
CN202111323211.5A CN115482830B (en) 2021-05-31 2021-11-09 Voice enhancement method and related equipment

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/522,743 Continuation US20240096343A1 (en) 2021-05-31 2023-11-29 Voice quality enhancement method and related device

Publications (1)

Publication Number Publication Date
WO2022253003A1 true WO2022253003A1 (en) 2022-12-08

Family

ID=84322772

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/093969 WO2022253003A1 (en) 2021-05-31 2022-05-19 Speech enhancement method and related device

Country Status (3)

Country Link
US (1) US20240096343A1 (en)
CN (1) CN117480554A (en)
WO (1) WO2022253003A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103971696A (en) * 2013-01-30 2014-08-06 华为终端有限公司 Method, device and terminal equipment for processing voice
CN108346433A (en) * 2017-12-28 2018-07-31 北京搜狗科技发展有限公司 A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing
CN110491407A (en) * 2019-08-15 2019-11-22 广州华多网络科技有限公司 Method, apparatus, electronic equipment and the storage medium of voice de-noising
CN110503968A (en) * 2018-05-18 2019-11-26 北京搜狗科技发展有限公司 A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing
US20210074282A1 (en) * 2019-09-11 2021-03-11 Massachusetts Institute Of Technology Systems and methods for improving model-based speech enhancement with neural networks
CN112700786A (en) * 2020-12-29 2021-04-23 西安讯飞超脑信息科技有限公司 Voice enhancement method, device, electronic equipment and storage medium
CN112767960A (en) * 2021-02-05 2021-05-07 云从科技集团股份有限公司 Audio noise reduction method, system, device and medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023249786A1 (en) * 2022-06-24 2023-12-28 Microsoft Technology Licensing, Llc Distributed teleconferencing using personalized enhancement models
CN116229986A (en) * 2023-05-05 2023-06-06 北京远鉴信息技术有限公司 Voice noise reduction method and device for voiceprint identification task
CN116229986B (en) * 2023-05-05 2023-07-21 北京远鉴信息技术有限公司 Voice noise reduction method and device for voiceprint identification task

Also Published As

Publication number Publication date
CN117480554A (en) 2024-01-30
US20240096343A1 (en) 2024-03-21

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 22815050; Country of ref document: EP; Kind code of ref document: A1)
WWE Wipo information: entry into national phase (Ref document number: 202280038999.1; Country of ref document: CN)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 22815050; Country of ref document: EP; Kind code of ref document: A1)