CN114067824A - Voice enhancement method and system fusing ultrasonic signal characteristics
- Publication number: CN114067824A
- Application number: CN202111316293.0A
- Authority: CN (China)
- Prior art keywords: voice, signal, complex, time, ultrasonic
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L21/0208: Noise filtering (speech enhancement, e.g. noise reduction or echo cancellation)
- G10L21/0216: Noise filtering characterised by the method used for estimating noise
- G10L21/0264: Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
- G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the analysis technique using neural networks
- H04L25/0212: Channel estimation of impulse response
- H04L25/0224: Channel estimation using sounding signals
- H04L25/0226: Channel estimation using sounding signals; sounding signals per se
Abstract
The invention discloses a voice enhancement method and system fusing ultrasonic signal characteristics. An ultrasonic signal is predefined, and the loudspeaker and microphone of the device actively transmit and receive it; channel estimation then yields the channel impulse response, which reflects the motion of the user's facial vocal organs while speaking and is input into a neural network, together with the voice, as supplementary modal information to realize voice enhancement. The invention fully utilizes the user's voice production actions to assist the voice enhancement task, improves the voice enhancement effect, and has broad application prospects.
Description
Technical Field
The invention belongs to the technical field of sound processing, and particularly relates to a voice enhancement method and system fusing ultrasonic signal characteristics.
Background
Speech is inevitably interfered with by the surrounding environment during transmission, and this interference can seriously degrade the received quality: the received signal is no longer the original clean speech but speech mixed with various interference noises. Such noise reduces speech quality and intelligibility, impairing both a listener's perception and the accuracy of speech recognition. The influence of the environment therefore needs to be reduced through speech enhancement technology so that speech quality and intelligibility are improved.
Among various speech enhancement methods, speech enhancement based on deep neural networks has attracted wide attention from researchers because of its excellent effect. Introducing multi-modal information on top of the deep neural network further improves enhancement performance. However, multi-modal speech enhancement mainly exploits two modalities, sound and vision; visual information is affected by lighting conditions, requires the support of a visual sensor, and raises privacy concerns.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a method and system for speech enhancement fusing ultrasonic signal features, which fully utilize the user's voice production actions while speaking, require no support from visual information, and effectively improve the speech enhancement effect.
The invention adopts the following technical scheme:
a speech enhancement method fusing ultrasonic signal features is characterized by comprising the following steps:
s1, the mobile device simultaneously transmits and receives the predefined ultrasonic signals, and sends the user voice while transmitting the ultrasonic signals; the mobile device receives an ultrasonic signal and user voice depicting a voice production action; the ultrasonic signal is used for channel estimation to extract a channel impact response matrix, the channel impact response matrix is subjected to bit-by-bit first-order difference calculation along a time axis to obtain first-order difference channel impact response, the user voice is subjected to down-sampling, and a time-frequency spectrogram of the voice signal is obtained through short-time Fourier transform;
and S2, respectively inputting the first-order difference channel impact response obtained in the step S1 and the time-frequency spectrogram of the voice signal into a deep complex neural network, predicting a complex value ratio mask by the deep complex neural network, multiplying the time-frequency spectrogram of the voice signal and the predicted complex value ratio mask bit by bit to obtain a time-frequency spectrogram of enhanced voice, and then obtaining the enhanced time-sequence voice signal through inverse short-time Fourier transform to realize voice enhancement.
Specifically, step S1 includes:
S101, a 26-bit GSM sequence is selected as the base sequence of the transmitting-end ultrasonic signal; each ultrasonic frame contains 480 sampling points, so at an ultrasonic sampling rate of 48 KHz the system calculates the channel impulse response at a frequency of 100 Hz. The GSM training sequence is up-sampled by a factor of 12 and 168 zero values are appended at the end as guard bits to form one training sequence; finally, the transmitted signal is multiplied by the carrier cos(2πf_c·t) to up-convert it to the ultrasonic band, band-pass filtering keeps the ultrasonic signal within 18-22 KHz, and sampling and zero-padding interpolation according to the GSM sequence generate the data of a training-sequence frame, yielding the cyclic training sequence matrix M;
S102, at the receiving end, the received signal is divided into two parts by filters: the noisy user voice is obtained through a low-pass filter with a cut-off frequency of 8 KHz; the user voice part is down-sampled from 48 KHz to 16 KHz and then transformed by a short-time Fourier transform with a 20-millisecond Hamming window and a 10-millisecond hop length, yielding the time-frequency spectrogram of the voice signal that serves as the input of the speech branch of the neural network;
S103, a bit-wise first-order difference of the channel impulse response matrix is computed along the time axis to obtain the dCIR, which is then used as the input of the ultrasonic branch of the neural network.
Further, in step S101, the cyclic training sequence matrix M is formed from the data part D = {m_1, m_2, …, m_P} of the training sequence, arranged cyclically so that each column of M is a cyclic shift of D, where P is the length of the data part in the training sequence.
Further, in step S102, the received signal passes through a high-pass filter with a cut-off frequency of 18 KHz to obtain the ultrasonic part of the signal; frame detection is first performed to align the transmitted and received signals, and the high-pass-filtered ultrasonic received signal is then multiplied by cos(2πf_c·t) and -sin(2πf_c·t) to serve as the real and imaginary parts of the received baseband signal r(t), after which out-of-band noise is removed by a low-pass filter with a cut-off frequency of 2 KHz.
Further, the channel impulse response is calculated by the least-squares channel estimation algorithm to obtain the 70 × 100 complex CIR matrix h as follows:
h = argmin_h ‖R − Mh‖²
where R is the received signal and h is the channel impulse response.
Specifically, in step S2, the time-frequency spectrogram of the enhanced speech is:
Ŝ = M_CRM ⊙ N
where N is the time-frequency spectrum of the noisy speech and ⊙ denotes the bit-wise product. The predicted complex ratio mask M_CRM is:
M_CRM = (N_r·S_r + N_i·S_i)/(N_r² + N_i²) + j·(N_r·S_i − N_i·S_r)/(N_r² + N_i²)
where N_r, N_i are the real and imaginary parts of the time-frequency spectrogram of the noisy speech; S_r, S_i are the real and imaginary parts of the clean speech; the first term is the real part of the CRM and the second its imaginary part; j is the imaginary unit; and the subscripts r and i denote the real and imaginary parts of a complex quantity.
Specifically, in step S2, the neural network adopts a complex encoder-decoder structure with a complex LSTM added between encoder and decoder; the encoder extracts high-dimensional features from the input noisy-speech spectrogram and the first-order difference channel impulse response; each encoding block contains a separate branch for the noisy-speech spectrogram and for the first-order difference channel impulse response, and each branch contains a complex two-dimensional convolution, a complex batch normalization and a Leaky CReLU activation; the encoded high-dimensional audio and first-order difference channel impulse response features are input into an interaction module for converting and sharing information; the decoder then reconstructs the low-resolution features to the original input scale using complex deconvolution, skip connections are used between encoder and decoder, a user-authentication branch is added at the end of the ultrasonic branch of the encoder, and the user is predicted after a fully connected layer.
Further, the interaction module is specifically:
The two input branch features F1 and F2 are concatenated and fed into a complex convolution, a complex batch normalization and a complex Sigmoid activation that predict the valid features of F2 to be retained. The new feature of F1 is expressed as F1' = F1 + F2 ⊙ H(Concat(F1, F2)), and the new feature of F2 as F2' = F2 + F1 ⊙ H(Concat(F2, F1)), where H(·) denotes the combined operation of complex convolution, complex batch normalization and complex Sigmoid activation.
Further, the loss function L of the neural network is:
L = α·L_SISDR + β·L_s
where L_SISDR is the scale-invariant signal-to-noise-ratio loss function, L_s is the cross-entropy loss function for user prediction, and α and β are hyper-parameters.
Another technical solution of the present invention is a speech enhancement system fusing ultrasonic signal characteristics, including:
the input module: the mobile device simultaneously transmits and receives the predefined ultrasonic signal while the user voice is transmitted; the mobile device receives the ultrasonic signal depicting the voice production action together with the user voice; the ultrasonic part is used for channel estimation to extract the channel impulse response matrix, a bit-wise first-order difference of the matrix is computed along the time axis to obtain the first-order difference channel impulse response, and the low-pass-filtered user voice part is down-sampled and transformed by short-time Fourier transform to obtain the time-frequency spectrogram of the voice signal;
and the enhancement module, which respectively inputs the first-order difference channel impulse response obtained by the input module and the time-frequency spectrogram of the voice signal into the deep complex neural network; the network predicts a complex ratio mask, the time-frequency spectrogram of the voice signal is multiplied bit by bit with the predicted mask to obtain the time-frequency spectrogram of the enhanced voice, and the enhanced time-domain voice signal is then obtained through inverse short-time Fourier transform, realizing voice enhancement.
Compared with the prior art, the invention has at least the following beneficial effects:
the invention discloses a voice enhancement method fusing ultrasonic signal characteristics, which comprises the steps of predefining ultrasonic signals, actively transmitting and receiving the ultrasonic signals by using a loudspeaker and a microphone which are arranged on the equipment, carrying out Channel estimation to obtain Channel Impulse Response (CIR) which is used for reflecting the motion characteristics of face vocal organs (such as lips, chin and tongue) when a user speaks, and inputting the CIR into a neural network as supplementary modal information of voice to realize voice enhancement. In order to process the amplitude and phase of a plurality of voice signals in a time-frequency domain simultaneously so as to utilize all information of the voice signals, the invention adopts a plurality of neural networks, the internal operation of the neural networks complies with a plurality of algorithms, the network output is a Complex Ratio Mask (CRM), the CRM is multiplied with a time-frequency spectrum of input voice to obtain a predicted voice time-frequency spectrum, the voice time-frequency spectrum is further subjected to inverse short-time Fourier transform to obtain enhanced voice, the first-order difference channel impulse response can effectively reflect the motion characteristics of facial organs when a user speaks, and the first-order difference channel impulse response can be used as available redundant information to realize the improvement of the voice enhancement task effect.
Further, in order to estimate the channel impulse response, the transmitted signal and the received signal need to be known, and in order to ensure the synchronization of the received signal and the transmitted signal, a 26-bit GSM sequence with good autocorrelation is used as a base sequence. In order to make it of sufficient length, it is first up-sampled by a factor of 12 and then zero-filled at the end to prevent the echo of the current frame from mixing with the next frame. The signal is modulated to a frequency band of 18-22 KHz to ensure that the sound signal at the emitting position cannot be sensed by human ears. Dividing a received signal into two parts at a receiving end, wherein one part is an ultrasonic part above 18KHz and is used for channel estimation and calculation of channel impulse response; the other part is the noisy speech part as input, and in order to reduce the data complexity, the sampling rate is reduced from 48KHz to 16 KHz.
Further, the cyclic training matrix M represents data of a frame of training signal sequence as a transmission signal in channel estimation.
Furthermore, a low-pass filter with a cut-off frequency of 2KHz is used for eliminating out-of-band noise, and sound signals outside an ultrasonic frequency band are prevented from influencing channel estimation.
Furthermore, the least square channel estimation algorithm can effectively calculate the time-varying characteristics of the channel, has simple structure and low calculation complexity,
furthermore, compared with the method for directly predicting the voice signal on the time domain, the method for the time-frequency domain considers the information of the voice time domain and the voice frequency at the same time, and the prediction result is more accurate. The method for predicting the complex ratio mask can predict the amplitude and the phase of the time-frequency domain voice signal at the same time, and can realize almost lossless reconstruction of pure voice.
Furthermore, because the input of the network is dCIR and time-frequency domain voice signals which are both complex matrixes, a complex coding and decoding structure is adopted, the real part and the imaginary part of the input complex signals can be effectively modeled, and the real part and the imaginary part of the signals can be considered at the same time in the output result, so that the prediction result is more reliable.
Further, the actions and speech of the user as they utter are naturally related, and therefore features learned in one modality may be used to supplement features of another modality. The interaction module can realize information exchange between the voice branch and the ultrasonic branch, and can recover some lost features or delete some unnecessary features in the other branch by using the feature information of one branch.
Further, LSISDRThe loss function may reflect the difference between the predicted signal and the clean signal. L of ultrasonic branchSThe loss function is used for user prediction, and the prior knowledge of a target speaker can be utilized to improve the voice enhancement effect.
In conclusion, the method is suitable for commercial mobile equipment equipped with a loudspeaker and a microphone, can effectively utilize the correlation between the real part and the imaginary part of a complex signal, and improves the voice enhancement effect.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a system diagram of the present invention;
FIG. 2 is a diagram of ultrasonic signal transceiving according to the present invention;
FIG. 3 is a diagram of a network architecture according to the present invention;
FIG. 4 is a diagram of parameters of a network architecture of the present invention, wherein (a) is the structural parameters of an ultrasonic branch encoder and (b) is the structural parameters of a speech branch encoder (up) and a decoder (down);
FIG. 5 is a block diagram of an interaction module of the present invention;
FIG. 6 is a schematic diagram illustrating the operation of complex convolution according to the present invention;
FIG. 7 is a flow chart of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be understood that the terms "comprises" and/or "comprising" indicate the presence of the stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Various structural schematics according to the disclosed embodiments of the invention are shown in the drawings. The figures are not drawn to scale, wherein certain details are exaggerated and possibly omitted for clarity of presentation. The shapes of various regions, layers and their relative sizes and positional relationships shown in the drawings are merely exemplary, and deviations may occur in practice due to manufacturing tolerances or technical limitations, and a person skilled in the art may additionally design regions/layers having different shapes, sizes, relative positions, according to actual needs.
The invention provides a voice enhancement method fusing ultrasonic signal characteristics, suitable for commercial mobile devices equipped with a loudspeaker and a microphone. An ultrasonic signal is predefined, the loudspeaker and microphone of the device actively transmit and receive it, and channel estimation yields the Channel Impulse Response (CIR), which reflects the motion of the facial vocal organs (such as the lips, chin and tongue) while the user speaks; the CIR is input into the neural network as supplementary modal information for the voice to realize voice enhancement. To process the amplitude and phase of the complex voice signal in the time-frequency domain simultaneously, and thereby use all of its information, the invention adopts a complex neural network whose internal operations comply with complex arithmetic; the network output is a Complex Ratio Mask (CRM), which is multiplied with the time-frequency spectrum of the input voice to obtain the predicted voice spectrum, and an inverse short-time Fourier transform then yields the enhanced voice. The invention fully utilizes the user's voice production actions to assist the voice enhancement task, improves the voice enhancement effect, and has broad application prospects.
Referring to fig. 1 and 7, the voice enhancement method fusing ultrasonic signal features of the present invention comprises an ultrasonic signal transceiver module and a deep network prediction module, and includes the following steps:
S1, a commercial mobile device (such as a smartphone) with a loudspeaker and a microphone simultaneously transmits and receives the predefined ultrasonic signal, and the user speaks while the ultrasonic signal is transmitted. The signal received by the microphone is divided into two parts: one part is the ultrasonic signal depicting the voice production action, obtained by a high-pass filter with a cut-off frequency of 18 KHz; the other part is the user voice, obtained by a low-pass filter with a cut-off frequency of 8 KHz. The ultrasonic part is used for channel estimation to extract the Channel Impulse Response (CIR), which serves as auxiliary information reflecting the voice production action to denoise the voice part;
referring to fig. 2, the ultrasound signal transceiver module specifically includes:
s101, selecting 26-bit GSM sequence as basic sequence for ultrasonic signal of transmitting end, the number of sampling points of each ultrasonic signal is 480, when sampling rate of ultrasonic wave is 48KHz, the system can calculate CIR with 100Hz frequency, sampling GSM sequence by 12 times, then adding 168-bit zero value as protection bit at end to form a training sequence, finally multiplying transmitting signal by protection bitUp-conversion to the ultrasonic frequency band (f)c20KHz), and simultaneously carrying out band-pass filtering to keep the ultrasonic signal at 18-22 KHz;
s102, at a receiving end, dividing a received ultrasonic signal into two parts through a filter: the method comprises the steps of obtaining a user voice with noise through a low-pass filter with the cut-off frequency of 8KHz, down-sampling a user voice part from 48KHz to 16KHz to reduce the complexity of a model, and then carrying out short-time Fourier transform (STFT) with a Hamming window of 20 milliseconds and a jump length of 10 milliseconds to obtain a time-frequency spectrogram of a voice signal to be used as the input of a neural network voice branch.
The received signal is passed through a high-pass filter with a cut-off frequency of 18 KHz to obtain the ultrasonic part. At the receiving end, exploiting the good autocorrelation of the GSM training sequence, frame detection is first performed to align the transmitted and received signals; the high-pass-filtered ultrasonic received signal is then multiplied by cos(2πf_c·t) and -sin(2πf_c·t) to form the real and imaginary parts of the received baseband signal r(t), after which out-of-band noise is removed by a low-pass filter with a cut-off frequency of 2 KHz.
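The quadrature down-conversion can be sketched as below; frame alignment is omitted and the Butterworth filter orders are assumptions:

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs, fc = 48_000, 20_000
rx = np.random.randn(fs)                        # stand-in for 1 s of microphone samples

b_hp, a_hp = butter(8, 18_000 / (fs / 2), btype="high")
ultra = filtfilt(b_hp, a_hp, rx)                # ultrasonic part (> 18 kHz)

t = np.arange(ultra.size) / fs
i_mix = ultra * np.cos(2 * np.pi * fc * t)      # real-part mixer
q_mix = ultra * -np.sin(2 * np.pi * fc * t)     # imaginary-part mixer

b_lp, a_lp = butter(8, 2_000 / (fs / 2), btype="low")
r = filtfilt(b_lp, a_lp, i_mix) + 1j * filtfilt(b_lp, a_lp, q_mix)  # baseband r(t)
```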
The cyclic training sequence matrix M is obtained by transforming the data part D = {m_1, m_2, …, m_P} of the training sequence, where P is the length of the data part in the training sequence, P + N = 312, and N = 70.
The channel impulse response h is obtained by the least-squares channel estimation algorithm:
h = argmin_h ‖R − Mh‖²
where R is the received signal and h is the channel impulse response.
The closed-form solution h_LS = (M^H M)^(-1) M^H R gives the channel impulse response; by design, a 70 × 100 complex CIR matrix h is obtained per second.
S103, a bit-wise first-order difference of the CIR matrix is computed along the time axis to obtain the first-order difference channel impulse response dCIR, which effectively eliminates reflections from static objects in the environment and highlights the ultrasonic signature of mouth movement; the dCIR is then used as the input of the ultrasonic branch of the neural network.
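A sketch of the per-second estimation and differencing, with random stand-ins for the cyclic training matrix and the received frames (P = 242 follows from P + N = 312 with N = 70):

```python
import numpy as np

P, N_taps = 242, 70                             # P + N = 312, N = 70 as in the text
rng = np.random.default_rng(0)
M = rng.standard_normal((P, N_taps)) + 1j * rng.standard_normal((P, N_taps))  # stand-in training matrix
R = rng.standard_normal((P, 100)) + 1j * rng.standard_normal((P, 100))        # 100 received frames per second

# least-squares estimate h_LS = (M^H M)^(-1) M^H R; lstsq is the numerically
# safer equivalent of the explicit normal-equation inverse
h, *_ = np.linalg.lstsq(M, R, rcond=None)       # h: 70 x 100 complex CIR matrix

dCIR = np.diff(h, axis=1)                       # bit-wise first-order difference along time
```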
S2, in the deep network prediction module, the dCIR and the voice time-frequency spectrogram are respectively input into the deep complex neural network, which predicts a complex ratio mask (CRM); the voice time-frequency spectrogram is multiplied bit by bit with the mask to obtain the enhanced time-frequency spectrogram, and the enhanced time-domain voice signal is then obtained through inverse short-time Fourier transform.
The neural network realizes speech enhancement by predicting the Complex Ratio Mask (CRM), the ratio between the time-frequency spectra of clean and noisy speech, defined as:
M_CRM = (N_r·S_r + N_i·S_i)/(N_r² + N_i²) + j·(N_r·S_i − N_i·S_r)/(N_r² + N_i²)
where N_r, N_i are the real and imaginary parts of the time-frequency spectrogram of the noisy speech and S_r, S_i are the real and imaginary parts of the clean speech. Furthermore, to narrow the search space and ease optimization, the amplitude of the CRM is limited to [0, 1) using the hyperbolic tangent function, and the phase of the CRM follows accordingly:
M_CRM = tanh(|D|)·e^(jθ_D)
where D is the output of the network and θ_D is the phase of D.
The finally estimated clean-speech time-frequency spectrogram is calculated as:
Ŝ = M_CRM ⊙ N
where N is the time-frequency spectrogram of the noisy speech and ⊙ denotes the bit-wise product.
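A sketch of the mask target, the tanh bounding, and the mask application, with random stand-ins for the spectrograms and the network output D:

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (161, 100)                              # (frequency bins, frames), illustrative
noisy = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)   # N
clean = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)   # S
D = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)       # network output

# training target: CRM = S / N written via real and imaginary parts
den = noisy.real**2 + noisy.imag**2 + 1e-8
crm = ((noisy.real * clean.real + noisy.imag * clean.imag) / den
       + 1j * (noisy.real * clean.imag - noisy.imag * clean.real) / den)

# inference: bound the magnitude with tanh, keep the phase of D, apply the mask
mask = np.tanh(np.abs(D)) * np.exp(1j * np.angle(D))
enhanced = mask * noisy                         # followed by inverse STFT
```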
Referring to fig. 3, the neural network adopts a complex encoder-decoder structure, with a complex LSTM added between encoder and decoder to model the timing dependence. The encoder extracts high-dimensional features from the input noisy-speech spectrogram and the dCIR; each encoding block contains a separate branch for the noisy-speech spectrogram and for the dCIR, and each branch contains a complex two-dimensional convolution, a complex batch normalization and a Leaky CReLU activation. The network structure parameters are shown in FIG. 4.
The encoded high-dimensional audio and dCIR features are then input into the interaction module to convert and share information. The structure of the interaction module is shown in FIG. 5: the two input branch features F1 and F2 are first concatenated and then fed into a complex convolution, a complex batch normalization and a complex Sigmoid activation that predict the valid features of F2 to be retained. The new feature of F1 can be expressed as F1' = F1 + F2 ⊙ H(Concat(F1, F2)), and similarly the new feature of F2 as F2' = F2 + F1 ⊙ H(Concat(F2, F1)), where H(·) denotes the combined operation of complex convolution, complex batch normalization and complex Sigmoid activation. On the basis of the original feature, the new feature output by the interaction module effectively retains the part of the other feature that needs to be kept, realizing the fusion of the two branches' information.
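A minimal PyTorch sketch of such an interaction module follows. Complex features are carried as (real, imaginary) tensor pairs, the complex batch normalization is simplified to two real ones, the gating is applied component-wise, and the residual form F1 + gated F2 follows the description above; layer sizes are illustrative assumptions, not the patent's actual parameters.

```python
import torch
import torch.nn as nn

class ComplexConvBNSigmoid(nn.Module):
    """The H(.) operation: complex conv -> complex batch norm -> complex sigmoid."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # real kernel A
        self.conv_i = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # imaginary kernel B
        self.bn_r = nn.BatchNorm2d(out_ch)   # stand-in for complex batch norm
        self.bn_i = nn.BatchNorm2d(out_ch)

    def forward(self, xr, xi):
        # complex convolution: (A + jB) * (x + jy) = (A*x - B*y) + j(B*x + A*y)
        yr = self.conv_r(xr) - self.conv_i(xi)
        yi = self.conv_i(xr) + self.conv_r(xi)
        # elementwise sigmoid on both parts, one common reading of a "complex Sigmoid"
        return torch.sigmoid(self.bn_r(yr)), torch.sigmoid(self.bn_i(yi))

class Interaction(nn.Module):
    """Gated fusion: the other branch's features are masked and added residually."""
    def __init__(self, ch):
        super().__init__()
        self.h = ComplexConvBNSigmoid(2 * ch, ch)

    def forward(self, f1, f2):
        # f1, f2: (real, imag) tuples of shape [batch, ch, freq, time]
        mr, mi = self.h(torch.cat([f1[0], f2[0]], dim=1),
                        torch.cat([f1[1], f2[1]], dim=1))
        # new F1 = F1 + F2 (gated); the mask keeps the part of F2 worth sharing
        return f1[0] + f2[0] * mr, f1[1] + f2[1] * mi

speech = (torch.randn(2, 16, 64, 50), torch.randn(2, 16, 64, 50))
ultra = (torch.randn(2, 16, 64, 50), torch.randn(2, 16, 64, 50))
new_speech = Interaction(16)(speech, ultra)     # fuse ultrasonic cues into the speech branch
```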
The decoder then reconstructs the low-resolution features to the scale of the original input using complex deconvolution. Skip connections between encoder and decoder effectively facilitate gradient flow. At the end of the ultrasonic branch of the encoder a user-authentication branch is added, and the user is predicted after a fully connected layer, so that the user's identity information further improves the voice enhancement effect.
The operations of the complex neural network comply with complex arithmetic. Taking complex convolution as an example, if the convolution kernel is W = A + iB and the input is I = x + iy, the output F_Conv is:
F_Conv = W ∗ I = (A ∗ x − B ∗ y) + i(B ∗ x + A ∗ y)
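This rule can be checked numerically; the sketch below uses 1-D convolutions for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.standard_normal(3), rng.standard_normal(3)   # kernel W = A + iB
x, y = rng.standard_normal(8), rng.standard_normal(8)   # input  I = x + iy

real = np.convolve(A, x) - np.convolve(B, y)
imag = np.convolve(B, x) + np.convolve(A, y)

# convolving the complex sequences directly gives the same result
direct = np.convolve(A + 1j * B, x + 1j * y)
assert np.allclose(real + 1j * imag, direct)
```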
The neural network loss function is:
L = α·L_SISDR + β·L_s
where L_SISDR is the scale-invariant signal-to-noise-ratio (SI-SNR) loss function and L_s is the cross-entropy loss function for user prediction; the hyper-parameters α and β are set to 1 and 0.1, respectively.
The scale-invariant signal-to-noise-ratio loss L_SISDR is:
S_T = (⟨S′, S⟩ / ‖S‖²)·S
S_E = S′ − S_T
L_SISDR = -10·log10(‖S_T‖² / ‖S_E‖²)
where S′ is the predicted speech sequence, S is the original clean speech sequence, and S_T is the projection of S′ onto S; the SI-SNR reflects the similarity of S′ and S.
The cross-entropy loss function L_s is:
L_s = -Σ_{i=1}^{n} y_i·log(ŷ_i)
where n is the number of speakers (here 10), y_i is the true label of the i-th speaker identity, and ŷ_i is the prediction of the i-th speaker identity label.
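A sketch of the combined objective; the waveforms, logits and labels below are random stand-ins:

```python
import torch
import torch.nn.functional as F

def si_sdr_loss(s_pred, s_clean, eps=1e-8):
    # S_T: projection of the prediction onto the clean signal; S_E: residual
    s_t = (torch.sum(s_pred * s_clean) / (torch.sum(s_clean ** 2) + eps)) * s_clean
    s_e = s_pred - s_t
    si_snr = 10 * torch.log10(torch.sum(s_t ** 2) / (torch.sum(s_e ** 2) + eps))
    return -si_snr                                  # minimize the negative SI-SNR

def total_loss(s_pred, s_clean, spk_logits, spk_label, alpha=1.0, beta=0.1):
    l_sisdr = si_sdr_loss(s_pred, s_clean)
    l_s = F.cross_entropy(spk_logits, spk_label)    # user-prediction cross entropy
    return alpha * l_sisdr + beta * l_s

s_clean = torch.randn(16_000)                       # 1 s of clean speech at 16 kHz
s_pred = s_clean + 0.1 * torch.randn(16_000)        # stand-in network output
loss = total_loss(s_pred, s_clean, torch.randn(1, 10), torch.tensor([3]))
```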
In another embodiment of the present invention, a speech enhancement system for fusing ultrasonic signal features is provided, which can be used to implement the above speech enhancement method for fusing ultrasonic signal features.
the input module: the mobile device simultaneously transmits and receives the predefined ultrasonic signal while the user voice is transmitted; the mobile device receives the ultrasonic signal depicting the voice production action together with the user voice; the ultrasonic part is used for channel estimation to extract the channel impulse response matrix, a bit-wise first-order difference of the matrix is computed along the time axis to obtain the first-order difference channel impulse response, and the low-pass-filtered user voice part is down-sampled and transformed by short-time Fourier transform to obtain the time-frequency spectrogram of the voice signal;
and the enhancement module, which respectively inputs the first-order difference channel impulse response obtained by the input module and the time-frequency spectrogram of the voice signal into the deep complex neural network; the network predicts a complex ratio mask, the time-frequency spectrogram of the voice signal is multiplied bit by bit with the predicted mask to obtain the time-frequency spectrogram of the enhanced voice, and the enhanced time-domain voice signal is then obtained through inverse short-time Fourier transform, realizing voice enhancement.
In yet another embodiment of the present invention, a terminal device is provided that includes a processor and a memory for storing a computer program comprising program instructions, the processor being configured to execute the program instructions stored in the computer storage medium. The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.; it is the computing and control core of the terminal and is adapted to load and execute one or more instructions to implement the corresponding method flow or function. The processor of the embodiment of the invention can be used to run the voice enhancement method fusing ultrasonic signal characteristics, comprising:
the mobile device simultaneously transmits and receives the predefined ultrasonic signal while the user voice is transmitted; the mobile device receives the ultrasonic signal depicting the voice production action together with the user voice; the ultrasonic signal is used for channel estimation to extract the channel impulse response matrix, a bit-wise first-order difference of the matrix is computed along the time axis to obtain the first-order difference channel impulse response, and the user voice is down-sampled and transformed by short-time Fourier transform to obtain the time-frequency spectrogram of the voice signal; the first-order difference channel impulse response and the time-frequency spectrogram of the voice signal are respectively input into the deep complex neural network, which predicts a complex ratio mask; the time-frequency spectrogram of the voice signal is multiplied bit by bit with the predicted mask to obtain the time-frequency spectrogram of the enhanced voice, and the enhanced time-domain voice signal is then obtained through inverse short-time Fourier transform, realizing voice enhancement.
In still another embodiment of the present invention, the present invention further provides a storage medium, specifically a computer-readable storage medium (Memory), which is a Memory device in a terminal device and is used for storing programs and data. It is understood that the computer readable storage medium herein may include a built-in storage medium in the terminal device, and may also include an extended storage medium supported by the terminal device. The computer-readable storage medium provides a storage space storing an operating system of the terminal. Also, one or more instructions, which may be one or more computer programs (including program code), are stored in the memory space and are adapted to be loaded and executed by the processor. It should be noted that the computer-readable storage medium may be a high-speed RAM memory, or may be a non-volatile memory (non-volatile memory), such as at least one disk memory.
One or more instructions stored in the computer-readable storage medium can be loaded and executed by the processor to implement the corresponding steps of the speech enhancement method related to the fusion of the ultrasonic signal features in the above embodiments; one or more instructions in the computer-readable storage medium are loaded by the processor and perform the steps of:
the mobile device simultaneously transmits and receives the predefined ultrasonic signal while the user voice is transmitted; the mobile device receives the ultrasonic signal depicting the voice production action together with the user voice; the ultrasonic signal is used for channel estimation to extract the channel impulse response matrix, a bit-wise first-order difference of the matrix is computed along the time axis to obtain the first-order difference channel impulse response, and the user voice is down-sampled and transformed by short-time Fourier transform to obtain the time-frequency spectrogram of the voice signal; the first-order difference channel impulse response and the time-frequency spectrogram of the voice signal are respectively input into the deep complex neural network, which predicts a complex ratio mask; the time-frequency spectrogram of the voice signal is multiplied bit by bit with the predicted mask to obtain the time-frequency spectrogram of the enhanced voice, and the enhanced time-domain voice signal is then obtained through inverse short-time Fourier transform, realizing voice enhancement.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 6, a schematic diagram of the complex convolution operation is shown: with input matrix I = x + jy and complex convolution kernel W = A + jB, the output of the convolution operation can be expressed as F = W ∗ I = (A ∗ x − B ∗ y) + j(B ∗ x + A ∗ y). Complex convolution further exploits the inherent link between amplitude and phase for prediction, facilitating the speech enhancement task.
To demonstrate the improvement achieved by the invention, its enhancement of noisy speech is compared with the PHASEN system. The noisy data has an average Signal-to-Distortion Ratio (SDR) of 4.99 dB, an average Perceptual Evaluation of Speech Quality (PESQ) score of 2.18, and an average Short-Time Objective Intelligibility (STOI) of 0.77.
After the invention enhances the noisy speech, the SDR is 16.91 dB, the PESQ is 3.27, and the STOI is 0.90.
The corresponding values after enhancement by the PHASEN system are 13.34 dB, 3.15, and 0.88, respectively.
Compared with PHASEN, the invention thus improves the enhancement of noisy speech significantly.
In summary, the voice enhancement method and system fusing ultrasonic signal features of the invention are suitable for commercial mobile devices equipped with a loudspeaker and a microphone. The loudspeaker and microphone of the device actively transmit and receive ultrasonic signals; the Channel Impulse Response (CIR) obtained through channel estimation reflects the motion of the facial vocal organs (such as the lips, chin and tongue) while the user speaks, and the CIR is input into the neural network as supplementary modal information to interact with the noisy voice, improving the enhancement effect. To process the amplitude and phase of the complex speech signal in the time-frequency domain simultaneously, and thereby use all of its information, the network is a complex neural network whose internal operations comply with complex arithmetic; it can exploit the correlation between the real and imaginary parts of the complex signal, effectively improving speech quality and intelligibility.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.
Claims (10)
1. A speech enhancement method fusing ultrasonic signal features is characterized by comprising the following steps:
s1, the mobile device simultaneously transmits and receives the predefined ultrasonic signals, and sends the user voice while transmitting the ultrasonic signals; the mobile device receives an ultrasonic signal and user voice depicting a voice production action; the ultrasonic signal is used for channel estimation to extract a channel impact response matrix, the channel impact response matrix is subjected to bit-by-bit first-order difference calculation along a time axis to obtain first-order difference channel impact response, the user voice is subjected to down-sampling, and a time-frequency spectrogram of the voice signal is obtained through short-time Fourier transform;
and S2, respectively inputting the first-order difference channel impact response obtained in the step S1 and the time-frequency spectrogram of the voice signal into a deep complex neural network, predicting a complex value ratio mask by the deep complex neural network, multiplying the time-frequency spectrogram of the voice signal and the predicted complex value ratio mask bit by bit to obtain a time-frequency spectrogram of enhanced voice, and then obtaining the enhanced time-sequence voice signal through inverse short-time Fourier transform to realize voice enhancement.
2. The method for enhancing speech by fusing ultrasonic signal features according to claim 1, wherein step S1 is specifically:
s101, selecting a GSM sequence as a basic sequence for an ultrasonic signal of a transmitting end, calculating channel impact response at the frequency of 100Hz when the sampling rate of ultrasonic waves is 48KHz, performing 12 times of up-sampling on a GSM training sequence, adding 168-bit zero values at the end as protection bits to form a training sequence, and multiplying the transmitting signal by a protection bit to form a training sequenceCarrying out frequency conversion to an ultrasonic frequency band, simultaneously carrying out band-pass filtering to keep the ultrasonic signal at 18-22 KHz, and carrying out sampling and zero padding interpolation according to a GSM sequence to generate data of a training sequence frame to obtain a cyclic training sequence matrix M;
s102, at a receiving end, dividing the received ultrasonic signal into two parts through a filter: obtaining a user voice with noise through a low-pass filter with the cut-off frequency of 8KHz, down-sampling the user voice part from 48KHz to 16KHz, and then performing short-time Fourier transform on a Hamming window of 20 milliseconds and a jump length of 10 milliseconds to obtain a time-frequency spectrogram of a voice signal, wherein the time-frequency spectrogram is used as the input of a neural network voice branch;
s103, solving a first-order difference of the channel impact response matrix along a time axis bit by bit to obtain a first-order difference channel impact response, and then taking the first-order difference channel impact response as the input of the neural network ultrasonic branch.
4. The method of claim 2, wherein in step S102, the received signal is passed through a high-pass filter with a cut-off frequency of 18KHz to obtain the ultrasonic part of the signal, frame detection is performed to align the transmitted signal and the received signal, and then the ultrasonic part received signal obtained after high-pass filtering is multiplied by the ultrasonic part received signalAndas the real and imaginary parts of the received baseband signal r (t), the out-of-band noise is then removed by a low pass filter with a cut-off frequency of 2 KHz.
5. The method of claim 4, wherein the channel impulse response is calculated by the least-squares channel estimation algorithm to obtain the 70 × 100 complex CIR matrix h as follows:
h = argmin_h ||R − Mh||²
wherein R is the received signal and h is the channel impulse response.
6. The method for enhancing speech by fusing ultrasonic signal features according to claim 1, wherein in step S2 the time-frequency spectrogram of the enhanced speech is specifically:
Ŝ = M_CRM ⊙ N
where N is the time-frequency spectrum of the noisy speech, ⊙ denotes the bit-wise product, and the predicted complex ratio mask M_CRM is:
M_CRM = (N_r·S_r + N_i·S_i)/(N_r² + N_i²) + j·(N_r·S_i − N_i·S_r)/(N_r² + N_i²)
wherein N_r, N_i are the real and imaginary parts of the time-frequency spectrogram of the noisy speech; S_r, S_i are the real and imaginary parts of the clean speech; the first term is the real part of the CRM and the second its imaginary part; j is the imaginary unit; and the subscripts r and i denote the real and imaginary parts of a complex quantity.
7. The method for enhancing speech by fusing ultrasonic signal features according to claim 1, wherein in step S2 the neural network adopts a complex encoder-decoder structure with a complex LSTM added between encoder and decoder; the encoder extracts high-dimensional features from the input noisy-speech spectrogram and the first-order difference channel impulse response; each encoding block contains a separate branch for the noisy-speech spectrogram and for the first-order difference channel impulse response, and each branch contains a complex two-dimensional convolution, a complex batch normalization and a Leaky CReLU activation; the encoded high-dimensional audio and first-order difference channel impulse response features are input into an interaction module for converting and sharing information; the decoder then reconstructs the low-resolution features to the original input scale using complex deconvolution, skip connections are used between encoder and decoder, a user-authentication branch is added at the end of the ultrasonic branch of the encoder, and the user is predicted after a fully connected layer.
8. The method for enhancing speech by fusing ultrasonic signal features according to claim 7, wherein the interaction module is specifically:
the two input branch features F1 and F2 are concatenated and then fed into a complex convolution, a complex batch normalization and a complex Sigmoid activation that predict the valid features of F2 to be retained; the new feature of F1 is expressed as F1' = F1 + F2 ⊙ H(Concat(F1, F2)), and the new feature of F2 as F2' = F2 + F1 ⊙ H(Concat(F2, F1)), where H(·) denotes the combined operation of complex convolution, complex batch normalization and complex Sigmoid activation.
9. The method of claim 7, wherein the loss function L of the neural network is:
L = α·L_SISDR + β·L_s
wherein L_SISDR is the scale-invariant signal-to-noise-ratio loss function, L_s is the cross-entropy loss function for user prediction, and α and β are hyper-parameters.
10. A speech enhancement system that fuses ultrasonic signal features, comprising:
the input module: the mobile device simultaneously transmits and receives the predefined ultrasonic signal while the user voice is transmitted; the mobile device receives the ultrasonic signal depicting the voice production action together with the user voice; the ultrasonic part is used for channel estimation to extract the channel impulse response matrix, a bit-wise first-order difference of the matrix is computed along the time axis to obtain the first-order difference channel impulse response, and the low-pass-filtered user voice part is down-sampled and transformed by short-time Fourier transform to obtain the time-frequency spectrogram of the voice signal;
and the enhancement module, which respectively inputs the first-order difference channel impulse response obtained by the input module and the time-frequency spectrogram of the voice signal into the deep complex neural network; the network predicts a complex ratio mask, the time-frequency spectrogram of the voice signal is multiplied bit by bit with the predicted mask to obtain the time-frequency spectrogram of the enhanced voice, and the enhanced time-domain voice signal is then obtained through inverse short-time Fourier transform, realizing voice enhancement.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111316293.0A | 2021-11-08 | 2021-11-08 | Voice enhancement method and system fusing ultrasonic signal characteristics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111316293.0A | 2021-11-08 | 2021-11-08 | Voice enhancement method and system fusing ultrasonic signal characteristics |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114067824A | 2022-02-18 |
Family
ID=80274798
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111316293.0A Pending CN114067824A (en) | 2021-11-08 | 2021-11-08 | Voice enhancement method and system fusing ultrasonic signal characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114067824A |
- 2021-11-08: application CN202111316293.0A filed in China; published as CN114067824A, status Pending
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115560795A (en) * | 2022-12-02 | 2023-01-03 | 小米汽车科技有限公司 | Air duct blockage detection method and device suitable for charging equipment |
CN115560795B (en) * | 2022-12-02 | 2023-07-04 | 小米汽车科技有限公司 | Air duct blocking detection method and device suitable for charging equipment |
CN116366169A (en) * | 2023-06-01 | 2023-06-30 | 浙江大学 | Ultrasonic channel modeling method, electronic device and storage medium |
CN116366169B (en) * | 2023-06-01 | 2023-10-24 | 浙江大学 | Ultrasonic channel modeling method, electronic device and storage medium |
CN117765963A (en) * | 2023-11-16 | 2024-03-26 | 兴科迪科技(泰州)有限公司 | Voice frequency conversion sampling method and device for array microphone |
CN117765963B (en) * | 2023-11-16 | 2024-10-18 | 兴科迪科技(泰州)有限公司 | Voice frequency conversion sampling method and device for array microphone |
Similar Documents
Publication | Title
---|---
CN114067824A | Voice enhancement method and system fusing ultrasonic signal characteristics
CN111243620B | Voice separation model training method and device, storage medium and computer equipment
US20200098379A1 | Audio watermark encoding/decoding
US20180358003A1 | Methods and apparatus for improving speech communication and speech interface quality using neural networks
Sun et al. | UltraSE: single-channel speech enhancement using ultrasound
Zhen et al. | Cascaded cross-module residual learning towards lightweight end-to-end speech coding
Liu et al. | VoiceFixer: Toward general speech restoration with neural vocoder
TW201248613A | System and method for monaural audio processing based preserving speech information
KR20160032138A | Speech signal separation and synthesis based on auditory scene analysis and speech modeling
US20230317056A1 | Audio generator and methods for generating an audio signal and training an audio generator
US20200098380A1 | Audio watermark encoding/decoding
JP2023533364A | Stereo audio signal delay estimation method and apparatus
WO2024055752A1 | Speech synthesis model training method, speech synthesis method, and related apparatuses
CN111402917A | Audio signal processing method and device and storage medium
Wang et al. | Fusing bone-conduction and air-conduction sensors for complex-domain speech enhancement
CN110827808A | Speech recognition method, speech recognition device, electronic equipment and computer-readable storage medium
Zhao et al. | Unet++-based multi-channel speech dereverberation and distant speech recognition
US11996114B2 | End-to-end time-domain multitask learning for ML-based speech enhancement
Yechuri et al. | A nested U-net with efficient channel attention and D3Net for speech enhancement
Jannu et al. | Shuffle attention u-net for speech enhancement in time domain
KR20220048252A | Method and apparatus for encoding and decoding of audio signal using learning model and methos and apparatus for trainning the learning model
US20230306980A1 | Method and System for Audio Signal Enhancement with Reduced Latency
WO2023175198A1 | Vocoder techniques
Xu et al. | A multi-scale feature aggregation based lightweight network for audio-visual speech enhancement
Zhao et al. | Radio2Speech: High quality speech recovery from radio frequency signals
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |