CN113223543A - Speech enhancement method, apparatus and storage medium


Info

Publication number
CN113223543A
Authority
CN
China
Prior art keywords
signal
frame
sequence
signal frame
sound
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110649724.9A
Other languages
Chinese (zh)
Other versions
CN113223543B (en)
Inventor
侯海宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd, Beijing Xiaomi Pinecone Electronic Co Ltd filed Critical Beijing Xiaomi Mobile Software Co Ltd
Priority to CN202110649724.9A priority Critical patent/CN113223543B/en
Publication of CN113223543A publication Critical patent/CN113223543A/en
Application granted granted Critical
Publication of CN113223543B publication Critical patent/CN113223543B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G10L 21/0216 Noise filtering characterised by the method used for estimating noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0272 Voice signal separating
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G10L 2021/02082 Noise filtering the noise being echo, reverberation of the speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G10L 21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L 2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L 2021/02166 Microphone arrays; Beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The present disclosure relates to a speech enhancement method, apparatus, and storage medium. The method includes: acquiring a sound signal; inputting each signal frame of the sound signal into an unmixing model, where the unmixing model is a computational model for removing reverberation and echo from a signal frame that treats the direct sound, reverberation, and echo in the frame as independent sound sources, and, for any signal frame other than the first, computes the reverberation- and echo-free signal frame sequence from the signal value sequence of the current signal frame and the signal value sequences of the historical signal frames before it; and obtaining the reverberation- and echo-free target sound signal output by the unmixing model. The present disclosure improves the effectiveness of speech enhancement.

Description

Speech enhancement method, apparatus and storage medium
Technical Field
The present disclosure relates to the field of sound processing, and in particular, to a method, an apparatus, and a storage medium for speech enhancement.
Background
At present, most product devices pick up sound with a microphone array and apply microphone beamforming or blind source separation to improve the processing quality of the speech signal and the speech recognition rate in real environments.
However, when a device plays sound while recording, the played sound is picked up by its own microphones and forms an echo. The echo interferes with sound signals such as control commands issued by users, lowers the device's speech recognition rate, and degrades the interactive experience. In addition, in real living environments sound is reflected by walls, furniture, and the like, producing reverberation. Reverberation degrades beamforming and separation performance.
Existing speech enhancement systems are generally formed by connecting an echo cancellation module, a dereverberation module, and a blind source separation module in series. The modules are independent of one another and are optimized separately. Although each module can reach its own optimum, the optimality of the whole processing chain cannot be guaranteed.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a speech enhancement method, apparatus, and storage medium.
According to a first aspect of embodiments of the present disclosure, there is provided a speech enhancement method, including: acquiring a sound signal; inputting each signal frame of the sound signal into an unmixing model, where the unmixing model is a computational model for removing reverberation and echo from a signal frame that treats the direct sound, reverberation, and echo in the frame as independent sound sources, and, for any signal frame other than the first, computes the reverberation- and echo-free signal frame sequence from the signal value sequence of the current signal frame and the signal value sequences of the historical signal frames before it; and obtaining the reverberation- and echo-free target sound signal output by the unmixing model.
Optionally, the unmixing model obtains the target sound signal by: obtaining a first expected signal sequence of the current signal frame based on the unmixing matrix corresponding to the previous signal frame; updating the unmixing matrix based on the first expected signal sequence; and calculating a second expected signal sequence based on the updated unmixing matrix and taking the second expected signal sequence as the target sound signal.
Optionally, the obtaining the first expected signal sequence of the current signal frame based on the unmixing matrix corresponding to the previous signal frame includes: solving for the first expected signal sequence according to the signal value sequence of the current signal frame, the echo reference signal value sequence of the current signal frame, the signal value sequences of the historical signal frames, and the unmixing matrix corresponding to the previous signal frame.
Optionally, the updating the unmixing matrix based on the first expected signal sequence includes: updating a weighted covariance matrix with the first expected signal values; and updating the unmixing matrix based on the updated weighted covariance matrix.
Optionally, the calculating a second expected signal sequence based on the updated unmixing matrix includes: solving for the second expected signal sequence according to the signal value sequence of the current signal frame, the echo reference signal value sequence of the current signal frame, the signal value sequences of the historical signal frames, and the updated unmixing matrix.
Optionally, the unmixing matrix is a preset matrix for establishing a mapping relationship among the signal value sequence of a signal frame, the signal value sequences of the historical signal frames before the signal frame, and the expected signal value sequence of the target sound signal in the signal frame, where the mapping relationship is:

$$\mathbf{z}(k,\tau) = \mathbf{W}(k)\,\mathbf{X}(k,\tau), \qquad \mathbf{X}(k,\tau) = \left[\mathbf{x}^{\mathsf{T}}(k,\tau),\ \mathbf{r}^{\mathsf{T}}(k,\tau),\ \bar{\mathbf{x}}^{\mathsf{T}}(k,\tau-1)\right]^{\mathsf{T}}$$

where x(k, τ) is the sequence of signal values at the k-th frequency index of the τ-th frame, z(k, τ) is the sequence of direct-sound signal values at the k-th frequency index of the τ-th frame, r(k, τ) is the sequence of echo reference signal values at the k-th frequency index of the τ-th frame, x̄(k, τ−1) is the stacked sequence of signal values of a preset number of historical signal frames up to the (τ−1)-th frame at the k-th frequency index, and W(k) is the unmixing matrix at the k-th frequency index.
Optionally, before the calculating a second expected signal sequence based on the updated unmixing matrix, the method further includes: and carrying out amplitude ambiguity removal on the updated unmixing matrix.
According to a second aspect of the embodiments of the present disclosure, there is provided a speech enhancement apparatus, including: an acquisition module configured to acquire a sound signal; a processing module configured to input each signal frame of the sound signal into an unmixing model, where the unmixing model is a computational model for removing reverberation and echo from a signal frame that treats the direct sound, reverberation, and echo in the frame as independent sound sources, and, for any signal frame other than the first, computes the reverberation- and echo-free signal frame sequence from the signal value sequence of the current signal frame and the signal value sequences of the historical signal frames before it; and an output module configured to obtain the reverberation- and echo-free target sound signal output by the unmixing model.
Optionally, the unmixing model is configured to obtain the target sound signal by: obtaining a first expected signal sequence of the current signal frame based on a de-mixing matrix corresponding to the previous signal frame; updating a demixing matrix based on the first desired signal sequence; calculating a second desired signal sequence based on the updated unmixing matrix, and regarding the second desired signal sequence as the target sound signal.
Optionally, the unmixing model is further configured to obtain the first expected signal sequence according to a signal value sequence of a current signal frame, an echo reference signal value sequence of the current signal frame, a signal value sequence of a historical signal frame, and a unmixing matrix corresponding to a previous signal frame.
Optionally, the unmixing model is further configured to update a weighted covariance matrix by the first expected signal value; updating the unmixing matrix based on the updated weighted covariance matrix.
Optionally, the unmixing model is further configured to obtain the second expected signal sequence according to the signal value sequence of the current signal frame, the echo reference signal value sequence of the current signal frame, the signal value sequences of the historical signal frames, and the updated unmixing matrix.
Optionally, the unmixing matrix is a preset matrix for establishing a mapping relationship among the signal value sequence of a signal frame, the signal value sequences of the historical signal frames before the signal frame, and the expected signal value sequence of the target sound signal in the signal frame, where the mapping relationship is:

$$\mathbf{z}(k,\tau) = \mathbf{W}(k)\,\mathbf{X}(k,\tau), \qquad \mathbf{X}(k,\tau) = \left[\mathbf{x}^{\mathsf{T}}(k,\tau),\ \mathbf{r}^{\mathsf{T}}(k,\tau),\ \bar{\mathbf{x}}^{\mathsf{T}}(k,\tau-1)\right]^{\mathsf{T}}$$

where x(k, τ) is the sequence of signal values at the k-th frequency index of the τ-th frame, z(k, τ) is the sequence of direct-sound signal values at the k-th frequency index of the τ-th frame, r(k, τ) is the sequence of echo reference signal values at the k-th frequency index of the τ-th frame, x̄(k, τ−1) is the stacked sequence of signal values of a preset number of historical signal frames up to the (τ−1)-th frame at the k-th frequency index, and W(k) is the unmixing matrix at the k-th frequency index.
Optionally, the unmixing model is further configured to perform amplitude deblurring on the updated unmixing matrix.
According to a third aspect of the embodiments of the present disclosure, there is provided a speech enhancement apparatus, including: a processor; and a memory for storing processor-executable instructions, wherein the processor is configured to: acquire a sound signal; input each signal frame of the sound signal into an unmixing model, where the unmixing model is a computational model for removing reverberation and echo from a signal frame that treats the direct sound, reverberation, and echo in the frame as independent sound sources, and, for any signal frame other than the first, computes the reverberation- and echo-free signal frame sequence from the signal value sequence of the current signal frame and the signal value sequences of the historical signal frames before it; and obtain the reverberation- and echo-free target sound signal output by the unmixing model.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the speech enhancement method provided by the first aspect of the present disclosure.
The technical solutions provided by the embodiments of the present disclosure may have the following beneficial effects: the acquired sound signal is processed by an unmixing model that treats reverberation and echo as sound sources independent of the direct sound; the reverberation and echo in the sound signal are separated out and the direct-sound signal is obtained, making speech enhancement more efficient, more convenient, and more effective.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flow diagram illustrating a method of speech enhancement according to an exemplary embodiment.
FIG. 2 is a schematic diagram illustrating a flow of speech enhancement according to an exemplary embodiment.
FIG. 3 is a block diagram illustrating a speech enhancement apparatus according to an example embodiment.
FIG. 4 is a block diagram illustrating an apparatus in accordance with an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Fig. 1 is a flowchart illustrating a speech enhancement method according to an exemplary embodiment. As shown in Fig. 1, the speech enhancement method may be applied to various electronic devices, such as mobile phones, computers, tablet computers, wearable terminals, recording pens, and recorders. The method includes the following steps.
In step S11, a sound signal is acquired.
In step S12, each signal frame in the audio signal is input to an unmixing model.
The unmixing model is a calculation model for removing reverberation and echo in a signal frame, wherein the calculation model is set by taking direct sound, reverberation and echo in the signal frame as independent sound sources, and for any non-first signal frame in the sound signal, the unmixing model calculates a signal frame sequence after the reverberation and echo are removed from the current signal frame according to a signal value sequence of the current signal frame and a signal value sequence of a historical signal frame before the current signal frame.
In step S13, a target sound signal output by the unmixing model after reverberation and echo are removed is obtained.
In step S11, the sound signal may be collected by a microphone array comprising a plurality of microphones, each microphone collecting its sound signal independently; the signals collected by the microphone array form the sound-signal sequence used in step S11.
For example, if the device has M microphones, the sound signal in each signal frame may be expressed as x(k, τ) = [x_1(k, τ), …, x_M(k, τ)]^T, where k is the frequency index and τ is the frame number index.
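For illustration only (not part of the patent text), such per-frame frequency-domain observations can be produced with a short-time Fourier transform; the sampling rate, channel count, frame length, and hop size below are assumed values:

```python
import numpy as np
from scipy.signal import stft

# Illustrative only: a stand-in for an M-microphone capture (values are random).
fs = 16000                                 # assumed sampling rate
M = 4                                      # assumed number of microphones
mics = np.random.randn(M, fs * 2)          # (M, n_samples)

# Zxx has shape (M, K, T): channel, frequency index k, frame index tau,
# so Zxx[:, k, tau] plays the role of x(k, tau) in the text.
_, _, Zxx = stft(mics, fs=fs, nperseg=512, noverlap=256)
K, T = Zxx.shape[1], Zxx.shape[2]
```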
Since the reverberation appearing in any signal frame arises from reflections and re-recording of the direct sound in earlier signal frames, the direct-sound signal values of the current frame can be obtained by relating the signal values of the current frame to those of the preceding frames. Thus, in step S12, the mapping relationship among the signal value sequence of the current signal frame, the signal value sequences of the signal frames before it, and the signal value sequence of the direct sound in the signal frame can be described by the following expression:
$$\mathbf{z}(k,\tau) = \mathbf{W}(k)\,\mathbf{X}(k,\tau), \qquad \mathbf{X}(k,\tau) = \left[\mathbf{x}^{\mathsf{T}}(k,\tau),\ \mathbf{r}^{\mathsf{T}}(k,\tau),\ \bar{\mathbf{x}}^{\mathsf{T}}(k,\tau-1)\right]^{\mathsf{T}}$$

where x(k, τ) is the sequence of signal values at the k-th frequency index of the τ-th frame, z(k, τ) is the sequence of direct-sound signal values at the k-th frequency index of the τ-th frame, r(k, τ) is the sequence of echo reference signal values at the k-th frequency index of the τ-th frame, x̄(k, τ−1) is the stacked sequence of signal values of a preset number of historical signal frames up to the (τ−1)-th frame at the k-th frequency index, and W(k) is the unmixing matrix at the k-th frequency index. The unmixing model estimates the direct-sound signal value sequence through the unmixing matrix.
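As a minimal sketch (not the patent's code; all shapes and names are assumptions), the stacked vector X(k, τ) and the application of W(k) at one frequency index might look like:

```python
import numpy as np

def demix_frame(W_k, x_k, r_k, x_hist_k):
    """Apply the unmixing matrix at one frequency index.

    W_k      : (N, N) unmixing matrix W(k), with N = M + R + M*L
    x_k      : (M,)   current-frame signal values x(k, tau)
    r_k      : (R,)   echo reference values r(k, tau)
    x_hist_k : (M*L,) stacked values of the L historical frames
    """
    X = np.concatenate([x_k, r_k, x_hist_k])  # stacked vector X(k, tau)
    y = W_k @ X                               # full output vector
    return y                                  # its first M entries estimate z(k, tau)
```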
It should be noted that, since the echo is produced by the device recording the sound it is itself playing, the echo reference signal value sequence may be generated by the device from the audio being played, or obtained from the signal value sequence of the file being played. In the above expression, the echo reference signal value sequence is treated as a known sequence used to solve for the direct-sound signal. The present disclosure does not limit the manner of obtaining the echo reference signal value sequence.
The above expression is derived from the sound observation model, whose expression is:

$$\mathbf{x}(k,\tau) = \mathbf{A}(k)\,\mathbf{s}(k,\tau) + \mathbf{B}(k)\,\mathbf{r}(k,\tau) + \mathbf{e}(k,\tau)$$

where A is the signal matrix of the direct sound and early reverberation, and B is the direct and early reflected part of the echo path. The late term e(k, τ) is unfolded into

$$\mathbf{e}(k,\tau) = \sum_{l}\mathbf{A}_{l}(k)\,\mathbf{s}(k,\tau-l) + \sum_{l}\mathbf{B}_{l}(k)\,\mathbf{r}(k,\tau-l)$$

where A_l is the signal matrix of the late reverberation, B_l is the late reflected part of the echo path, and s is the signal matrix of the direct sound.

The sound observation model is rewritten to obtain its matrix form:

$$\mathbf{X}(k,\tau) = \mathbf{P}(k)\,\mathbf{S}(k,\tau), \qquad \mathbf{S}(k,\tau) = \left[\mathbf{s}^{\mathsf{T}}(k,\tau),\ \mathbf{r}^{\mathsf{T}}(k,\tau),\ \bar{\mathbf{x}}^{\mathsf{T}}(k,\tau-1)\right]^{\mathsf{T}}$$

where P(k) is the mixing matrix.
based on the sound observation model and the mixing matrix, the direct sound, the reverberation and the echo can be separated through a de-mixing matrix to obtain the estimation value of each component, wherein z (k, tau) is a target sound signal obtained by estimating s (k, tau).
In one possible embodiment, the first M rows of the unmixing matrix W(k) take the block form

$$\left[\,\mathbf{D}_{M\times N}(k)\quad \mathbf{E}_{M\times R}(k)\quad \mathbf{F}_{M\times ML}(k)\,\right]$$

where D_{M×N}(k) is a separation matrix of M rows and N columns, E_{M×R}(k) is an echo path matrix of M rows and R columns, and F_{M×ML}(k) is a reverberation path matrix of M rows and ML columns.
The target sound signal with echo and reverberation removed may include M sub-signal sequences, each containing the sound of one independent source. For example, if a microphone records speakers A, B, C, and D together with ambient sounds A, B, C, and D at the same time, then after separation the first through fourth sub-signal sequences may be the dereverberated, echo-free sound signals of speakers A through D, and the fifth through eighth sub-signal sequences may be the dereverberated, echo-free sound signals of ambient sounds A through D.
Accordingly, the target sound signal z(k, τ) likewise has M rows, and the rows of the unmixing matrix other than the first M rows do not affect the determination of the target sound signal; the target sound signal can therefore be determined by determining and updating only the first M rows of the unmixing matrix.
In one possible embodiment, the unmixing model estimates the expected signal values by:
and calculating a first expected signal sequence of the current signal frame based on a de-mixing matrix corresponding to the previous signal frame, updating the de-mixing matrix based on the first expected signal sequence, calculating a second expected signal sequence based on the updated de-mixing matrix, and taking the second expected signal sequence as the target sound signal.
That is, the first expected signal sequence of the current frame is estimated using the unmixing matrix of the previous signal frame; the estimated first expected signal sequence is then used in a weighted update of the unmixing matrix for the current frame; and the updated matrix is used both to calculate the second expected signal sequence of the current frame and to estimate the first expected signal sequence of the next frame, so that the second expected signal sequences of all signal frames are obtained by continuous iteration. The first expected signal sequence, being estimated with the previous frame's unmixing matrix, may deviate considerably from the true value; the second expected signal sequence, computed with the current frame's unmixing matrix, deviates less from the true direct-sound signal values.
For the first frame of the sound signal sequence there are no earlier signal frames; therefore, when solving for the first frame, the unmixing matrix may be initialized to the identity matrix for the iterative operation.
In one possible implementation, the first expected signal sequence is obtained according to the signal value sequence of the current signal frame, the echo reference signal value sequence of the current signal frame, the signal value sequence of the historical signal frame, and the unmixing matrix corresponding to the previous signal frame.
For example, the first expected signal sequence y′(k, τ) may be estimated as

$$\mathbf{y}'(k,\tau) = \mathbf{W}(k,\tau-1)\,\mathbf{X}(k,\tau)$$

where W(k, τ−1) is the unmixing matrix corresponding to the k-th frequency index of the (τ−1)-th frame and X(k, τ) is the stacked observation vector defined above.
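A sketch of this step under the shapes assumed in the earlier snippets (W_prev and X are hypothetical names, not the patent's):

```python
import numpy as np

def estimate_first_desired(W_prev, X):
    """y'(k, tau) = W(k, tau-1) X(k, tau) for every frequency index k.

    W_prev : (K, N, N) unmixing matrices of frame tau-1
    X      : (K, N)    stacked observation vectors of frame tau
    Returns an array of shape (K, N).
    """
    # einsum applies each per-frequency matrix to its own stacked vector
    return np.einsum('knm,km->kn', W_prev, X)
```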
after the first expected signal value is obtained, the unmixing matrix can be updated through the obtained first expected signal value, and when updating, the first M rows of the unmixing matrix can be updated without updating all rows of the unmixing matrix, so that the iteration efficiency of the model is improved.
In one possible implementation, a weighted covariance matrix may be updated by the first desired signal value, and the unmixing matrix may be updated based on the updated weighted covariance matrix.
For example, the unmixing matrix is updated row by row:

$$\mathbf{w}_{m}(k,\tau) = \left[\mathbf{W}(k,\tau-1)\,\mathbf{C}_{m}(k,\tau)\right]^{-1}\mathbf{i}_{m}, \qquad m = 1,\dots,M$$

where w_m(k, τ) is the m-th row of the unmixing matrix W(k, τ), i_m is the m-th column of the identity matrix, and C_m(k, τ) is the weighting matrix for the m-th output at the k-th frequency index of the τ-th frame, updated as

$$\mathbf{C}_{m}(k,\tau) = \alpha\,\mathbf{C}_{m}(k,\tau-1) + \beta_{m}(\tau)\,\mathbf{X}(k,\tau)\,\mathbf{X}^{\mathsf{H}}(k,\tau)$$

where α is a preset smoothing coefficient and β_m(τ) is a weighted smoothing coefficient, computed by applying a contrast function to the m-th row sequence values of the first expected signal value sequence gathered over all K frequency indices of the τ-th frame, K being the number of frequency indices. The contrast function is parameterized by a preset zero-control parameter δ and a preset shape parameter γ.
In practice, α may be set to 0.99, and γ may be adjusted manually according to the distribution of the sound sources to improve the accuracy of the unmixing model: when the expected signal values produced by the unmixing model deviate greatly from the actual direct-sound signal values, the assumed source distribution can be changed by adjusting γ. In the present disclosure, the initial value of γ may be set to 0.2.
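The weighting and row updates above can be sketched as follows; since the original expressions are only partially legible, the exact contrast-function form, the scale normalization, and the diagonal loading are standard auxiliary-function choices assumed here:

```python
import numpy as np

def update_row(W, C_m, X, y1, m, alpha=0.99, gamma=0.2, delta=1e-6):
    """Update row m of W(k, tau) for every frequency index k.

    W   : (K, N, N) unmixing matrices, modified in place
    C_m : (K, N, N) weighted covariance matrices for output m, modified in place
    X   : (K, N)    stacked observations of the current frame
    y1  : (K, M)    first expected signal sequence y'(k, tau)
    """
    K, N = X.shape
    # Frame-level amplitude of output m gathered over all K frequency indices.
    r_m = np.sqrt(np.sum(np.abs(y1[:, m]) ** 2))
    # Assumed super-Gaussian contrast weight; delta guards against division by zero.
    beta = (1.0 - alpha) * gamma / max(r_m, delta) ** (2.0 - gamma)
    i_m = np.zeros(N)
    i_m[m] = 1.0
    for k in range(K):
        # Weighted covariance update; tiny diagonal loading keeps the solve well-posed.
        C_m[k] = alpha * C_m[k] + beta * np.outer(X[k], X[k].conj()) + 1e-10 * np.eye(N)
        w = np.linalg.solve(W[k] @ C_m[k], i_m)          # [W(k) C_m(k)]^{-1} i_m
        # Scale normalization (a standard auxiliary-function step, assumed here).
        w /= np.sqrt(np.real(w.conj() @ C_m[k] @ w)) + 1e-12
        W[k, m, :] = w.conj()                            # store as the m-th row
```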
When the unmixing matrix is initialized, the weighting matrix may also be initialized, where the weighting matrix may be initialized to an initial matrix set arbitrarily or may be initialized to a zero matrix.
In one possible implementation, the unmixing matrix W(k, 0) corresponding to the 0th frame may be initialized to the identity matrix I, and the weighting matrix C_m(k, 0) may be initialized to the zero matrix.
After the unmixing matrix of the current frame has been updated, an expected signal sequence more accurate than the first one, namely the second expected signal sequence, can be calculated with the more accurate unmixing matrix: the second expected signal sequence is solved from the signal value sequence of the current signal frame, the echo reference signal value sequence of the current signal frame, the signal value sequences of the historical signal frames, and the updated unmixing matrix. In one possible embodiment, the second expected signal sequence may be calculated by the formula y″(k, τ) = W(k, τ) X(k, τ), where W(k, τ) is the updated unmixing matrix.
The updated unmixing matrix may be amplitude-deblurred before the second expected signal sequence is calculated. In the present disclosure, the amplitude deblurring may be performed according to the minimal distortion principle (MDP), that is, by the formula W(k, τ) = diag(W^{-1}(k, τ)) W(k, τ).
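A per-frequency sketch of this MDP rescaling (assuming W(k, τ) is invertible):

```python
import numpy as np

def mdp_rescale(W):
    """Minimal-distortion rescaling: W(k) <- diag(W(k)^{-1}) W(k) for each k."""
    for k in range(W.shape[0]):
        d = np.diagonal(np.linalg.inv(W[k]))  # diagonal entries of W^{-1}(k, tau)
        W[k] = d[:, None] * W[k]              # scale the rows of W(k, tau)
    return W
```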
In blind source separation it is usually assumed that the number of sound sources equals the number of microphones, and a scaling ambiguity can appear in the result, because a linear transformation of an estimated source signal can equally well be regarded as another signal from the same source. Deblurring reduces the effect of this ambiguity on the unmixing matrix.
After the target sound signal is obtained, it may be sent to a speech processing unit that converts it into operation instructions; for example, the speech processing unit may correspond to a voice assistant program in the device. The target sound signal may also be sent to a speech recognition unit that converts it into text content. After dereverberation and echo cancellation, the sound signal contains less noise, which facilitates extraction and processing of the speech.
FIG. 2 is a schematic diagram illustrating a flow of speech enhancement according to an exemplary embodiment.
In step S21, the unmixing matrix is initialized to the identity matrix and the weighting matrix is initialized to the zero matrix.
In step S22, a first desired signal sequence of the τ -th frame is estimated by the unmixing matrix corresponding to the τ -1-th frame.
In step S23, the weighting matrix corresponding to the τ -th frame is updated.
In step S24, the unmixing matrix corresponding to the τ -th frame is updated.
In step S25, the unmixing matrix corresponding to the τ -th frame is processed using MDP amplitude deblurring.
In step S26, a second desired signal sequence corresponding to the τ -th frame is calculated from the deblurred unmixing matrix.
Steps S22 to S26 are executed in a loop until the second expected signal sequence corresponding to each signal frame to be processed is obtained.
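Tying steps S21 to S26 together, an online loop might look like the following sketch, reusing the hypothetical helpers estimate_first_desired, update_row, and mdp_rescale from the snippets above:

```python
import numpy as np

def enhance(Xs, M):
    """Online demixing over frames; Xs is (T, K, N) stacked observations X(k, tau)."""
    T, K, N = Xs.shape
    W = np.tile(np.eye(N, dtype=complex), (K, 1, 1))   # S21: identity initialization
    C = np.zeros((M, K, N, N), dtype=complex)          # S21: zero-matrix initialization
    Z = np.empty((T, K, M), dtype=complex)             # target signal z(k, tau)
    for t in range(T):
        X = Xs[t]
        y1 = estimate_first_desired(W, X)[:, :M]       # S22: first expected sequence
        for m in range(M):                             # S23/S24: weighting + row updates
            update_row(W, C[m], X, y1, m)
        W = mdp_rescale(W)                             # S25: MDP amplitude de-blurring
        Z[t] = np.einsum('knm,km->kn', W, X)[:, :M]    # S26: second expected sequence
    return Z
```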
Through the technical scheme, the following technical effects can be at least achieved:
the obtained sound signals are processed through a demixing model which respectively takes reverberation and echo as independent sound sources and distinguishes the reverberation and the direct sound, the reverberation and the echo in the sound signals are separated, the direct sound signals are obtained, and therefore voice enhancement is more efficient, convenient and better in effect.
FIG. 3 is a block diagram illustrating a speech enhancement apparatus according to an example embodiment. Referring to fig. 3, the apparatus includes an acquisition module 310, a processing module 320, and an output module 330.
The acquisition module 310 is configured to acquire a sound signal.
The processing module 320 is configured to input each signal frame of the sound signal into an unmixing model. The unmixing model is a computational model for removing reverberation and echo from a signal frame that treats the direct sound, reverberation, and echo in the frame as independent sound sources; for any signal frame other than the first, the unmixing model computes the reverberation- and echo-free signal frame sequence from the signal value sequence of the current signal frame and the signal value sequences of the historical signal frames before it.
The output module 330 is configured to obtain the target sound signal output by the unmixing model after the reverberation and echo are removed.
Optionally, the unmixing model is configured to obtain the target sound signal by: obtaining a first expected signal sequence of the current signal frame based on a de-mixing matrix corresponding to the previous signal frame; updating a demixing matrix based on the first desired signal sequence; calculating a second desired signal sequence based on the updated unmixing matrix, and regarding the second desired signal sequence as the target sound signal.
Optionally, the unmixing model is further configured to obtain the first expected signal sequence according to a signal value sequence of a current signal frame, an echo reference signal value sequence of the current signal frame, a signal value sequence of a historical signal frame, and a unmixing matrix corresponding to a previous signal frame.
Optionally, the unmixing model is further configured to update a weighted covariance matrix by the first expected signal value; updating the unmixing matrix based on the updated weighted covariance matrix.
Optionally, the unmixing model is further configured to obtain the second expected signal sequence according to the signal value sequence of the current signal frame, the echo reference signal value sequence of the current signal frame, the signal value sequences of the historical signal frames, and the updated unmixing matrix.
Optionally, the unmixing matrix is a preset matrix for establishing a mapping relationship among the signal value sequence of a signal frame, the signal value sequences of the historical signal frames before the signal frame, and the expected signal value sequence of the target sound signal in the signal frame, where the mapping relationship is:

$$\mathbf{z}(k,\tau) = \mathbf{W}(k)\,\mathbf{X}(k,\tau), \qquad \mathbf{X}(k,\tau) = \left[\mathbf{x}^{\mathsf{T}}(k,\tau),\ \mathbf{r}^{\mathsf{T}}(k,\tau),\ \bar{\mathbf{x}}^{\mathsf{T}}(k,\tau-1)\right]^{\mathsf{T}}$$

where x(k, τ) is the sequence of signal values at the k-th frequency index of the τ-th frame, z(k, τ) is the sequence of direct-sound signal values at the k-th frequency index of the τ-th frame, r(k, τ) is the sequence of echo reference signal values at the k-th frequency index of the τ-th frame, x̄(k, τ−1) is the stacked sequence of signal values of a preset number of historical signal frames up to the (τ−1)-th frame at the k-th frequency index, and W(k) is the unmixing matrix at the k-th frequency index.
Optionally, the unmixing model is further configured to perform amplitude deblurring on the updated unmixing matrix.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Through the technical scheme, the following technical effects can be at least achieved:
the obtained sound signals are processed through a demixing model which respectively takes reverberation and echo as independent sound sources and distinguishes the reverberation and the direct sound, the reverberation and the echo in the sound signals are separated, the direct sound signals are obtained, and therefore voice enhancement is more efficient, convenient and better in effect.
The present disclosure also provides a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the speech enhancement method provided by the present disclosure.
Fig. 4 is a block diagram illustrating an apparatus 400 for speech enhancement according to an example embodiment. For example, the apparatus 400 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 4, the apparatus 400 may include one or more of the following components: a processing component 402, a memory 404, a power component 406, a multimedia component 408, an audio component 410, an interface for input/output (I/O) 412, a sensor component 414, and a communication component 416.
The processing component 402 generally controls overall operation of the apparatus 400, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 402 may include one or more processors 420 to execute instructions to perform all or part of the steps of the speech enhancement method described above. Further, the processing component 402 can include one or more modules that facilitate interaction between the processing component 402 and other components. For example, the processing component 402 can include a multimedia module to facilitate interaction between the multimedia component 408 and the processing component 402.
The memory 404 is configured to store various types of data to support operations at the apparatus 400. Examples of such data include instructions for any application or method operating on the device 400, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 404 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power components 406 provide power to the various components of device 400. Power components 406 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for apparatus 400.
The multimedia component 408 includes a screen that provides an output interface between the device 400 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 408 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the apparatus 400 is in an operation mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 410 is configured to output and/or input audio signals. For example, audio component 410 includes a Microphone (MIC) configured to receive external audio signals when apparatus 400 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 404 or transmitted via the communication component 416. In some embodiments, audio component 410 also includes a speaker for outputting audio signals.
The I/O interface 412 provides an interface between the processing component 402 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 414 includes one or more sensors for providing various aspects of status assessment for the apparatus 400. For example, the sensor assembly 414 may detect an open/closed state of the apparatus 400, the relative positioning of the components, such as a display and keypad of the apparatus 400, the sensor assembly 414 may also detect a change in the position of the apparatus 400 or a component of the apparatus 400, the presence or absence of user contact with the apparatus 400, orientation or acceleration/deceleration of the apparatus 400, and a change in the temperature of the apparatus 400. The sensor assembly 414 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 414 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 414 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 416 is configured to facilitate wired or wireless communication between the apparatus 400 and other devices. The apparatus 400 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 416 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 416 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 400 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described speech enhancement methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 404 comprising instructions, executable by the processor 420 of the apparatus 400 to perform the speech enhancement method described above is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-mentioned speech enhancement method when executed by the programmable apparatus.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of speech enhancement, comprising:
acquiring a sound signal;
inputting each signal frame in the sound signal into a de-mixing model, wherein the de-mixing model is a calculation model for removing reverberation and echo in the signal frame, which is set by taking direct sound, reverberation and echo in the signal frame as independent sound sources, and for any non-first signal frame in the sound signal, the de-mixing model calculates a signal frame sequence after removing reverberation and echo from the current signal frame according to a signal value sequence of the current signal frame and a signal value sequence of a historical signal frame before the current signal frame;
and obtaining the target sound signal which is output by the unmixing model and is subjected to reverberation and echo removal.
2. The method of claim 1, wherein the unmixing model obtains the target sound signal by:
obtaining a first expected signal sequence of the current signal frame based on a de-mixing matrix corresponding to the previous signal frame;
updating a demixing matrix based on the first desired signal sequence;
calculating a second desired signal sequence based on the updated unmixing matrix, and regarding the second desired signal sequence as the target sound signal.
3. The method of claim 2, wherein the obtaining the first expected signal sequence of the current signal frame based on the unmixing matrix corresponding to the previous signal frame comprises:
and solving the first expected signal sequence according to the signal value sequence of the current signal frame, the echo reference signal value sequence of the current signal frame, the signal value sequence of the historical signal frame and the unmixing matrix corresponding to the previous signal frame.
4. The method of claim 2, wherein the updating the unmixing matrix based on the first desired signal sequence comprises:
updating a weighted covariance matrix with the first expected signal value;
updating the unmixing matrix based on the updated weighted covariance matrix.
5. The method of claim 2, wherein the calculating a second expected signal sequence based on the updated unmixing matrix comprises:
and solving the first expected signal sequence according to the signal value sequence of the current signal frame, the echo reference signal value sequence of the current signal frame, the signal value sequence of the historical signal frame and the updated unmixing matrix.
6. The method according to any one of claims 2-5, wherein the unmixing matrix is a preset matrix for establishing a mapping relationship among the signal value sequence of the current signal frame, the signal value sequences of the historical signal frames before the current signal frame, and the expected signal value sequence of the target sound signal in the current signal frame, the mapping relationship being:

$$\mathbf{z}(k,\tau) = \mathbf{W}(k)\,\mathbf{X}(k,\tau), \qquad \mathbf{X}(k,\tau) = \left[\mathbf{x}^{\mathsf{T}}(k,\tau),\ \mathbf{r}^{\mathsf{T}}(k,\tau),\ \bar{\mathbf{x}}^{\mathsf{T}}(k,\tau-1)\right]^{\mathsf{T}}$$

wherein x(k, τ) is the sequence of signal values at the k-th frequency index of the τ-th frame, z(k, τ) is the sequence of direct-sound signal values at the k-th frequency index of the τ-th frame, r(k, τ) is the sequence of echo reference signal values at the k-th frequency index of the τ-th frame, x̄(k, τ−1) is the stacked sequence of signal values of a preset number of historical signal frames up to the (τ−1)-th frame at the k-th frequency index, and W(k) is the unmixing matrix at the k-th frequency index.
7. The method of claim 2, wherein prior to said computing a second desired signal sequence based on the updated unmixing matrix, the method further comprises:
and carrying out amplitude ambiguity removal on the updated unmixing matrix.
8. A speech enhancement apparatus, comprising:
an acquisition module configured to acquire a sound signal;
a processing module configured to input each signal frame in the sound signal into a demixing model, wherein the demixing model is a calculation model for removing reverberation and echo in the signal frame, which is set by taking direct sound, reverberation and echo in the signal frame as independent sound sources, and for any non-first signal frame in the sound signal, the demixing model calculates a signal frame sequence after removing reverberation and echo in a current signal frame according to a signal value sequence of the current signal frame and a signal value sequence of a history signal frame before the current signal frame;
an output module configured to obtain the target sound signal output by the unmixing model after the reverberation and the echo are removed.
9. A speech enhancement apparatus, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring a sound signal;
inputting each signal frame in the sound signal into a de-mixing model, wherein the de-mixing model is a calculation model for removing reverberation and echo in the signal frame, which is set by taking direct sound, reverberation and echo in the signal frame as independent sound sources, and for any non-first signal frame in the sound signal, the de-mixing model calculates a signal frame sequence after removing reverberation and echo from the current signal frame according to a signal value sequence of the current signal frame and a signal value sequence of a historical signal frame before the current signal frame;
and obtaining the target sound signal which is output by the unmixing model and is subjected to reverberation and echo removal.
10. A computer-readable storage medium, on which computer program instructions are stored, which program instructions, when executed by a processor, carry out the steps of the method according to any one of claims 1 to 7.
CN202110649724.9A 2021-06-10 2021-06-10 Speech enhancement method, device and storage medium Active CN113223543B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110649724.9A CN113223543B (en) 2021-06-10 2021-06-10 Speech enhancement method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110649724.9A CN113223543B (en) 2021-06-10 2021-06-10 Speech enhancement method, device and storage medium

Publications (2)

Publication Number Publication Date
CN113223543A true CN113223543A (en) 2021-08-06
CN113223543B CN113223543B (en) 2023-04-28

Family

ID=77080139

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110649724.9A Active CN113223543B (en) 2021-06-10 2021-06-10 Speech enhancement method, device and storage medium

Country Status (1)

Country Link
CN (1) CN113223543B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130294608A1 (en) * 2012-05-04 2013-11-07 Sony Computer Entertainment Inc. Source separation by independent component analysis with moving constraint
CN102750956A (en) * 2012-06-18 2012-10-24 歌尔声学股份有限公司 Method and device for removing reverberation of single channel voice
CN105931648A (en) * 2016-06-24 2016-09-07 百度在线网络技术(北京)有限公司 Audio signal de-reverberation method and device
US20200349954A1 (en) * 2019-04-30 2020-11-05 Microsoft Technology Licensing, Llc Processing Overlapping Speech from Distributed Devices
CN110428852A (en) * 2019-08-09 2019-11-08 南京人工智能高等研究院有限公司 Speech separating method, device, medium and equipment
CN112489668A (en) * 2020-11-04 2021-03-12 北京百度网讯科技有限公司 Dereverberation method, dereverberation device, electronic equipment and storage medium
CN112435685A (en) * 2020-11-24 2021-03-02 深圳市友杰智新科技有限公司 Blind source separation method and device for strong reverberation environment, voice equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LU, YAN: "Research on Speech Emotion Recognition Based on Neural Networks Combined with an Attention Mechanism", China Master's Theses Full-text Database (Electronic Journal) *
LI, XIANWEI et al.: "Speech Signal Separation Based on Frequency-Domain Decorrelation", Applied Science and Technology *

Also Published As

Publication number Publication date
CN113223543B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
CN108510987B (en) Voice processing method and device
CN111128221B (en) Audio signal processing method and device, terminal and storage medium
CN110970046B (en) Audio data processing method and device, electronic equipment and storage medium
CN111709891B (en) Training method of image denoising model, image denoising method, device and medium
CN107967459B (en) Convolution processing method, convolution processing device and storage medium
CN113314135B (en) Voice signal identification method and device
CN106060707B (en) Reverberation processing method and device
CN113053406A (en) Sound signal identification method and device
CN108629814B (en) Camera adjusting method and device
CN113506582A (en) Sound signal identification method, device and system
CN107239758B (en) Method and device for positioning key points of human face
CN107730443B (en) Image processing method and device and user equipment
CN110459236B (en) Noise estimation method, apparatus and storage medium for audio signal
CN112447184A (en) Voice signal processing method and device, electronic equipment and storage medium
CN110533006B (en) Target tracking method, device and medium
CN112201267A (en) Audio processing method and device, electronic equipment and storage medium
CN113223543B (en) Speech enhancement method, device and storage medium
CN109309764B (en) Audio data processing method and device, electronic equipment and storage medium
CN113489855B (en) Sound processing method, device, electronic equipment and storage medium
CN113223553B (en) Method, apparatus and medium for separating voice signal
CN113488066A (en) Audio signal processing method, audio signal processing apparatus, and storage medium
CN113077808B (en) Voice processing method and device for voice processing
CN111667842B (en) Audio signal processing method and device
CN113362841B (en) Audio signal processing method, device and storage medium
CN112434714A (en) Multimedia identification method, device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant