CN113223543B - Speech enhancement method, device and storage medium - Google Patents

Speech enhancement method, device and storage medium

Info

Publication number
CN113223543B
CN113223543B
Authority
CN
China
Prior art keywords
signal
frame
sequence
signal frame
unmixed
Prior art date
Legal status
Active
Application number
CN202110649724.9A
Other languages
Chinese (zh)
Other versions
CN113223543A (en)
Inventor
侯海宁 (Hou Haining)
Current Assignee
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd and Beijing Xiaomi Pinecone Electronic Co Ltd
Priority to CN202110649724.9A
Publication of CN113223543A
Application granted
Publication of CN113223543B


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L21/0272: Voice signal separating
    • G10L2021/02082: Noise filtering the noise being echo, reverberation of the speech
    • G10L2021/02161: Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166: Microphone arrays; Beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The present disclosure relates to a speech enhancement method, apparatus, and storage medium. The method comprises: acquiring a sound signal; inputting each signal frame of the sound signal into an unmixing model, the unmixing model being a computational model that treats the direct sound, reverberation, and echo in a signal frame as independent sound sources and removes the reverberation and echo from the frame, where for any signal frame other than the first, the unmixing model computes the dereverberated, echo-free signal frame sequence for the current frame from the signal value sequence of the current frame and the signal value sequences of the historical signal frames preceding it; and obtaining the dereverberated, echo-free target sound signal output by the unmixing model. The present disclosure can improve the effect of speech enhancement.

Description

Speech enhancement method, device and storage medium
Technical Field
The present disclosure relates to the field of sound processing, and in particular, to a method, apparatus, and storage medium for speech enhancement.
Background
At present, most product devices pick up sound with a microphone array and apply microphone beamforming or blind source separation to improve the processing quality of speech signals and raise the speech recognition rate in real environments.
However, when a device records while it is also playing sound, the played sound is picked up by its own microphones and forms an echo. The echo interferes with sound signals such as control commands issued by users, which lowers the device's speech recognition rate and degrades the interactive experience. In addition, in real environments, sound is reflected by walls, furniture, and the like, producing reverberation. Reverberation degrades both the beamforming and the separation performance.
Current speech enhancement systems are typically built by connecting an echo cancellation module, a dereverberation module, and a blind source separation module in series. The modules are mutually independent, with the work divided among them and each optimized separately; therefore, although each module can reach its own optimum, the performance of the overall pipeline is not guaranteed to be optimal.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a speech enhancement method, apparatus, and storage medium.
According to a first aspect of embodiments of the present disclosure, there is provided a speech enhancement method, including: acquiring a sound signal; and inputting each signal frame of the sound signal into an unmixing model, the unmixing model being a computational model that treats the direct sound, reverberation, and echo in a signal frame as independent sound sources and removes the reverberation and echo from the frame, where for any signal frame other than the first, the unmixing model computes the dereverberated, echo-free signal frame sequence for the current frame from the signal value sequence of the current frame and the signal value sequences of the historical signal frames preceding it.
Optionally, the unmixing model obtains the target sound signal by: calculating a first expected signal sequence of the current signal frame based on the unmixing matrix corresponding to the previous signal frame; updating the unmixing matrix based on the first expected signal sequence; and calculating a second expected signal sequence based on the updated unmixing matrix, the second expected signal sequence being taken as the target sound signal.
Optionally, calculating the first expected signal sequence of the current signal frame based on the unmixing matrix corresponding to the previous signal frame includes: calculating the first expected signal sequence from the signal value sequence of the current signal frame, the echo reference signal value sequence of the current signal frame, the signal value sequences of the historical signal frames, and the unmixing matrix corresponding to the previous signal frame.
Optionally, updating the unmixing matrix based on the first expected signal sequence includes: updating a weighted covariance matrix with the first expected signal sequence; and updating the unmixing matrix based on the updated weighted covariance matrix.
Optionally, calculating the second expected signal sequence based on the updated unmixing matrix includes: calculating the second expected signal sequence from the signal value sequence of the current signal frame, the echo reference signal value sequence of the current signal frame, the signal value sequences of the historical signal frames, and the updated unmixing matrix.
Optionally, the unmixing matrix is a preset matrix that establishes a mapping relationship among the signal value sequence of a signal frame, the signal value sequences of the historical signal frames preceding that frame, and the expected signal value sequence of the target sound signal in that frame, the mapping relationship being:

$$\mathbf{z}(k,\tau)=\mathbf{W}(k)\,\mathbf{X}(k,\tau),\qquad \mathbf{X}(k,\tau)=\begin{bmatrix}\mathbf{x}(k,\tau)\\ \mathbf{r}(k,\tau)\\ \bar{\mathbf{x}}(k,\tau-1)\end{bmatrix}$$

where x(k,τ) is the signal value sequence at the k-th frequency index of the τ-th frame, z(k,τ) is the direct sound signal value sequence at the k-th frequency index of the τ-th frame, r(k,τ) is the echo reference signal value sequence at the k-th frequency index of the τ-th frame, x̄(k,τ-1) is the stacked signal value sequence of a preset number of historical signal frames up to the τ-1-th frame, and W(k) is the unmixing matrix at the k-th frequency index.
Optionally, before calculating the second expected signal sequence based on the updated unmixing matrix, the method further comprises: performing amplitude de-blurring on the updated unmixing matrix.
According to a second aspect of embodiments of the present disclosure, there is provided a speech enhancement apparatus comprising: an acquisition module configured to acquire a sound signal; a processing module configured to input each signal frame of the sound signal into an unmixing model, the unmixing model being a computational model that treats the direct sound, reverberation, and echo in a signal frame as independent sound sources and removes the reverberation and echo from the frame, where for any signal frame other than the first, the unmixing model computes the dereverberated, echo-free signal frame sequence for the current frame from the signal value sequence of the current frame and the signal value sequences of the historical signal frames preceding it; and an output module configured to obtain the dereverberated, echo-free target sound signal output by the unmixing model.
Optionally, the unmixing model is configured to obtain the target sound signal by: calculating a first expected signal sequence of the current signal frame based on the unmixing matrix corresponding to the previous signal frame; updating the unmixing matrix based on the first expected signal sequence; and calculating a second expected signal sequence based on the updated unmixing matrix, the second expected signal sequence being taken as the target sound signal.
Optionally, the unmixing model is further configured to calculate the first expected signal sequence from the signal value sequence of the current signal frame, the echo reference signal value sequence of the current signal frame, the signal value sequences of the historical signal frames, and the unmixing matrix corresponding to the previous signal frame.
Optionally, the unmixing model is further configured to update a weighted covariance matrix with the first expected signal sequence, and to update the unmixing matrix based on the updated weighted covariance matrix.
Optionally, the unmixing model is further configured to calculate the second expected signal sequence from the signal value sequence of the current signal frame, the echo reference signal value sequence of the current signal frame, the signal value sequences of the historical signal frames, and the updated unmixing matrix.
Optionally, the unmixing matrix is a preset matrix that establishes a mapping relationship among the signal value sequence of a signal frame, the signal value sequences of the historical signal frames preceding that frame, and the expected signal value sequence of the target sound signal in that frame, the mapping relationship being:

$$\mathbf{z}(k,\tau)=\mathbf{W}(k)\,\mathbf{X}(k,\tau),\qquad \mathbf{X}(k,\tau)=\begin{bmatrix}\mathbf{x}(k,\tau)\\ \mathbf{r}(k,\tau)\\ \bar{\mathbf{x}}(k,\tau-1)\end{bmatrix}$$

where x(k,τ) is the signal value sequence at the k-th frequency index of the τ-th frame, z(k,τ) is the direct sound signal value sequence at the k-th frequency index of the τ-th frame, r(k,τ) is the echo reference signal value sequence at the k-th frequency index of the τ-th frame, x̄(k,τ-1) is the stacked signal value sequence of a preset number of historical signal frames up to the τ-1-th frame, and W(k) is the unmixing matrix at the k-th frequency index.
Optionally, the unmixing model is further configured to perform amplitude de-blurring on the updated unmixing matrix.
According to a third aspect of embodiments of the present disclosure, there is provided a speech enhancement apparatus comprising: a processor; and a memory for storing processor-executable instructions, wherein the processor is configured to perform the steps of the speech enhancement method provided by the first aspect of the present disclosure.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the speech enhancement method provided by the first aspect of the present disclosure.
The technical solutions provided by the embodiments of the present disclosure may include the following beneficial effects: the acquired sound signal is processed by an unmixing model that treats the reverberation and the echo as independent sound sources, distinct from the direct sound, and separates the reverberation and echo from the sound signal to obtain the direct sound signal, making speech enhancement more efficient, more convenient, and more effective.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flow chart illustrating a method of speech enhancement according to an exemplary embodiment.
Fig. 2 is a schematic diagram illustrating a flow of speech enhancement according to an example embodiment.
Fig. 3 is a block diagram illustrating a speech enhancement apparatus according to an exemplary embodiment.
Fig. 4 is a block diagram of an apparatus according to an example embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure, as detailed in the appended claims.
Fig. 1 is a flowchart illustrating a speech enhancement method according to an exemplary embodiment. As shown in Fig. 1, the method may be applied to various electronic devices, for example a mobile phone, a computer, a tablet computer, a wearable terminal, a recording pen, or a voice recorder. The method comprises the following steps.
In step S11, a sound signal is acquired.
In step S12, each signal frame of the sound signal is input into the unmixing model.
The unmixing model is a computational model that treats the direct sound, reverberation, and echo in a signal frame as independent sound sources and removes the reverberation and echo from the frame. For any signal frame other than the first, the unmixing model computes the dereverberated, echo-free signal frame sequence for the current frame from the signal value sequence of the current frame and the signal value sequences of the historical signal frames preceding it.
In step S13, the dereverberated, echo-free target sound signal output by the unmixing model is obtained.
In step S11, the sound signal may be acquired by a microphone array comprising a plurality of microphones, each of which picks up sound independently; the sequence of sound signals collected by the microphone array may be used as the sound signal of step S11.
For example, if the device has M microphones, the sound signal in each signal frame may be expressed as $\mathbf{x}(k,\tau)=[x_{1}(k,\tau),\ldots,x_{M}(k,\tau)]^{T}$, where k is the frequency index and τ is the frame index.
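To make the framing concrete, the following minimal sketch (in Python, assuming NumPy and SciPy are available; the sampling rate, microphone count, and STFT parameters are illustrative assumptions, not requirements of the method) produces such per-frequency, per-frame observations from a multichannel recording:

import numpy as np
from scipy.signal import stft

fs = 16000                       # sampling rate (illustrative assumption)
M = 4                            # number of microphones (illustrative assumption)
audio = np.random.randn(M, fs)   # stand-in for one second of M-channel audio

# Zxx has shape (M, K, T): M channels, K frequency indices, T frames
_, _, Zxx = stft(audio, fs=fs, nperseg=512, noverlap=256)

k, tau = 10, 5                   # one frequency index k and one frame index tau
x_k_tau = Zxx[:, k, tau]         # x(k, tau): the M-dimensional signal value sequence
print(x_k_tau.shape)             # (M,)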
Since the reverberation observed in any signal frame originates from the direct sound of earlier signal frames, reflected and then recorded, the signal values of the direct sound in the current signal frame can be recovered by relating the signal values of the current frame to those of the frames preceding it. Therefore, in step S12, the mapping relationship among the signal value sequence of the currently processed signal frame, the signal value sequences of the signal frames preceding it, and the signal value sequence of the direct sound in the frame can be described by the following expression:
$$\mathbf{z}(k,\tau)=\mathbf{W}(k)\,\mathbf{X}(k,\tau),\qquad \mathbf{X}(k,\tau)=\begin{bmatrix}\mathbf{x}(k,\tau)\\ \mathbf{r}(k,\tau)\\ \bar{\mathbf{x}}(k,\tau-1)\end{bmatrix},\qquad \bar{\mathbf{x}}(k,\tau-1)=\begin{bmatrix}\mathbf{x}(k,\tau-1)\\ \vdots\\ \mathbf{x}(k,\tau-L)\end{bmatrix}$$

where x(k,τ) is the signal value sequence at the k-th frequency index of the τ-th frame, z(k,τ) is the direct sound signal value sequence at the k-th frequency index of the τ-th frame, r(k,τ) is the echo reference signal value sequence at the k-th frequency index of the τ-th frame, x̄(k,τ-1) stacks the signal value sequences of a preset number L of historical signal frames up to the τ-1-th frame, and W(k) is the unmixing matrix at the k-th frequency index, of which only the first M rows contribute to z(k,τ), as discussed below. The unmixing model estimates the direct sound signal value sequence through the unmixing matrix.
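The shape bookkeeping implied by this mapping can be sketched as follows; the channel counts and history length are illustrative assumptions, and the identity initialization of W matches the initialization described later in this disclosure:

import numpy as np

M, R, L = 4, 1, 3                # mics, echo references, history frames (assumed)
N = M + R + M * L                # length of the stacked observation vector

rng = np.random.default_rng(0)
x = rng.standard_normal(M) + 1j * rng.standard_normal(M)               # x(k, tau)
r = rng.standard_normal(R) + 1j * rng.standard_normal(R)               # r(k, tau)
x_hist = rng.standard_normal(M * L) + 1j * rng.standard_normal(M * L)  # stacked history

X_stack = np.concatenate([x, r, x_hist])   # X(k, tau) = [x; r; x_bar]
W = np.eye(N, dtype=complex)               # unmixing matrix W(k), identity init

y = W @ X_stack                            # output of the unmixing model
z = y[:M]                                  # first M rows: direct sound estimate z(k, tau)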
It should be noted that, since the echo arises from the device recording the sound it is playing, the echo reference signal value sequence may be generated by the device from the audio being played, or obtained by reading the signal value sequence of the file being played; in the expression above, the echo reference signal value sequence is treated as known when calculating the direct sound signal. The present disclosure does not limit how the echo reference signal value sequence is acquired.
The expression above is derived from an acoustic observation model:

$$\mathbf{x}(k,\tau)=\mathbf{A}(k)\,\mathbf{s}(k,\tau)+\mathbf{B}(k)\,\mathbf{r}(k,\tau)+\sum_{l=1}^{L}\mathbf{A}_{l}(k)\,\mathbf{s}(k,\tau-l)+\sum_{l=1}^{L}\mathbf{B}_{l}(k)\,\mathbf{r}(k,\tau-l)$$

where A is the mixing matrix of the direct sound and early reverberation, B is the direct and early-reflection part of the echo path, A_l is the mixing matrix of the late reverberation, B_l is the late-reflection part of the echo path, and s is the signal matrix of the direct sound. Rewriting the acoustic observation model in matrix form gives:

$$\mathbf{X}(k,\tau)=\mathbf{P}(k)\,\mathbf{S}(k,\tau)$$

where P(k) is the mixing matrix and S(k,τ) stacks the direct sound, the echo references, and the corresponding historical frames.
based on the sound observation model and the mixing matrix, the direct sound, the reverberation and the echo can be separated through the unmixing matrix to obtain estimated values of all components, wherein z (k, tau) is a target sound signal obtained by estimating s (k, tau).
In one possible implementation, the unmixing matrix W(k) has the form:

$$\mathbf{W}(k)=\begin{bmatrix}\mathbf{D}_{M\times N}(k)&\mathbf{E}_{M\times R}(k)&\mathbf{F}_{M\times ML}(k)\\ \vdots&\vdots&\vdots\end{bmatrix}$$

where D_{M×N}(k) is a separation matrix of M rows and N columns, E_{M×R}(k) is an echo path matrix of M rows and R columns, and F_{M×ML}(k) is a reverberation path matrix of M rows and ML columns; these blocks form the first M rows of W(k).
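A sketch of this partitioned structure is given below. Only the first M rows (the D, E, F blocks) are specified by the text; completing the remaining rows with a pass-through identity, as done here, is an assumption of this example, consistent with the observation below that those rows do not affect the target signal.

import numpy as np

M, R, L = 4, 1, 3                               # illustrative sizes (N = M assumed)
D = np.eye(M, dtype=complex)                    # M x N separation block
E = np.zeros((M, R), dtype=complex)             # M x R echo path block
F = np.zeros((M, M * L), dtype=complex)         # M x ML reverberation path block

top = np.hstack([D, E, F])                      # the M rows that produce z(k, tau)
bottom = np.hstack([np.zeros((R + M * L, M), dtype=complex),
                    np.eye(R + M * L, dtype=complex)])
W = np.vstack([top, bottom])                    # full square unmixing matrix W(k)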
The target sound signal, after echo and reverberation removal, may include M sub-signal sequences, each containing the sound signal of one independent sound source. For example, if eight microphones simultaneously record person A, person B, person C, person D, ambient sound a, ambient sound b, ambient sound c, and ambient sound d, then after separation the first sub-signal sequence of the target sound signal may be the dereverberated, echo-free sound signal of person A, the second sub-signal sequence that of person B, the third that of person C, the fourth that of person D, and the fifth through eighth those of ambient sounds a through d, respectively.
Accordingly, the target sound signal z(k,τ) likewise comprises M rows, and the rows of the unmixing model beyond the first M do not affect the determination of the target sound signal; therefore, the target sound signal can be determined by computing and updating only the first M rows of the unmixing model.
In one possible implementation, the unmixing model estimates the expected signal values as follows: a first expected signal sequence of the current signal frame is calculated based on the unmixing matrix corresponding to the previous signal frame; the unmixing matrix is updated based on the first expected signal sequence; a second expected signal sequence is calculated based on the updated unmixing matrix; and the second expected signal sequence is taken as the target sound signal.
That is, the first expected signal sequence of the current frame is estimated with the unmixing matrix of the previous signal frame; the unmixing matrix of the current frame is then updated by a weighted computation over the estimated first expected signal sequence; and the updated matrix is used both to calculate the second expected signal sequence of the current frame and to estimate the first expected signal sequence of the next frame, so that the second expected signal sequences of all signal frames are obtained by continued iteration. The first expected signal sequence, estimated with the unmixing matrix of the previous frame, may deviate considerably from the true values; the second expected signal sequence, calculated with the unmixing matrix of the current frame, yields direct sound signal values with smaller deviation.
For the first frame of the sound signal sequence there is no earlier signal frame, so when solving for the first frame the unmixing matrix may be initialized to an identity matrix for the iterative computation.
In one possible implementation, the first expected signal sequence is calculated from the signal value sequence of the current signal frame, the echo reference signal value sequence of the current signal frame, the signal value sequences of the historical signal frames, and the unmixing matrix corresponding to the previous signal frame.
For example, the first expected signal sequence y′(k,τ) may be estimated by the formula y′(k,τ) = W(k,τ-1) X(k,τ), where W(k,τ-1) is the unmixing matrix corresponding to the k-th frequency index of the τ-1-th frame and $\mathbf{X}(k,\tau)=[\mathbf{x}^{T}(k,\tau),\ \mathbf{r}^{T}(k,\tau),\ \bar{\mathbf{x}}^{T}(k,\tau-1)]^{T}$ is the stacked observation defined above.
after the first expected signal value is obtained, the unmixed matrix can be updated through the obtained first expected signal value, and when the unmixed matrix is updated, the first M rows of the unmixed matrix can be updated, but not all rows of the unmixed matrix are updated, so that the iteration efficiency of the model is improved.
In one possible implementation, a weighted covariance matrix is updated with the first expected signal sequence, and the unmixing matrix is updated based on the updated weighted covariance matrix.
For example, the unmixing matrix is updated row by row according to

$$\mathbf{w}_{m}(k,\tau)=\left[\mathbf{W}(k,\tau-1)\,\mathbf{C}_{m}(k,\tau)\right]^{-1}\mathbf{i}_{m}$$

where w_m(k,τ) is the m-th row of the unmixing matrix W(k,τ), i_m is the m-th column of the identity matrix, and m is any natural number from 1 to M; traversing m from 1 to M updates the first M rows of W(k,τ). C_m(k,τ), the weighted covariance matrix for the m-th row at the k-th frequency index of the τ-th frame, is updated as

$$\mathbf{C}_{m}(k,\tau)=\alpha\,\mathbf{C}_{m}(k,\tau-1)+\beta_{m}(\tau)\,\mathbf{X}(k,\tau)\,\mathbf{X}^{H}(k,\tau)$$

where α is a preset smoothing coefficient and β_m(τ) is a weighted smoothing coefficient, which may be taken as

$$\beta_{m}(\tau)=(1-\alpha)\,\varphi\!\left(r_{m}(\tau)\right),\qquad r_{m}(\tau)=\sqrt{\sum_{k=1}^{K}\left|y'_{m}(k,\tau)\right|^{2}}$$

where y′_m(k,τ) is the m-th row value of the first expected signal sequence at each frequency index of the τ-th frame, K is the number of frequency indices, and φ is the contrast function, which may take the form

$$\varphi(r)=\frac{\gamma}{2\,r^{\,2-\gamma}+\delta}$$

where δ is a preset parameter that prevents division by zero and γ is a preset shape parameter.
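For a single frequency index, these updates can be sketched as the following helpers; the exact contrast function follows the reconstruction above and, like the default parameter values, should be read as an assumption of this example:

import numpy as np

def update_weighted_cov(C_m_prev, X_stack, beta_m, alpha=0.99):
    """C_m(k,tau) = alpha * C_m(k,tau-1) + beta_m(tau) * X(k,tau) X^H(k,tau)."""
    return alpha * C_m_prev + beta_m * np.outer(X_stack, X_stack.conj())

def update_unmix_row(W_prev, C_m, m):
    """w_m(k,tau) = [W(k,tau-1) C_m(k,tau)]^{-1} i_m, with i_m the m-th unit vector."""
    i_m = np.zeros(W_prev.shape[0], dtype=complex)
    i_m[m] = 1.0
    return np.linalg.solve(W_prev @ C_m, i_m)

def contrast_weight(y_first_m, alpha=0.99, gamma=0.2, delta=1e-6):
    """beta_m(tau) = (1 - alpha) * phi(r_m), with r_m the norm of row m over frequency."""
    r_m = np.sqrt(np.sum(np.abs(y_first_m) ** 2))
    return (1.0 - alpha) * gamma / (2.0 * r_m ** (2.0 - gamma) + delta)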
In practice, α may be set to 0.99, and γ may be tuned manually according to the distribution of the sound sources to improve the accuracy of the unmixing model: if the expected signal values produced by the unmixing model differ greatly from the actual direct sound signal values, the assumed source distribution can be adjusted by changing the value of γ. In the present disclosure, the initial value of γ may be set to 0.2.
In addition to initializing the unmixing matrix, the weighting matrix is also initialized; it may be set to an arbitrarily chosen initial matrix or to a zero matrix. In one possible implementation, the unmixing matrix corresponding to frame 0, W(k,0), is initialized to the identity matrix I, and the weighting matrix C_m(k,0) is initialized to the zero matrix 0.
After the unmixing matrix of the current frame has been updated, an expected signal sequence more accurate than the first one, i.e., the second expected signal sequence, can be computed with the more accurate unmixing matrix: the second expected signal sequence is calculated from the signal value sequence of the current signal frame, the echo reference signal value sequence of the current signal frame, the signal value sequences of the historical signal frames, and the updated unmixing matrix. In one possible embodiment, the second expected signal sequence may be calculated by the formula y″(k,τ) = W(k,τ) X(k,τ), where W(k,τ) is the updated unmixing matrix.
Before the second expected signal sequence is calculated, the updated unmixing matrix may be subjected to amplitude de-blurring, which in the present disclosure may be performed according to the MDP (Minimal Distortion Principle), i.e., by the formula $\mathbf{W}(k,\tau)=\operatorname{diag}\!\left(\mathbf{W}^{-1}(k,\tau)\right)\mathbf{W}(k,\tau)$.
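A minimal sketch of this rescaling step (diag(·) keeps only the diagonal of the inverse, matching the formula above):

import numpy as np

def mdp_rescale(W):
    """MDP amplitude de-blurring: W <- diag(W^{-1}) W."""
    return np.diag(np.diag(np.linalg.inv(W))) @ W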
In blind source separation it is generally assumed that the number of sound sources equals the number of microphones, and scaling uncertainty can arise because a linear transformation of an estimated source signal can also be regarded as another valid signal in the signal source. De-blurring reduces the effect of this problem on the unmixing matrix.
After the target sound signal is obtained, it may be sent to a voice processing unit so that the unit converts it into an operating instruction; for example, the voice processing unit may correspond to a voice assistant program in the device. The target sound signal may also be sent to a speech recognition unit so that it is converted into text. After dereverberation and echo removal, less noise remains in the sound signal, which is more conducive to sound extraction and processing.
Fig. 2 is a schematic diagram illustrating a flow of speech enhancement according to an exemplary embodiment.
In step S21, the unmixing matrix is initialized to an identity matrix, and the weighting matrix is initialized to a zero matrix.
In step S22, the first expected signal sequence of the τ-th frame is estimated with the unmixing matrix corresponding to the τ-1-th frame.
In step S23, the weighting matrix corresponding to the τ-th frame is updated.
In step S24, the unmixing matrix corresponding to the τ-th frame is updated.
In step S25, the unmixing matrix corresponding to the τ-th frame is amplitude-de-blurred using MDP.
In step S26, the second expected signal sequence corresponding to the τ-th frame is calculated with the de-blurred unmixing matrix.
Steps S22 to S26 are performed in a loop until the second expected signal sequence corresponding to every signal frame to be processed has been obtained, as sketched below.
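The sketch below ties steps S21 to S26 together for one frequency index, reusing the illustrative helpers from the earlier sketches; get_stacked_frame is a hypothetical accessor standing in for the framing and stacking described above.

import numpy as np

M, R, L = 4, 1, 3
N = M + R + M * L
T = 100                                       # number of frames (illustrative)

W = np.eye(N, dtype=complex)                  # S21: unmixing matrix -> identity
C = [np.zeros((N, N), dtype=complex) for _ in range(M)]  # S21: weighting -> zero

outputs = []
for tau in range(T):
    X_stack = get_stacked_frame(tau)          # hypothetical: [x; r; history] for frame tau

    y_first = W @ X_stack                     # S22: first expected signal estimate
    for m in range(M):                        # S23/S24: update C_m, then row m of W
        beta_m = contrast_weight(y_first[m])
        C[m] = update_weighted_cov(C[m], X_stack, beta_m)
        W[m, :] = update_unmix_row(W, C[m], m)  # sequentially updated W stands in for W(k, tau-1)

    W = mdp_rescale(W)                        # S25: MDP amplitude de-blurring
    outputs.append((W @ X_stack)[:M])         # S26: second expected signal sequence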
The above technical solution can achieve at least the following technical effects: the acquired sound signal is processed by an unmixing model that treats the reverberation and the echo as independent sound sources, distinct from the direct sound, and separates the reverberation and echo from the sound signal to obtain the direct sound signal, making speech enhancement more efficient, more convenient, and more effective.
Fig. 3 is a block diagram of a speech enhancement apparatus according to an exemplary embodiment. Referring to fig. 3, the apparatus includes an acquisition module 310, a processing module 320, and an output module 330.
The acquisition module 310 is configured to acquire sound signals.
The processing module 320 is configured to input each signal frame of the sound signal into the unmixing model, the unmixing model being a computational model that treats the direct sound, reverberation, and echo in a signal frame as independent sound sources and removes the reverberation and echo from the frame, where for any signal frame other than the first, the unmixing model computes the dereverberated, echo-free signal frame sequence for the current frame from the signal value sequence of the current frame and the signal value sequences of the historical signal frames preceding it.
The output module 330 is configured to obtain the dereverberated, echo-free target sound signal output by the unmixing model.
Optionally, the unmixing model is configured to obtain the target sound signal by: calculating a first expected signal sequence of the current signal frame based on the unmixing matrix corresponding to the previous signal frame; updating the unmixing matrix based on the first expected signal sequence; and calculating a second expected signal sequence based on the updated unmixing matrix, the second expected signal sequence being taken as the target sound signal.
Optionally, the unmixing model is further configured to calculate the first expected signal sequence from the signal value sequence of the current signal frame, the echo reference signal value sequence of the current signal frame, the signal value sequences of the historical signal frames, and the unmixing matrix corresponding to the previous signal frame.
Optionally, the unmixing model is further configured to update a weighted covariance matrix with the first expected signal sequence, and to update the unmixing matrix based on the updated weighted covariance matrix.
Optionally, the unmixing model is further configured to calculate the second expected signal sequence from the signal value sequence of the current signal frame, the echo reference signal value sequence of the current signal frame, the signal value sequences of the historical signal frames, and the updated unmixing matrix.
Optionally, the unmixing matrix is a preset matrix that establishes a mapping relationship among the signal value sequence of a signal frame, the signal value sequences of the historical signal frames preceding that frame, and the expected signal value sequence of the target sound signal in that frame, the mapping relationship being:

$$\mathbf{z}(k,\tau)=\mathbf{W}(k)\,\mathbf{X}(k,\tau),\qquad \mathbf{X}(k,\tau)=\begin{bmatrix}\mathbf{x}(k,\tau)\\ \mathbf{r}(k,\tau)\\ \bar{\mathbf{x}}(k,\tau-1)\end{bmatrix}$$

where x(k,τ) is the signal value sequence at the k-th frequency index of the τ-th frame, z(k,τ) is the direct sound signal value sequence at the k-th frequency index of the τ-th frame, r(k,τ) is the echo reference signal value sequence at the k-th frequency index of the τ-th frame, x̄(k,τ-1) is the stacked signal value sequence of a preset number of historical signal frames up to the τ-1-th frame, and W(k) is the unmixing matrix at the k-th frequency index.
Optionally, the unmixing model is further configured to perform amplitude de-blurring on the updated unmixing matrix.
The specific manner in which the modules of the apparatus in the above embodiments perform their operations has been described in detail in the method embodiments and is not repeated here.
The above technical solution can achieve at least the following technical effects: the acquired sound signal is processed by an unmixing model that treats the reverberation and the echo as independent sound sources, distinct from the direct sound, and separates the reverberation and echo from the sound signal to obtain the direct sound signal, making speech enhancement more efficient, more convenient, and more effective.
The present disclosure also provides a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the speech enhancement method provided by the present disclosure.
Fig. 4 is a block diagram illustrating an apparatus 400 for speech enhancement according to an example embodiment. For example, apparatus 400 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 4, apparatus 400 may include one or more of the following components: a processing component 402, a memory 404, a power component 406, a multimedia component 408, an audio component 410, an input/output (I/O) interface 412, a sensor component 414, and a communication component 416.
The processing component 402 generally controls the overall operation of the apparatus 400, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 402 may include one or more processors 420 to execute instructions to perform all or part of the steps of the speech enhancement method described above. Further, the processing component 402 can include one or more modules that facilitate interaction between the processing component 402 and other components. For example, the processing component 402 may include a multimedia module to facilitate interaction between the multimedia component 408 and the processing component 402.
Memory 404 is configured to store various types of data to support operations at apparatus 400. Examples of such data include instructions for any application or method operating on the apparatus 400, contact data, phonebook data, messages, pictures, videos, and the like. The memory 404 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power component 406 provides power to the various components of the device 400. The power components 406 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the apparatus 400.
The multimedia component 408 includes a screen that provides an output interface between the apparatus 400 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 408 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the apparatus 400 is in an operation mode, such as a photographing mode or a video mode. Each front or rear camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 410 is configured to output and/or input audio signals. For example, the audio component 410 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 400 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 404 or transmitted via the communication component 416. In some embodiments, audio component 410 further includes a speaker for outputting audio signals.
The I/O interface 412 provides an interface between the processing component 402 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 414 includes one or more sensors for providing status assessments of various aspects of the apparatus 400. For example, the sensor assembly 414 may detect the on/off state of the apparatus 400 and the relative positioning of components, such as the display and keypad of the apparatus 400; it may also detect a change in position of the apparatus 400 or of one of its components, the presence or absence of user contact with the apparatus 400, the orientation or acceleration/deceleration of the apparatus 400, and a change in the temperature of the apparatus 400. The sensor assembly 414 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 414 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 414 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 416 is configured to facilitate wired or wireless communication between the apparatus 400 and other devices. The apparatus 400 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 416 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 416 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 400 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for performing the above-described voice enhancement methods.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as the memory 404, including instructions executable by the processor 420 of the apparatus 400 to perform the above-described speech enhancement method. For example, the non-transitory computer readable storage medium may be a ROM, Random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
In another exemplary embodiment, a computer program product is also provided, comprising a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-described speech enhancement method when executed by the programmable apparatus.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (9)

1. A method of speech enhancement, comprising:
acquiring a sound signal;
inputting each signal frame of the sound signal into an unmixing model, wherein the unmixing model is a computational model that treats the direct sound, reverberation, and echo in a signal frame as independent sound sources and removes the reverberation and echo from the frame, and for any signal frame other than the first, the unmixing model computes the dereverberated, echo-free signal frame sequence for the current frame from the signal value sequence of the current frame and the signal value sequences of the historical signal frames preceding it;
obtaining the dereverberated, echo-free target sound signal output by the unmixing model;
wherein the unmixing model obtains the target sound signal by:
calculating a first expected signal sequence of the current signal frame based on the unmixing matrix corresponding to the previous signal frame;
updating the unmixing matrix based on the first expected signal sequence; and
calculating a second expected signal sequence based on the updated unmixing matrix, and taking the second expected signal sequence as the target sound signal.
2. The method of claim 1, wherein calculating the first expected signal sequence of the current signal frame based on the unmixing matrix corresponding to the previous signal frame comprises:
calculating the first expected signal sequence from the signal value sequence of the current signal frame, the echo reference signal value sequence of the current signal frame, the signal value sequences of the historical signal frames, and the unmixing matrix corresponding to the previous signal frame.
3. The method of claim 1, wherein updating the unmixing matrix based on the first expected signal sequence comprises:
updating a weighting matrix with the first expected signal sequence;
and updating the unmixing matrix based on the updated weighting matrix.
4. The method of claim 1, wherein calculating the second expected signal sequence based on the updated unmixing matrix comprises:
calculating the second expected signal sequence from the signal value sequence of the current signal frame, the echo reference signal value sequence of the current signal frame, the signal value sequences of the historical signal frames, and the updated unmixing matrix.
5. The method according to any one of claims 1-4, wherein the unmixing matrix is a preset matrix for establishing a mapping relationship among the signal value sequence of the current signal frame, the signal value sequences of the historical signal frames preceding the current signal frame, and the expected signal value sequence of the target sound signal in the current signal frame, the mapping relationship being:

$$\mathbf{z}(k,\tau)=\mathbf{W}(k)\,\mathbf{X}(k,\tau),\qquad \mathbf{X}(k,\tau)=\begin{bmatrix}\mathbf{x}(k,\tau)\\ \mathbf{r}(k,\tau)\\ \bar{\mathbf{x}}(k,\tau-1)\end{bmatrix}$$

where x(k,τ) is the signal value sequence at the k-th frequency index of the τ-th frame, z(k,τ) is the direct sound signal value sequence at the k-th frequency index of the τ-th frame, r(k,τ) is the echo reference signal value sequence at the k-th frequency index of the τ-th frame, x̄(k,τ-1) is the stacked signal value sequence of a preset number of historical signal frames up to the τ-1-th frame, and W(k) is the unmixing matrix at the k-th frequency index.
6. The method of claim 1, wherein before calculating the second expected signal sequence based on the updated unmixing matrix, the method further comprises:
performing amplitude de-blurring on the updated unmixing matrix.
7. A speech enhancement apparatus, comprising:
an acquisition module configured to acquire a sound signal;
a processing module configured to input each signal frame of the sound signal into an unmixing model, wherein the unmixing model is a computational model that treats the direct sound, reverberation, and echo in a signal frame as independent sound sources and removes the reverberation and echo from the frame, and for any signal frame other than the first, the unmixing model computes the dereverberated, echo-free signal frame sequence for the current frame from the signal value sequence of the current frame and the signal value sequences of the historical signal frames preceding it;
an output module configured to obtain the dereverberated, echo-free target sound signal output by the unmixing model;
wherein the unmixing model is configured to obtain the target sound signal by: calculating a first expected signal sequence of the current signal frame based on the unmixing matrix corresponding to the previous signal frame; updating the unmixing matrix based on the first expected signal sequence; and calculating a second expected signal sequence based on the updated unmixing matrix, and taking the second expected signal sequence as the target sound signal.
8. A speech enhancement apparatus, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring a sound signal;
inputting each signal frame of the sound signal into an unmixing model, wherein the unmixing model is a computational model that treats the direct sound, reverberation, and echo in a signal frame as independent sound sources and removes the reverberation and echo from the frame, and for any signal frame other than the first, the unmixing model computes the dereverberated, echo-free signal frame sequence for the current frame from the signal value sequence of the current frame and the signal value sequences of the historical signal frames preceding it;
obtaining the dereverberated, echo-free target sound signal output by the unmixing model;
wherein the unmixing model obtains the target sound signal by:
calculating a first expected signal sequence of the current signal frame based on the unmixing matrix corresponding to the previous signal frame;
updating the unmixing matrix based on the first expected signal sequence; and
calculating a second expected signal sequence based on the updated unmixing matrix, and taking the second expected signal sequence as the target sound signal.
9. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the steps of the method of any of claims 1-6.
CN202110649724.9A 2021-06-10 2021-06-10 Speech enhancement method, device and storage medium Active CN113223543B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110649724.9A CN113223543B (en) 2021-06-10 2021-06-10 Speech enhancement method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110649724.9A CN113223543B (en) 2021-06-10 2021-06-10 Speech enhancement method, device and storage medium

Publications (2)

Publication Number Publication Date
CN113223543A CN113223543A (en) 2021-08-06
CN113223543B (en) 2023-04-28

Family

ID=77080139

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110649724.9A Active CN113223543B (en) 2021-06-10 2021-06-10 Speech enhancement method, device and storage medium

Country Status (1)

Country Link
CN (1) CN113223543B (en)

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9099096B2 (en) * 2012-05-04 2015-08-04 Sony Computer Entertainment Inc. Source separation by independent component analysis with moving constraint
CN102750956B (en) * 2012-06-18 2014-07-16 歌尔声学股份有限公司 Method and device for removing reverberation of single channel voice
CN105931648B (en) * 2016-06-24 2019-05-03 百度在线网络技术(北京)有限公司 Audio signal solution reverberation method and device
US11138980B2 (en) * 2019-04-30 2021-10-05 Microsoft Technology Licensing, Llc Processing overlapping speech from distributed devices
CN110428852B (en) * 2019-08-09 2021-07-16 南京人工智能高等研究院有限公司 Voice separation method, device, medium and equipment
CN112489668B (en) * 2020-11-04 2024-02-02 北京百度网讯科技有限公司 Dereverberation method, device, electronic equipment and storage medium
CN112435685B (en) * 2020-11-24 2024-04-12 深圳市友杰智新科技有限公司 Blind source separation method and device for strong reverberation environment, voice equipment and storage medium

Also Published As

Publication number Publication date
CN113223543A (en) 2021-08-06

Similar Documents

Publication Publication Date Title
CN108510987B (en) Voice processing method and device
CN111709891B (en) Training method of image denoising model, image denoising method, device and medium
CN111128221B (en) Audio signal processing method and device, terminal and storage medium
CN111724823A (en) Information processing method and device and electronic equipment
CN113314135B (en) Voice signal identification method and device
CN111863012B (en) Audio signal processing method, device, terminal and storage medium
CN113223553B (en) Method, apparatus and medium for separating voice signal
CN113053406B (en) Voice signal identification method and device
CN113506582B (en) Voice signal identification method, device and system
CN106060707B (en) Reverberation processing method and device
CN113362848B (en) Audio signal processing method, device and storage medium
CN112447184B (en) Voice signal processing method and device, electronic equipment and storage medium
CN111667842B (en) Audio signal processing method and device
CN113810828A (en) Audio signal processing method and device, readable storage medium and earphone
CN110459236B (en) Noise estimation method, apparatus and storage medium for audio signal
CN110148424B (en) Voice processing method and device, electronic equipment and storage medium
CN113488066B (en) Audio signal processing method, audio signal processing device and storage medium
CN113223543B (en) Speech enhancement method, device and storage medium
CN113362841B (en) Audio signal processing method, device and storage medium
CN112434714A (en) Multimedia identification method, device, storage medium and electronic equipment
CN114648996A (en) Audio data processing method and device, voice interaction method, equipment and chip, sound box, electronic equipment and storage medium
CN113077808A (en) Voice processing method and device for voice processing
CN113194387A (en) Audio signal processing method, audio signal processing device, electronic equipment and storage medium
CN113345456B (en) Echo separation method, device and storage medium
CN111862288A (en) Pose rendering method, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant