CN113362841B - Audio signal processing method, device and storage medium - Google Patents


Info

Publication number
CN113362841B
Authority
CN
China
Prior art keywords
signal
frame
signal frame
unmixed
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110649720.0A
Other languages
Chinese (zh)
Other versions
CN113362841A (en)
Inventor
侯海宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd, Beijing Xiaomi Pinecone Electronic Co Ltd filed Critical Beijing Xiaomi Mobile Software Co Ltd
Priority to CN202110649720.0A
Publication of CN113362841A
Application granted
Publication of CN113362841B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0272 Voice signal separating
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 using predictive techniques
    • G10L19/26 Pre-filtering or post-filtering

Abstract

The present disclosure relates to an audio signal processing method, apparatus and storage medium. The method includes: acquiring a sound signal; inputting each signal frame in the sound signal into an unmixing model, wherein the unmixing model is a calculation model for removing reverberation from a signal frame, set by taking the direct sound and the reverberation in a signal frame as independent sound sources, and, for any non-first signal frame in the sound signal, the unmixing model calculates the signal value after removing reverberation for the current signal frame according to the signal value of the current signal frame and the signal values of the historical signal frames preceding the current signal frame; and obtaining the dereverberated sound signal output by the unmixing model. The present disclosure can improve dereverberation efficiency.

Description

Audio signal processing method, device and storage medium
Technical Field
The present disclosure relates to the field of sound processing, and in particular, to an audio signal processing method, apparatus, and storage medium.
Background
At present, most product devices use a microphone array to pick up sound, applying microphone beamforming or blind source separation techniques to improve the processing quality of speech signals and the speech recognition rate in real environments.
However, in real environments, reverberation arises because sound is reflected by walls, furniture, and the like. Reverberation degrades the beamforming and separation effects. The dereverberation module is therefore a very important module in the speech enhancement chain, yet current dereverberation techniques are inefficient and give unsatisfactory results.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides an audio signal processing method, apparatus, and storage medium.
According to a first aspect of the embodiments of the present disclosure, there is provided an audio signal processing method including: acquiring a sound signal; inputting each signal frame in the sound signal into an unmixing model, wherein the unmixing model is a calculation model for removing reverberation from a signal frame, set by taking the direct sound and the reverberation in a signal frame as independent sound sources, and, for any non-first signal frame in the sound signal, the unmixing model can calculate the signal value after removing reverberation for the current signal frame according to the signal value of the current signal frame and the signal values of the historical signal frames preceding the current signal frame; and obtaining the dereverberated sound signal output by the unmixing model.
Optionally, the unmixed model is configured to estimate a desired signal value of the direct sound in the current signal frame, and take the estimated result as a signal value after reverberation is removed from the current signal frame, where the unmixed model includes an unmixed matrix, where the unmixed matrix is a preset matrix for establishing a mapping relationship between a signal value of the current signal frame, a signal value of a historical signal frame preceding the current signal frame, and a desired signal value of the direct sound in the current signal frame.
Optionally, the unmixed model estimates the desired signal value by: calculating a first expected signal value of direct sound in the current signal frame based on the unmixed matrix corresponding to the previous signal frame of the current signal frame; updating the unmixed matrix based on the first desired signal value; and calculating a second expected signal value of the direct sound in the current signal frame based on the updated unmixed matrix, and taking the second expected signal value as the signal value after reverberation removal.
Optionally, the calculating, based on the unmixed matrix corresponding to the signal frame preceding the current signal frame, of a first expected signal value of the direct sound in the current signal frame includes: determining the first expected signal value according to the signal value of the current signal frame, the signal values of the historical signal frames, and the unmixed matrix corresponding to the signal frame preceding the current signal frame.
Optionally, the mapping relationship is:

$$\begin{bmatrix} d_{n,k} \\ x_{\Delta,k} \end{bmatrix} = W_k \begin{bmatrix} x_{n,k} \\ x_{\Delta,k} \end{bmatrix}$$

wherein $d_{n,k}$ is the desired signal value of the direct sound at the $k$th frequency index of the $n$th frame, $x_{\Delta,k}$ is a vector characterizing the sequence of signal values of the $k$th frequency index within the frames preceding the $n$th frame by at least $\Delta$ frames, $x_{n,k}$ is the signal value at the $k$th frequency index of the $n$th frame, and $W_k$ is the unmixed matrix, wherein:

$$W_k = \begin{bmatrix} w_{1,k}^{\mathsf{H}} \\ w_{2,k}^{\mathsf{H}} \end{bmatrix}$$

$w_{1,k}$ is the first sub-unmixed matrix, $w_{2,k}$ is the second sub-unmixed matrix, and $\bar{w}_k$ is the vector of unmixed filter coefficients in the unmixed model.
Optionally, the updating of the unmixed matrix based on the first expected signal value includes: updating a weighted covariance matrix with the first expected signal value; and updating the unmixed matrix based on the updated weighted covariance matrix.
Optionally, the calculating, based on the updated unmixed matrix, a second desired signal value of the direct sound in the current signal frame includes: and determining the second expected signal value according to the signal value of the current signal frame, the signal value of the historical signal frame and the updated unmixed matrix.
According to a second aspect of the embodiments of the present disclosure, there is provided an audio signal processing apparatus including: an acquisition module configured to acquire a sound signal; a processing module configured to input each signal frame in the sound signal into an unmixed model, where the unmixed model is a calculation model for removing reverberation from a signal frame, set by taking the direct sound and the reverberation in a signal frame as independent sound sources, and, for any non-first signal frame in the sound signal, the unmixed model can calculate the signal value after removing reverberation for the current signal frame according to the signal value of the current signal frame and the signal values of the historical signal frames preceding the current signal frame; and an output module configured to obtain the dereverberated sound signal output by the unmixed model.
Optionally, the unmixed model is configured to estimate a desired signal value of the direct sound in the signal frame, and take the estimated result as a signal value after reverberation is removed from the signal frame, where the unmixed model includes an unmixed matrix, where the unmixed matrix is a preset matrix for establishing a mapping relationship between a signal value of a current signal frame, a signal value of a historical signal frame before the current signal frame, and a desired signal value of the direct sound in the current signal frame.
Optionally, the unmixed model is configured to estimate the desired signal value by: calculating a first expected signal value of direct sound in the current signal frame based on the unmixed matrix corresponding to the previous signal frame of the current signal frame; updating the unmixed matrix based on the first desired signal value; and calculating a second expected signal value of the direct sound in the current signal frame based on the updated unmixed matrix, and taking the second expected signal value as the signal value after reverberation removal.
Optionally, the unmixed model is further configured to determine the first expected signal value according to a signal value of the current signal frame, a signal value of the historical signal frame, and the unmixed matrix corresponding to a signal frame previous to the current signal frame.
Optionally, the mapping relationship is:

$$\begin{bmatrix} d_{n,k} \\ x_{\Delta,k} \end{bmatrix} = W_k \begin{bmatrix} x_{n,k} \\ x_{\Delta,k} \end{bmatrix}$$

wherein $d_{n,k}$ is the desired signal value of the direct sound at the $k$th frequency index of the $n$th frame, $x_{\Delta,k}$ is a vector characterizing the sequence of signal values of the $k$th frequency index within the frames preceding the $n$th frame by at least $\Delta$ frames, $x_{n,k}$ is the signal value at the $k$th frequency index of the $n$th frame, and $W_k$ is the unmixed matrix, wherein:

$$W_k = \begin{bmatrix} w_{1,k}^{\mathsf{H}} \\ w_{2,k}^{\mathsf{H}} \end{bmatrix}$$

$w_{1,k}$ is the first sub-unmixed matrix, $w_{2,k}$ is the second sub-unmixed matrix, and $\bar{w}_k$ is the vector of unmixed filter coefficients in the unmixed model.
Optionally, the unmixed model is further configured to update a weighted covariance matrix with the first desired signal value; and updating the unmixed matrix based on the updated weighted covariance matrix.
Optionally, the unmixed model is further configured to determine the second desired signal value from the signal values of the current signal frame, the signal values of the historical signal frames, and the updated unmixed matrix.
According to a third aspect of the embodiments of the present disclosure, there is provided an audio signal processing apparatus comprising a processor and a memory, the memory storing processor-executable instructions, the processor being configured to: acquire a sound signal; input each signal frame in the sound signal into an unmixing model, wherein the unmixing model is a calculation model for removing reverberation from a signal frame, set by taking the direct sound and the reverberation in a signal frame as independent sound sources, and, for any non-first signal frame in the sound signal, the unmixing model can calculate the signal value after removing reverberation for the current signal frame according to the signal value of the current signal frame and the signal values of the historical signal frames preceding the current signal frame; and obtain the dereverberated sound signal output by the unmixing model.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the audio signal processing method provided by the first aspect of the present disclosure.
The technical solution provided by the embodiments of the present disclosure can have the following beneficial effects: the obtained sound signal is processed by the unmixed model, which treats the reverberation and the direct sound as independent sound sources, so that the reverberation in the sound signal is separated out and a dereverberated sound signal is obtained; reverberation can thus be removed more quickly and conveniently.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flowchart illustrating an audio signal processing method according to an exemplary embodiment.
Fig. 2 is a schematic diagram illustrating a flow of audio signal processing according to an exemplary embodiment.
Fig. 3 is a block diagram of an audio signal processing apparatus according to an exemplary embodiment.
Fig. 4 is a block diagram of an apparatus according to an example embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
Fig. 1 is a flowchart illustrating an audio signal processing method according to an exemplary embodiment, and as shown in fig. 1, the audio signal processing method may be applied to various electronic devices, for example, a mobile phone, a computer, a tablet computer, a recording device, etc. The method comprises the following steps.
In step S11, a sound signal is acquired.
In step S12, each signal frame in the sound signal is input to the unmixed model.
The unmixed model is a calculation model which is set by taking direct sound and reverberation in a signal frame as independent sound sources and is used for removing the reverberation in the signal frame, and the unmixed model can calculate the signal value after removing the reverberation for the current signal frame according to the signal value of the current signal frame and the signal value of a historical signal frame before the current signal frame aiming at any non-first signal frame in the sound signal.
The unmixing model is used for estimating expected signal values of direct sound in the signal frame and taking an estimation result as the signal values after reverberation is removed from the signal frame, wherein the unmixing model comprises an unmixing matrix, and the unmixing matrix is a preset matrix used for establishing a mapping relation among the signal values of the signal frame, the signal values of a historical signal frame before the signal frame and the expected signal values of the direct sound in the signal frame.
The historical signal frames are divided into long-term historical frames and recent historical frames: frames close to the current signal frame are recent historical frames, and frames far from it are long-term historical frames. Because the reflected sound recorded in the recent historical frames is very close to the direct-sound signal and cannot be reliably distinguished from it, the historical frames used in the present disclosure may be the long-term historical frames. For example, if the current signal frame is $x_{n,k}$, the recent historical frames are $x_{n-\Delta+1,k}$ to $x_{n-1,k}$, and the long-term historical frames $x_{\Delta,k}$ may be $x_{n-\Delta-L+1,k}$ to $x_{n-\Delta,k}$, where $\Delta$ is the number of recent historical frames and $L$ is the number of long-term historical frames.
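To make the frame partitioning concrete, the long-term history vector can be gathered from a single microphone's STFT by simple slicing, as in the following minimal sketch (the names `stft`, `delta`, and `taps` are illustrative assumptions, not from the patent):

```python
import numpy as np

def history_vector(stft, n, k, delta, taps):
    """Collect the long-term history x_{Δ,k} for frame n at frequency bin k.

    stft  : complex array (num_frames, num_bins), one microphone's STFT
    delta : Δ, the number of recent history frames skipped because they are
            too close to the direct sound to be distinguished from it
    taps  : L, the number of long-term history frames actually used
    """
    # frames n-Δ-L+1 .. n-Δ, oldest first (assumes n >= delta + taps - 1)
    start = n - delta - taps + 1
    return stft[start:n - delta + 1, k]
```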
In step S13, a sound signal after removing reverberation, which is output by the unmixed model, is obtained.
Since the reverberation appearing in any signal frame originates from the direct sound of earlier signal frames recorded after reflection, the signal value of the direct sound in the current signal frame may be obtained by comparing the signal value of the current frame with the signal values of the frames preceding it. Thus, in step S12, the mapping relationship among the signal value of the currently processed signal frame, the signal values of the signal frames preceding it, and the signal value of the direct sound in the currently processed signal frame may be described by the following expression:

$$\begin{bmatrix} d_{n,k} \\ x_{\Delta,k} \end{bmatrix} = W_k\, \bar{x}_{n,k}$$

wherein $d_{n,k}$ is the desired signal value of the direct sound at the $k$th frequency index of the $n$th frame, $x_{\Delta,k}$ is a vector characterizing the sequence of signal values of the $k$th frequency index within a preset number of frames ending at the $(n-\Delta)$th frame, $x_{n,k}$ is the signal value at the $k$th frequency index of the $n$th frame, and $W_k$ is the unmixed matrix,

$$W_k = \begin{bmatrix} w_{1,k}^{\mathsf{H}} \\ w_{2,k}^{\mathsf{H}} \end{bmatrix}$$

$w_{1,k}$ is the first sub-unmixed matrix, $w_{2,k}$ is the second sub-unmixed matrix, and $\bar{w}_k$ is the vector of unmixed filter coefficients, where

$$\bar{x}_{n,k} = \left[x_{n,k},\, x_{\Delta,k}^{\mathsf{T}}\right]^{\mathsf{T}}$$
The unmixed model is used for estimating the expected signal value through the unmixed matrix, and taking a calculation result as the signal value after reverberation removal.
The mapping relationship is obtained by rewriting an observation model.

The observed signal of any one microphone can be expressed as:

$$x_{n,k}^{(m)} = a_k^{(m)}\, d_{n,k} + r_{n,k}^{(m)}$$

where $m$ is the microphone index, taking values 1 to $M$, $a_k^{(m)}$ are the filter coefficients, $d_{n,k}$ is the direct-sound signal value, and $r_{n,k}^{(m)}$ is the reverberant signal value.

Rewriting the above observation model in matrix form yields the following mixing model:

$$\bar{x}_{n,k} = A_k \begin{bmatrix} d_{n,k} \\ x_{\Delta,k} \end{bmatrix}$$

where $\bar{a}_k$ is the vector of filter coefficients in the mixing model and $A_k$ is the mixing matrix built from it.

Inverting the mixing model yields the unmixed model:

$$\begin{bmatrix} d_{n,k} \\ x_{\Delta,k} \end{bmatrix} = W_k\, \bar{x}_{n,k}$$

and $W_k = A_k^{-1}$ can be defined as the unmixed matrix, where $\bar{w}_k$ is the vector of filter coefficients in the unmixed matrix.
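To make the shape of the unmixed relation concrete, here is a toy numeric sketch under the stacked-vector reading above (the values and the two-tap history are arbitrary, and the identity matrix simply plays the role of the initial unmixed matrix):

```python
import numpy as np

x_n = 0.8 + 0.3j                               # current-frame value x_{n,k}
x_delta = np.array([0.1 - 0.2j, 0.05 + 0.1j])  # long-term history x_{Δ,k}

x_bar = np.concatenate(([x_n], x_delta))       # stacked vector [x_{n,k}, x_{Δ,k}]
W = np.eye(3, dtype=complex)                   # unmixing matrix W_k (identity init)
d = (W @ x_bar)[0]                             # direct sound d_{n,k}: first element
```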
In one possible implementation, the unmixed model estimates the desired signal value as follows: a first expected signal value of the direct sound in the current signal frame is calculated based on the unmixed matrix corresponding to the previous signal frame; the unmixed matrix is updated based on the first expected signal value; and a second expected signal value of the direct sound in the current signal frame is calculated based on the updated unmixed matrix and taken as the signal value after reverberation removal.
That is, the expected signal value of the current frame may first be estimated using the first sub-unmixed matrix of the previous signal frame; the unmixed matrix of the current frame is then updated by a weighted calculation on this estimate, and the updated matrix is used both to calculate the expected signal value of the current frame and to estimate that of the next frame, so that the expected signal values of all signal frames are obtained by continuous iteration. The first expected signal value is estimated with the previous frame's unmixed matrix and may therefore deviate considerably; the second expected signal value is calculated with the current frame's unmixed matrix and deviates less from the actual direct-sound signal value.
Since the direct sound $d_{n,k}$ is the first element of the left-hand vector in the expression above, solving the unmixed matrix only requires obtaining its first sub-matrix, namely the first sub-unmixed matrix, which can be used to obtain both the first expected signal value and the second expected signal value.
For the sound signal value of the first frame, there is no earlier signal frame, so when solving the signal of the first frame, the unmixed matrix may be initialized to an identity matrix for iterative operation.
In one possible implementation, the first expected signal value may be determined according to a signal value of the current signal frame, a signal value of the historical signal frame, and the unmixed matrix corresponding to a signal frame previous to the current signal frame.
For example, the first desired signal value $d'_{n,k}$ can be estimated by the following formula:

$$d'_{n,k} = w_{1,k}^{\mathsf{H}}(n-1)\, \bar{x}_{n,k}$$

$$\bar{x}_{n,k} = \left[x_{n,k},\, x_{\Delta,k}^{\mathsf{T}}\right]^{\mathsf{T}}$$

where $w_{1,k}(n-1)$ is the first sub-unmixed matrix corresponding to the $k$th frequency index of the $(n-1)$th frame.
After the first desired signal value is obtained, the first sub-unmixed matrix may be updated with it. In a possible implementation, a weighted covariance matrix is updated with the first desired signal value, and the unmixed matrix is then updated based on the updated weighted covariance matrix.
For example, the update of the first sub-unmixed matrix may be performed by:

$$C_k(n) = \alpha\, C_k(n-1) + \beta_k(n)\, \bar{x}_{n,k}\, \bar{x}_{n,k}^{\mathsf{H}}$$

$$w_{1,k}(n) = C_k^{-1}(n)\, i_1$$

where $C_k(n)$ is the weighting matrix for the $k$th frequency index of the $n$th frame, $i_1 = [1, 0]^{\mathsf{T}}$,

$$\beta_k(n) = (1 - \alpha)\, \varphi_{n,k}$$

$\beta_k(n)$ is a weighted smoothing coefficient, $\alpha$ is a preset smoothing coefficient,

$$\varphi_{n,k} = \frac{\gamma}{2\, |d'_{n,k}|^{2-\gamma} + \delta}$$

$\varphi_{n,k}$ is the comparison function of the $k$th frequency index of the $n$th frame, $\delta$ is a preset parameter preventing division by zero, for which any sufficiently small number may be used, and $\gamma$ is a preset shape parameter.
In practice, $\alpha$ may be set to 0.99, and $\gamma$ may be adjusted manually according to the distribution of the sound sources to improve the accuracy of the unmixed model. If the expected signal value produced by the unmixed model differs greatly from the actual direct-sound signal value, the assumed source distribution can be adjusted by tuning the value of $\gamma$. In the present disclosure, the initial value of $\gamma$ may be set to 0.2.
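For illustration, here is a minimal sketch of how $\gamma$ shapes the per-frame weight, assuming the comparison function reconstructed above (the function name and the exact formula are assumptions drawn from that reconstruction):

```python
def comparison_weight(d_abs, gamma=0.2, delta_eps=1e-6):
    """phi = gamma / (2*|d'|^(2-gamma) + delta): louder estimates get smaller weights."""
    return gamma / (2.0 * d_abs ** (2.0 - gamma) + delta_eps)

# A larger gamma makes the weighting less steep across quiet and loud frames:
for g in (0.2, 0.5, 1.0):
    print(g, [comparison_weight(a, gamma=g) for a in (0.1, 1.0, 10.0)])
```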
The first sub-unmixed matrix and the weighting matrix may both be initialized; the weighting matrix may be an arbitrarily chosen initial matrix or a zero matrix. In one possible implementation, the unmixed matrix corresponding to frame 0 is initialized as $W_k(0) = I_{ML \times ML}$ and the weighting matrix as $C_k(0) = 0_{ML \times ML}$, where $I_{ML \times ML}$ is an identity matrix of $ML$ rows and $ML$ columns and $0_{ML \times ML}$ is a zero matrix of $ML$ rows and $ML$ columns.
After the first sub-unmixed matrix of the current frame has been updated, the more accurate matrix can be used to calculate a desired signal value more accurate than the first one, i.e., the second desired signal value. The second desired signal value may be determined from the signal value of the current signal frame, the signal values of the historical signal frames, and the updated unmixed matrix.
In one possible implementation, the second desired signal value may be calculated by the following formula:

$$d_{n,k} = w_{1,k}^{\prime\,\mathsf{H}}(n)\, \bar{x}_{n,k}$$

where $w'_{1,k}(n)$ is the normalized first sub-unmixed matrix corresponding to the $k$th frequency index of the $n$th frame and $\bar{x}_{n,k}$ is the stacked signal vector defined above.
Before the second desired signal value is calculated, the first sub-unmixed matrix may be normalized by dividing each of its elements by a specified element. For example, the specified element may be the first element of the matrix, in which case the normalized first sub-unmixed matrix $w'_{1,k}(n)$ is the result of dividing $w_{1,k}(n)$ by its first element $w_1$.
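A minimal sketch of this normalization followed by the second estimate (variable names assumed; `w1` holds $w_{1,k}(n)$ and `x_bar` the stacked vector):

```python
import numpy as np

def second_estimate(w1, x_bar):
    """Normalize w1 by its first element, then re-estimate the direct sound."""
    w1_norm = w1 / w1[0]             # divide every element by the first element
    return np.conj(w1_norm) @ x_bar  # d_{n,k} = w'^H_{1,k}(n) x_bar
```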
When more than one recording device or recording unit performs recording, the processing of step S12 may be applied to the sound signal obtained by each recording device or unit, so as to obtain that device's dereverberated sound signal. For example, when recording with a microphone array, the processing of step S12 may be applied to the sound signal recorded by each microphone.
After the dereverberated sound signal is obtained, it may be sent to a speech processing unit so that the unit converts it into operation instructions; the speech processing unit may, for example, correspond to a voice assistant program on the device. The sound signal may also be sent to a speech recognition unit to be converted into text. After dereverberation, the sound signal contains less noise, which facilitates the extraction and processing of the speech.
Fig. 2 is a schematic diagram illustrating a flow of audio signal processing according to an exemplary embodiment. When multiple recording devices or recording units record sound, the sound signal obtained by each recording device or unit may be processed by the flow shown in fig. 2.
In step S21, the unmixed matrix is initialized to an identity matrix, and the weighting matrix is initialized to a zero matrix.
In step S22, a first expected signal value of an nth frame is estimated by a first sub-unmixed matrix corresponding to the nth-1 frame.
In step S23, the weighting matrix corresponding to the nth frame is updated.
In step S24, the first sub-downmix matrix corresponding to the nth frame is updated.
In step S25, normalization processing is performed on the first sub-unmixed matrix corresponding to the nth frame.
In step S26, a second desired signal value corresponding to the nth frame is calculated by the normalized first sub-unmixed matrix.
Step S22 to step S26 are steps performed in a loop until the second desired signal value corresponding to each signal frame to be dereverberated is obtained.
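Putting steps S21 to S26 together, one per-frequency-bin iteration might look like the sketch below. This is a reconstruction under the formula readings given earlier, not the patent's reference implementation; the dimensions, parameter defaults, comparison function, and the diagonal loading added for numerical safety are all assumptions.

```python
import numpy as np

def dereverb_bin(frames, delta=2, taps=10, alpha=0.99, gamma=0.2, eps=1e-6):
    """Online dereverberation of one frequency bin k (loop S21-S26).

    frames : complex array (num_frames,), STFT values x_{n,k} of one microphone
    Returns the second desired signal values d_{n,k} frame by frame.
    """
    dim = 1 + taps                               # stacked [x_{n,k}, x_{Δ,k}]
    w1 = np.eye(dim, dtype=complex)[0]           # S21: unmixing matrix -> identity
    C = np.zeros((dim, dim), dtype=complex)      # S21: weighting matrix -> zeros
    i1 = np.zeros(dim, dtype=complex)
    i1[0] = 1.0
    out = frames.astype(complex)

    for n in range(len(frames)):
        start = n - delta - taps + 1
        if start < 0:                            # not enough history: pass through
            continue
        x_bar = np.concatenate(([frames[n]], frames[start:n - delta + 1]))

        d1 = np.conj(w1) @ x_bar                 # S22: first desired signal value

        phi = gamma / (2.0 * np.abs(d1) ** (2.0 - gamma) + eps)
        C = alpha * C + (1.0 - alpha) * phi * np.outer(x_bar, np.conj(x_bar))  # S23

        w1 = np.linalg.solve(C + eps * np.eye(dim), i1)  # S24: w1 = C^{-1} i1

        w1 = w1 / w1[0]                          # S25: normalize by first element

        out[n] = np.conj(w1) @ x_bar             # S26: second desired signal value
    return out
```

Applying `dereverb_bin` to every frequency bin of each microphone's STFT, then inverting the STFT, mirrors the per-device processing described for step S12.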
Through the technical scheme, at least the following technical effects can be achieved:
the obtained sound signals are processed through the unmixed model which distinguishes the reverberation from the independent sound source and the direct sound, the reverberation in the sound signals is separated, and the sound signals after the reverberation is removed are obtained, so that the reverberation in the sound signals can be removed more quickly and conveniently.
Fig. 3 is a block diagram of an audio signal processing device according to an exemplary embodiment. Referring to fig. 3, the apparatus includes an acquisition module 310, a processing module 320, and an output module 330.
The acquisition module 310 is configured to acquire sound signals.
The processing module 320 is configured to input each signal frame in the sound signal into an unmixed model, where the unmixed model is a calculation model for removing reverberation from a signal frame, set by taking the direct sound and the reverberation in a signal frame as independent sound sources; for any non-first signal frame in the sound signal, the unmixed model can calculate the signal value after removing reverberation for the current signal frame according to the signal value of the current signal frame and the signal values of the historical signal frames preceding the current signal frame.
The output module 330 is configured to obtain the reverberated sound signal output by the unmixed model.
Optionally, the unmixed model is configured to estimate a desired signal value of the direct sound in the signal frame, and take the estimated result as a signal value after reverberation is removed from the signal frame, where the unmixed model includes an unmixed matrix, where the unmixed matrix is a preset matrix for establishing a mapping relationship between a signal value of the signal frame, a signal value of a historical signal frame preceding the signal frame, and a desired signal value of the direct sound in the signal frame.
Optionally, the unmixed model is configured to estimate the desired signal value by: calculating a first expected signal value of direct sound in the current signal frame based on the unmixed matrix corresponding to the previous signal frame of the current signal frame; updating the unmixed matrix based on the first desired signal value; and calculating a second expected signal value of the direct sound in the current signal frame based on the updated unmixed matrix, and taking the second expected signal value as the signal value after reverberation removal.
Optionally, the unmixed model is further configured to determine the first expected signal value according to a signal value of the current signal frame, a signal value of the historical signal frame, and the unmixed matrix corresponding to a signal frame previous to the current signal frame.
Optionally, the mapping relationship is:

$$\begin{bmatrix} d_{n,k} \\ x_{\Delta,k} \end{bmatrix} = W_k \begin{bmatrix} x_{n,k} \\ x_{\Delta,k} \end{bmatrix}$$

wherein $d_{n,k}$ is the desired signal value of the direct sound at the $k$th frequency index of the $n$th frame, $x_{\Delta,k}$ is a vector characterizing the sequence of signal values of the $k$th frequency index within the frames preceding the $n$th frame by at least $\Delta$ frames, $x_{n,k}$ is the signal value at the $k$th frequency index of the $n$th frame, and $W_k$ is the unmixed matrix, wherein:

$$W_k = \begin{bmatrix} w_{1,k}^{\mathsf{H}} \\ w_{2,k}^{\mathsf{H}} \end{bmatrix}$$

$w_{1,k}$ is the first sub-unmixed matrix, $w_{2,k}$ is the second sub-unmixed matrix, and $\bar{w}_k$ is the vector of unmixed filter coefficients in the unmixed model.
Optionally, the unmixed model is further configured to update a weighted covariance matrix with the first desired signal value; and updating the unmixed matrix based on the updated weighted covariance matrix.
Optionally, the unmixed model is further configured to determine the second desired signal value from the signal values of the current signal frame, the signal values of the historical signal frames, and the updated unmixed matrix.
The specific manner in which the various modules perform operations in the apparatus of the above embodiment has been described in detail in the embodiments of the method, and will not be elaborated here.
The obtained sound signal is processed by the unmixed model, which treats the reverberation and the direct sound as independent sound sources, so that the reverberation in the sound signal is separated out and a dereverberated sound signal is obtained; reverberation can thus be removed more quickly and conveniently.
The present disclosure also provides a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the audio signal processing method provided by the present disclosure.
Fig. 4 is a block diagram illustrating an apparatus 400 for audio signal processing according to an exemplary embodiment. For example, apparatus 400 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 4, apparatus 400 may include one or more of the following components: a processing component 402, a memory 404, a power component 406, a multimedia component 408, an audio component 410, an input/output (I/O) interface 412, a sensor component 414, and a communication component 416.
The processing component 402 generally controls the overall operation of the apparatus 400, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 402 may include one or more processors 420 to execute instructions to perform all or part of the steps of the audio signal processing method described above. Further, the processing component 402 can include one or more modules that facilitate interaction between the processing component 402 and other components. For example, the processing component 402 may include a multimedia module to facilitate interaction between the multimedia component 408 and the processing component 402.
Memory 404 is configured to store various types of data to support operations at apparatus 400. Examples of such data include instructions for any application or method operating on the apparatus 400, contact data, phonebook data, messages, pictures, videos, and the like. The memory 404 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power component 406 provides power to the various components of the device 400. The power components 406 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the apparatus 400.
The multimedia component 408 includes a screen between the device 400 and the user that provides an output interface. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 408 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the apparatus 400 is in an operation mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 410 is configured to output and/or input audio signals. For example, the audio component 410 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 400 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 404 or transmitted via the communication component 416. In some embodiments, audio component 410 further includes a speaker for outputting audio signals.
The I/O interface 412 provides an interface between the processing component 402 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 414 includes one or more sensors for providing status assessments of various aspects of the apparatus 400. For example, the sensor assembly 414 may detect the on/off state of the device 400 and the relative positioning of components, such as the display and keypad of the device 400; the sensor assembly 414 may also detect a change in position of the device 400 or of a component of the device 400, the presence or absence of user contact with the device 400, the orientation or acceleration/deceleration of the device 400, and a change in temperature of the device 400. The sensor assembly 414 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact. The sensor assembly 414 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 414 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 416 is configured to facilitate communication between the apparatus 400 and other devices in a wired or wireless manner. The apparatus 400 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 416 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 416 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 400 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for performing the above-described audio signal processing methods.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as the memory 404 including instructions executable by the processor 420 of the apparatus 400 to perform the above-described audio signal processing method. For example, the non-transitory computer readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, etc.
In another exemplary embodiment, a computer program product is also provided, comprising a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-mentioned audio signal processing method when being executed by the programmable apparatus.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure following the general principles thereof and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (8)

1. An audio signal processing method, comprising:
acquiring a sound signal;
inputting each signal frame in the sound signal into an unmixing model, wherein the unmixing model is a calculation model for removing reverberation from a signal frame, set by taking the direct sound and the reverberation in a signal frame as independent sound sources, and, for any non-first signal frame in the sound signal, the unmixing model calculates the signal value after removing reverberation for the current signal frame according to the signal value of the current signal frame and the signal values of the historical signal frames preceding the current signal frame;
obtaining a sound signal which is output by the unmixed model and is subjected to reverberation removal;
the unmixing model is used for estimating expected signal values of direct sound in the signal frames and taking an estimation result as a signal value after reverberation is removed from the signal frames, wherein the unmixing model comprises an unmixing matrix, and the unmixing matrix is a preset matrix for establishing a mapping relation among signal values of a current signal frame, signal values of historical signal frames before the current signal frame and expected signal values of direct sound in the current signal frame;
wherein the unmixed model estimates the desired signal value by:
calculating a first expected signal value of direct sound in the current signal frame based on the unmixed matrix corresponding to the previous signal frame of the current signal frame;
updating the unmixed matrix based on the first desired signal value;
and calculating a second expected signal value of the direct sound in the current signal frame based on the updated unmixed matrix, and taking the second expected signal value as the signal value after reverberation removal.
2. The method according to claim 1, wherein the calculating, based on the unmixed matrix corresponding to the signal frame preceding the current signal frame, of a first desired signal value of the direct sound in the current signal frame comprises:
and determining the first expected signal value according to the signal value of the current signal frame, the signal value of the historical signal frame and the unmixed matrix corresponding to the previous signal frame of the current signal frame.
3. The method according to any one of claims 1-2, wherein the mapping relationship is:

$$\begin{bmatrix} d_{n,k} \\ x_{\Delta,k} \end{bmatrix} = W_k \begin{bmatrix} x_{n,k} \\ x_{\Delta,k} \end{bmatrix}$$

wherein $d_{n,k}$ is the desired signal value of the direct sound at the $k$th frequency index of the $n$th frame, $x_{\Delta,k}$ is a vector characterizing the sequence of signal values of the $k$th frequency index within the frames preceding the $n$th frame by at least $\Delta$ frames, $x_{n,k}$ is the signal value at the $k$th frequency index of the $n$th frame, and $W_k$ is the unmixed matrix, wherein:

$$W_k = \begin{bmatrix} w_{1,k}^{\mathsf{H}} \\ w_{2,k}^{\mathsf{H}} \end{bmatrix}$$

$w_{1,k}$ is the first sub-unmixed matrix, $w_{2,k}$ is the second sub-unmixed matrix, and $\bar{w}_k$ is the vector of unmixed filter coefficients in the unmixed model.
4. The method of claim 1, wherein the updating the unmixed matrix based on the first desired signal value comprises:
updating a weighted covariance matrix by the first desired signal value;
and updating the unmixed matrix based on the updated weighted covariance matrix.
5. The method of claim 1, wherein the calculating of a second desired signal value of the direct sound in the current signal frame based on the updated unmixed matrix comprises:
and determining the second expected signal value according to the signal value of the current signal frame, the signal value of the historical signal frame and the updated unmixed matrix.
6. An audio signal processing apparatus, comprising:
an acquisition module configured to acquire a sound signal;
a processing module configured to input each signal frame in the sound signal into an unmixed model, wherein the unmixed model is a calculation model for removing reverberation from a signal frame, set by taking the direct sound and the reverberation in a signal frame as independent sound sources, and, for any non-first signal frame in the sound signal, the unmixed model can calculate the signal value after removing reverberation for the current signal frame according to the signal value of the current signal frame and the signal values of the historical signal frames preceding the current signal frame;
an output module configured to obtain a reverberated sound signal output by the unmixed model;
the method comprises the steps that a unmixing model is used for estimating expected signal values of direct sound in a signal frame, and taking an estimation result as a signal value after reverberation is removed from the signal frame, wherein the unmixing model comprises an unmixing matrix, and the unmixing matrix is a preset matrix used for establishing a mapping relation among the signal values of the signal frame, the signal values of a historical signal frame before the signal frame and the expected signal values of the direct sound in the signal frame;
wherein the unmixed model is configured to estimate the desired signal value by: calculating a first expected signal value of direct sound in the current signal frame based on the unmixed matrix corresponding to the previous signal frame of the current signal frame; updating the unmixed matrix based on the first desired signal value; and calculating a second expected signal value of the direct sound in the current signal frame based on the updated unmixed matrix, and taking the second expected signal value as the signal value after reverberation removal.
7. An audio signal processing apparatus, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring a sound signal;
inputting each signal frame in the sound signal into an unmixing model, wherein the unmixing model is a calculation model for removing reverberation from a signal frame, set by taking the direct sound and the reverberation in a signal frame as independent sound sources, and, for any non-first signal frame in the sound signal, the unmixing model can calculate the signal value after removing reverberation for the current signal frame according to the signal value of the current signal frame and the signal values of the historical signal frames preceding the current signal frame;
obtaining a sound signal which is output by the unmixed model and is subjected to reverberation removal;
the unmixing model is used for estimating expected signal values of direct sound in the signal frames and taking an estimation result as a signal value after reverberation is removed from the signal frames, wherein the unmixing model comprises an unmixing matrix, and the unmixing matrix is a preset matrix for establishing a mapping relation among signal values of a current signal frame, signal values of historical signal frames before the current signal frame and expected signal values of direct sound in the current signal frame;
wherein the unmixed model estimates the desired signal value by:
calculating a first expected signal value of direct sound in the current signal frame based on the unmixed matrix corresponding to the previous signal frame of the current signal frame;
updating the unmixed matrix based on the first desired signal value;
and calculating a second expected signal value of the direct sound in the current signal frame based on the updated unmixed matrix, and taking the second expected signal value as the signal value after reverberation removal.
8. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the steps of the method of any of claims 1-5.
CN202110649720.0A 2021-06-10 2021-06-10 Audio signal processing method, device and storage medium Active CN113362841B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110649720.0A CN113362841B (en) 2021-06-10 2021-06-10 Audio signal processing method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110649720.0A CN113362841B (en) 2021-06-10 2021-06-10 Audio signal processing method, device and storage medium

Publications (2)

Publication Number Publication Date
CN113362841A CN113362841A (en) 2021-09-07
CN113362841B (en) 2023-05-02

Family

ID=77533641

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110649720.0A Active CN113362841B (en) 2021-06-10 2021-06-10 Audio signal processing method, device and storage medium

Country Status (1)

Country Link
CN (1) CN113362841B (en)

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101149591B1 (en) * 2004-07-22 2012-05-29 코닌클리케 필립스 일렉트로닉스 엔.브이. Audio signal dereverberation
JP4448423B2 (en) * 2004-10-25 2010-04-07 日本電信電話株式会社 Echo suppression method, apparatus for implementing this method, program, and recording medium therefor
CN102750956B (en) * 2012-06-18 2014-07-16 歌尔声学股份有限公司 Method and device for removing reverberation of single channel voice
CN111161751A (en) * 2019-12-25 2020-05-15 声耕智能科技(西安)研究院有限公司 Distributed microphone pickup system and method under complex scene
CN111462770A (en) * 2020-01-09 2020-07-28 华中科技大学 L STM-based late reverberation suppression method and system
CN112750461B (en) * 2020-02-26 2023-08-01 腾讯科技(深圳)有限公司 Voice communication optimization method and device, electronic equipment and readable storage medium
CN112863537A (en) * 2021-01-04 2021-05-28 北京小米松果电子有限公司 Audio signal processing method and device and storage medium

Also Published As

Publication number Publication date
CN113362841A (en) 2021-09-07

Similar Documents

Publication Publication Date Title
CN108510987B (en) Voice processing method and device
EP3032821B1 (en) Method and device for shooting a picture
EP3091753B1 (en) Method and device of optimizing sound signal
CN108766457B (en) Audio signal processing method, audio signal processing device, electronic equipment and storage medium
CN111179960B (en) Audio signal processing method and device and storage medium
CN111160448B (en) Training method and device for image classification model
CN107341509B (en) Convolutional neural network training method and device and readable storage medium
CN107194464B (en) Training method and device of convolutional neural network model
CN106060707B (en) Reverberation processing method and device
CN108629814B (en) Camera adjusting method and device
CN108984628B (en) Loss value obtaining method and device of content description generation model
CN113807498B (en) Model expansion method and device, electronic equipment and storage medium
CN107992894B (en) Image recognition method, image recognition device and computer-readable storage medium
CN107239758B (en) Method and device for positioning key points of human face
CN107730443B (en) Image processing method and device and user equipment
CN111988704B (en) Sound signal processing method, device and storage medium
CN113362841B (en) Audio signal processing method, device and storage medium
CN113223543B (en) Speech enhancement method, device and storage medium
CN112925466B (en) Touch control method and device
CN112434714A (en) Multimedia identification method, device, storage medium and electronic equipment
CN113488066A (en) Audio signal processing method, audio signal processing apparatus, and storage medium
CN113345456B (en) Echo separation method, device and storage medium
CN107665340B (en) Fingerprint identification method and device and electronic equipment
CN112637416A (en) Volume adjusting method and device and storage medium
CN112861592A (en) Training method of image generation model, image processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant