CN114333876A

CN114333876A - Method and apparatus for signal processing

Info

Publication number: CN114333876A
Application number: CN202111415175.5A
Authority: CN
Inventors: 陈日林; 张兆奇
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2021-11-25
Filing date: 2021-11-25
Publication date: 2022-04-12
Anticipated expiration: 2041-11-25
Also published as: CN114333876B

Abstract

The application provides a method and a device for signal processing, which can reduce the influence of reverberation on signal separation by obtaining a de-mixing matrix according to a mixing matrix containing correlation transfer functions between microphones, thereby improving the signal separation performance. In the method, a first mixing matrix including a correlation transfer function between microphones and a speech signal with reverberation can be obtained according to an observation signal, then a de-mixing matrix of the observation signal can be obtained according to the first mixing matrix and the speech signal with reverberation, and finally a separation signal can be obtained according to the de-mixing matrix. The embodiment of the application can be used in the field of audio processing, for example, front-end speech signal enhancement.

Description

Method and apparatus for signal processing

Technical Field

The present application relates to the field of audio processing, and more particularly, to methods and apparatus for signal processing.

Background

The cocktail party effect reveals the masking effect of the human ear, i.e. the natural ability to extract a desired sound source from a complex noisy auditory scene (an acoustic scene where multiple sound sources are present simultaneously). With the increasing maturity of voice interaction technology, a target voice signal can be extracted through a blind source separation method. Blind Source Separation (BSS) refers to a process of separating a Source signal from a mixed signal (i.e., an observation signal) without knowing the Source signal and signal mixing system (or transmission channel).

Independent Vector Analysis (IVA) is a commonly used blind source separation method, i.e. a received observed signal is decomposed into several Independent components according to a statistically Independent principle, and these Independent components are used as an approximate estimate of the source signal. However, in the existing IVA-based blind source separation method, the mixing matrix is considered to be formed by a room transfer function, which makes the separation performance affected by the room reverberation condition.

Disclosure of Invention

The embodiment of the application provides a method and a device for signal processing, wherein a de-mixing matrix is obtained according to a mixing matrix containing correlation transfer functions between microphones, so that the influence of reverberation on signal separation can be reduced, and the signal separation performance is improved.

In a first aspect, a method of signal processing is provided, including:

acquiring observation signals, wherein the observation signals comprise original sound source signals of at least two sources acquired by at least two microphones;

determining a first mixing matrix H and a reverberated speech signal from the observation signal

Wherein the first mixing matrix H comprises a first correlation transfer function between the at least two microphones, the first mixing matrix H being used to represent the observation signal and the reverberated speech signal

The mapping relationship between the two;

mixing the first mixing matrix H and the reverberated speech signal

Inputting a signal processing model to obtain a de-mixing matrix W of the observation signal, wherein the signal processing model is used for representing the first mixing matrix H and the voice signal with reverberation

And the unmixing matrix WThe mapping relationship between the two;

and acquiring a separation signal according to the unmixing matrix W and the observation signal.

In a second aspect, there is provided an apparatus for signal processing, comprising:

an acquisition unit for acquiring observation signals, wherein the observation signals comprise original sound source signals of at least two sources acquired by at least two microphones;

a processing unit for determining a first mixing matrix H and the reverberated speech signal from the observation signal

The mapping relationship between the two;

the processing unit is further configured to combine the first mixing matrix H and the reverberated speech signal

And the mapping relation between the unmixing matrix W;

the processing unit is further configured to obtain a separation signal according to the unmixing matrix W and the observation signal.

In a third aspect, an electronic device is provided, which includes: a processor and a memory; the memory for storing a computer program; the processor is configured to execute the computer program to implement the method of the first aspect.

In a fourth aspect, a chip is provided, comprising: a processor for calling and running the computer program from the memory so that the device on which the chip is installed performs the method according to the first aspect.

In a fifth aspect, there is provided a computer readable storage medium comprising computer instructions which, when executed by a computer, cause the computer to carry out the method of the first aspect.

In a sixth aspect, there is provided a computer program product comprising computer program instructions to, when run on a computer, cause the computer to perform the method of the first aspect.

In the embodiment of the application, a first mixing matrix including a correlation transfer function between microphones and a voice signal with reverberation are obtained according to an observation signal, then a de-mixing matrix of the observation signal is obtained according to the first mixing matrix and the voice signal with reverberation, and finally a separation signal is obtained from the observation signal according to the de-mixing matrix. Since the first mixing matrix contains the correlation transfer function between microphones instead of the room transfer function, and the correlation transfer function between microphones does not contain reverberation, obtaining the unmixing matrix according to the first mixing matrix can reduce the influence of the reverberation on signal separation, thereby improving the signal separation performance.

Drawings

FIG. 1 is a schematic diagram of an application scenario suitable for use in embodiments of the present application;

FIG. 2 is a schematic diagram of a speech recognition system suitable for use with embodiments of the present application;

fig. 3 is a schematic flow chart of a method of signal processing provided by an embodiment of the present application;

FIG. 4 is a schematic flow chart diagram of another method of signal processing provided by an embodiment of the present application;

FIG. 5 is a schematic flow chart diagram of another method of signal processing provided by an embodiment of the present application;

FIG. 6 is a schematic flow chart diagram of another method of signal processing provided by an embodiment of the present application;

FIG. 7 is a schematic flow chart diagram of another method of signal processing provided by an embodiment of the present application;

FIG. 8 is a schematic flow chart diagram of another method of signal processing provided by an embodiment of the present application;

fig. 9 is a schematic diagram for comparing the effect of the method of signal processing provided by the embodiment of the present application with the effect of the sound source separation scheme in the prior art;

FIG. 10 is an alternative schematic block diagram of an apparatus for signal processing of an embodiment of the present application;

fig. 11 is another alternative schematic block diagram of an electronic device provided by an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be understood that in the embodiment of the present application, "B corresponding to a" means that B is associated with a. In one implementation, B may be determined from a. It should also be understood that determining B from a does not mean determining B from a alone, but may be determined from a and/or other information.

In the description of the present application, "at least one" means one or more, "a plurality" means two or more than two, unless otherwise specified. In addition, "and/or" describes an association relationship of associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone, wherein A and B can be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of the singular or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or multiple.

It should be further understood that the descriptions of the first, second, etc. appearing in the embodiments of the present application are only for illustrating and differentiating the objects, and do not represent a particular limitation to the number of devices in the embodiments of the present application, and do not constitute any limitation to the embodiments of the present application.

It should also be appreciated that a particular feature, structure, or characteristic described in connection with an embodiment is included in at least one embodiment of the application. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.

Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The embodiment of the application provides a signal processing scheme, which can enhance a front-end voice signal, for example, enhance an expected signal, suppress an interference signal, and the like, and can be applied to various fields, for example, smart homes, video conferences, intelligent traffic, driving assistance, and the like, without limitation.

Some brief descriptions will be made below on application scenarios to which the technical solution of the embodiment of the present application can be applied. It should be noted that the following application scenarios are only used for illustrating the embodiments of the present application and are not limited. In specific implementation, the technical scheme provided by the embodiment of the application can be flexibly applied according to actual needs.

Fig. 1 is a schematic diagram of an application scenario suitable for use in an embodiment of the present application. As shown in fig. 1, the application scenario may include a user terminal, which may be, for example, a mobile phone, a smart voice interaction device (e.g., wearable devices such as smart watches and smart glasses), an in-vehicle terminal, and a smart appliance (e.g., a smart speaker, a coffee maker, a printer, etc.). Optionally, the application scenario may further include a computing device, for example, the application scenario may be a cloud server, an intelligent portable device, or a home computing hub, which is not limited in this application. Illustratively, the smart portable device may be a smart phone, a computer, or the like, and the home computing center may be a smart phone, a computer, a smart television, a router, or the like, without limitation. For example, the user terminal and the computing device may be connected through a wireless network or through a bluetooth pairing connection, which is not limited in this embodiment of the application.

It should be noted that the user terminal in fig. 1 is only an example, and the user terminal to which the present application is applied is not limited thereto, and for example, the user terminal may also be an electronic device in an internet of things (IoT) system. In addition, the computing device in fig. 1 is only an example, and the computing device to which the present application is applied is not limited thereto, and may be, for example, a mobile internet device or the like. It should be further noted that the plurality of electronic devices shown in the embodiments of the present application are for better and more comprehensive description of the embodiments of the present application, but should not cause any limitation to the embodiments of the present application.

For a specific example, when the system architecture shown in fig. 1 is applied to a home use scenario, a user terminal may be a home computing hub, such as a mobile phone, a television, a router, or a cloud device, such as a cloud server, for example, and the embodiment of the present application is not limited thereto.

For another specific example, when the system architecture shown in fig. 1 is applied to a personal wearing scenario, the user terminal is, for example, a personal wearing device, such as a smart band, a smart watch, a smart headset, smart glasses, and the like, and the computing device may be a personal device, such as a mobile phone, and the like, which is not limited in this embodiment of the present application.

In some embodiments, the signal processing method provided by the embodiments of the present application may be implemented by a user terminal. For example, after acquiring the observation signal, the user terminal may obtain a unmixing matrix according to the signal processing method provided in the embodiment of the present application, and obtain the separation signal according to the unmixing matrix.

In other embodiments, the signal processing method provided by the embodiments of the present application may be implemented by a user terminal and a computing device in cooperation. For example, after acquiring an observation signal, the user terminal may send the observation signal to the computing device, the computing device obtains a de-mixing matrix according to the signal processing method provided in the embodiment of the present application, and sends the de-mixing matrix to the user terminal, and the user terminal obtains a separation signal according to the de-mixing matrix. For another example, the computing device may obtain a de-mixing matrix according to the signal processing method provided in the embodiment of the present application, obtain a separation signal according to the de-mixing matrix, and send the separation signal to the user terminal.

FIG. 2 is a schematic diagram of a speech recognition system suitable for use with embodiments of the present application. As shown in fig. 2, a front-end signal processing module 201 may be disposed before the speech recognition system 202, the target speech and the interfering speech may be received by one or more microphones (an example of a microphone), an observation signal output by the microphone is input to the front-end signal processing module 201, for example, an enhanced clean target speech signal (i.e., a separation signal) may be obtained after echo cancellation, dereverberation, sound source separation (also referred to as blind source separation), post-processing, and the like, respectively, and then the target speech signal may be input to the speech recognition system 202 for speech recognition. The signal processing scheme provided by the embodiment of the application can be applied to the sound source separation module, and the target speech signal is obtained by obtaining the unmixing matrix and performing signal separation on the observation signal.

For example, the front-end signal processing module 201 in fig. 2 may be on the user terminal in fig. 1, or may be on the computing device in fig. 1, which is not limited in this application.

In the following, related terms related to the embodiments of the present application are described.

1) Mixing matrix: a mapping relationship (e.g., a frequency domain linear combination relationship in the complex domain) between the observed signal and the original sound source signal is characterized. The mixing matrix may be a matrix of Room Transfer Functions (RTFs) from the individual sound sources to the individual microphones.

2) Unmixing matrix: the inverse matrix of the mixing matrix, i.e. the target matrix to be solved, characterizes the mapping relationship between the target speech signal and the observed signal (e.g. the frequency domain linear combination relationship in the complex domain). The unmixing matrix may also be referred to as a separation matrix, both meaning the same.

3) Room transfer function: a function that characterizes the propagation characteristics of sound in the frequency domain from a sound source to a microphone (e.g., a microphone).

4) The correlation transfer function between microphones is a function that characterizes the frequency domain propagation characteristics of sound from one microphone to another. When the microphones are microphones, the correlation transfer function between the microphones may be referred to as a correlation transfer function between the microphones.

Currently, in an IVA-based blind source separation method, an IVA-based separation method is used, a source signal model is established according to a hybrid matrix to obtain an objective function, the objective function is iteratively optimized, and a separation matrix is solved until the model converges to obtain an estimated source signal. In the scheme, the mixing matrix is considered to be formed by a room transfer function, so that the separation performance of the voice signal is influenced by the room reverberation condition, therefore, dereverberation preprocessing needs to be carried out in advance, and the complexity of a sound source separation algorithm is increased. Secondly, this solution is difficult to estimate the variance of the source signal, requires pre-whitening of the observed signal, and is thus difficult to implement in real time in the product. Finally, the scheme adopts a natural gradient method to carry out parameter optimization, the classification performance is limited by step length parameters, and although a large number of self-adaptive variable step length technologies are provided, the gradient descent algorithm still has large calculation amount.

In view of the above problem, embodiments of the present application provide a method for signal processing, which may transform a mixing matrix into a mixing matrix including a correlation transfer function between microphones, instead of a room transfer function, and the correlation transfer function between the microphones does not include reverberation, so as to obtain a unmixing matrix according to the mixing matrix including the correlation transfer function between the microphones, and may mitigate an influence of the reverberation on signal separation, thereby improving signal separation performance.

Furthermore, according to the mixing matrix and the voice signal with reverberation, the de-mixing matrix can construct a first parameter, and the de-mixing matrix is determined according to the mapping relation between the first parameter and the de-mixing matrix, so that estimation of a voice signal model can be avoided in the signal separation process, pre-whitening processing on an observed signal is not needed, and meanwhile, a natural gradient method is avoided for parameter optimization, so that the separation process is not restricted by the step length parameter, and the calculated amount can be effectively reduced.

The technical solutions provided by the embodiments of the present application are described below with reference to the accompanying drawings.

Fig. 3 shows a schematic flow chart of a method 300 of signal processing provided by an embodiment of the present application. The method 300 may be used for blind source separation, for example, may be applied to the application scenario shown in fig. 1, or may be applied to the speech recognition system shown in fig. 2, without limitation. As shown in fig. 3, method 300 includes steps 310 through 340.

An observation signal is acquired 310, wherein the observation signal comprises raw sound source signals of at least two sources acquired by at least two microphones.

Illustratively, the user terminal may acquire the observed signal via one or more microphones (e.g., microphones). The observation signal may comprise speech signals from a plurality of sound sources, which may comprise a target speech signal, i.e. a speech signal from a desired sound source. The observation signal may also include interfering speech signals, i.e. speech signals from undesired sound sources. In addition, the transmission channel or mixing system information of the observed signal is unknown.

In some embodiments, a Short-Time Fourier Transform (STFT) may be performed on the observed signal, resulting in the following equation (1):

x(f，t)＝A^fs(f，t) (1)

wherein x (f, t) represents an observation signal of f frequency point and t time, A^fRepresenting the mixing matrix at the f bins (i.e. one example of the second mixing matrix a),s (f, t) represents original sound source signals of at least two sources of f frequency points and t time, wherein f is the frequency of the signals, and t is the time of the signals.

In the following description, a scheme provided by an embodiment of the present application is described by taking a dual-microphone and dual-sound-source scene as an example. It will be appreciated that the process may be extended to the case of multiple microphones and multiple sound sources, and in particular reference may be made to the description of the process of two microphones and two sound sources, and some simple adaptations may be required, which are within the scope of the embodiments of the present application.

For example, in a two-microphone, two-source scenario, the observed signal x (f, t) may be expressed as:

x(f,t)＝[x₁(f,t)，x₂(f,t)]^T

the original sound source signal s (f, t) can be expressed as:

s(f,t)＝[s₁(f，t)，s₂(f,t)]^T

mixing matrix A^fGenerally consisting of a room transfer function, which can be expressed as:

wherein, according to the formula (2), A^fIncluding 4 parameters

In the method 300 for signal processing provided in the embodiment of the present application, it is necessary to estimate the unmixing matrix W^fAnd satisfies the following conditions:

y(f,t)＝W^fx(f,t) (3)

where y (f, t) represents the estimated separated signal, or may be referred to as the target speech signal, and should coincide with s (f, t) as much as possible. In a two-microphone, two-sound-source scenario, y (f, t) is y1f, t, y2f, tT.

320 determining a first mixing matrix H and the reverberated speech signal from the observation signal

Wherein the first mixing matrix H comprises a first correlation transfer function between at least two microphones. The first mixing matrix H is used for representing the observation signal and the language signal with reverberation

The mapping relationship between them.

Illustratively, in step 320, the above equation (1) may be transformed to obtain:

wherein H^fA mixing matrix representing the frequencies at f, including the associated transfer functions between the microphones,

and (3) representing the voice signal with reverberation at the frequency f and the time t.

In some alternative embodiments, referring to fig. 4, a speech signal with reverberation may be determined according to the following

steps

321 and 322

321, determining a mapping relation between the second mixing matrix a and the first mixing matrix H.

322 determining said reverberated speech signal based on said mapping relationship and said original sound source signals of said at least two sources

Illustratively, taking a dual-microphone and dual-sound-source scene as an example, the mixing matrix a in formula (1) can be obtained^fThe following transformations are made:

wherein the content of the first and second substances,

and

for the related transfer functions between the microphones, the two form a new mixing matrix

I.e. an example of the first mixing matrix H. H^fIncluding 2 parameters

Further, substituting equation (5) into equation (4) can obtain:

as can be seen from equation (6), the speech signal with reverberation

In equation (6), the speech signal to be restored is changed from the original speech signal s (f, t) to the reverberated speech signal

And a mixing matrix A consisting of room transfer functions^fInto a mixing matrix H formed by microphone-dependent transfer functions^fThe reverberation involved due to the room transfer function is transferred to the reverberated speech signal

So that the microphone-dependent transfer function does not containAnd (4) reverberation.

330, mixing the first mixing matrix H and the reverberated speech signal

And the mapping relation between the unmixing matrix W.

That is, the signal processing model may be based on the first mixing matrix H, the reverberated speech signal

And a mapping relation between the mixing matrix W, and the input first mixing matrix H and the voice signal with reverberation

And obtaining a demixing matrix W of the observed signals.

In some alternative embodiments, referring to fig. 5, a demixing matrix W for the observed signals may be determined according to

steps

331 and 332.

331, from the first mixing matrix H, the speech signal with reverberation

And a unmixing matrix W, determining the first parameter.

332, obtaining the unmixing matrix W according to the mapping relationship between the first parameter and the unmixing matrix W.

In some embodiments, the first parameter may be defined. As a possible implementation, referring to fig. 6, the first parameter may be determined according to the following steps 333 and 334:

333, according to the first mixing matrix H, the voice signal with reverberation

A second parameter is determined.

334, the first parameter is determined according to the second parameter and the de-mixing matrix W.

Illustratively, the second parameter may be expressed as

In the embodiment of the present application, the first parameter may be defined

And defining a second parameter

Wherein E [ alpha ], [ beta ], [ alpha ], [ beta ]]Indicating data expectation, different values of k correspond to different sound sources.

For a dual microphone, dual sound source scenario, because

And

independently of each other, substituting the above formula (6) into the second parameter

In (b), one can obtain:

in addition, the unmixing matrix W^fAnd a mixing matrix H^fIs a reciprocal matrix, satisfies W^fH^fI, i.e.:

in the examples of the present application, it can be considered that

The formula (8) is satisfied, wherein the value of k for the dual sound source scene is 1 or 2, which respectively corresponds to different sound sources, and (t-1) represents the last moment of time t.

For equation (7), each term left and right of equal sign is multiplied by left

And right ride

The following can be obtained:

due to the fact that

Satisfies the formula (8), i.e.

So that equation (9) can become:

in the formula (10)

Is that

Similarly, for equation (7), each term of the left and right equal sign is multiplied by the left

And right ride

The following can be obtained:

in formula (11)

Is that

As a specific implementation, may be based on the first parameter (e.g., such as

And

) Determining a mapping relationship between the first parameter and the unmixing matrix W.

That is, it is possible to let:

from the formula (12), it can be found

And

can be expressed, for example, as shown in the following formula

In some alternative embodiments, the modulus value of the unmixing matrix W may also be determined according to a minimum distortion principle (minimum distortion principle). Illustratively, the modulus value of the unmixing matrix W may be determined according to the following equation (14):

W^f(t)＝diag(diag((W^f(t))^-1))W^f(t) (14)

in summary, as a possible implementation manner of step 330, firstly, the reverberated speech signal may be obtained according to the first mixing matrix H

Determining a second parameter

Then according to the second parameter

And a de-mixing matrix W for determining the first parameter

Finally, according to the first parameter and

the mapping relation with the unmixing matrix W, such as equations (13) and (14), yields the unmixing matrix W.

340, obtaining a separation signal according to the unmixing matrix W and the observation signal.

Illustratively, the observed signal x (f, t) may be de-mixed with the matrix W^fThe separated signal y (f, t), i.e. the target speech signal, is obtained by substituting the above equation (3).

Therefore, according to the method, a first mixing matrix comprising correlation transfer functions between microphones and a voice signal with reverberation are obtained according to an observation signal, then a de-mixing matrix of the observation signal is obtained according to the first mixing matrix and the voice signal with reverberation, and finally a separation signal is obtained according to the de-mixing matrix. Since the first mixing matrix contains the correlation transfer function between microphones instead of the room transfer function, and the correlation transfer function between microphones does not contain reverberation, obtaining the unmixing matrix according to the first mixing matrix can reduce the influence of the reverberation on signal separation, thereby improving the signal separation performance.

Furthermore, according to the first mixing matrix and the voice signal with reverberation, the first parameter can be constructed by the de-mixing matrix, and the de-mixing matrix is determined according to the mapping relation between the first parameter and the de-mixing matrix, so that the estimation of the voice signal model can be avoided in the signal separation process, the pre-whitening processing on the observed signal is not needed, and the parameter optimization by adopting a natural gradient method is avoided, so that the separation process is not restricted by the step length parameter, the calculated amount can be effectively reduced, and the signal separation efficiency is improved.

In some alternative embodiments, for example, in the case where the energy of a certain original sound source signal in the observed signal is weak, the first parameter is made

The mapping relation with the unmixing matrix W (such as the formula (13)) has a denominator of 0, which may cause the above-mentioned signal processing procedure to be unstable, for example, a downtime situation.

In order to ensure the stability of the signal processing process and improve the separation performance of the method 300, an Auxiliary virtual sound Source (AuxIS) may be introduced to enhance the observed signal, so as to obtain a first mixing matrix H of the enhanced observed signal and a voice signal with reverberation

For example, the auxiliary virtual sound source may enhance a weaker sound source signal in the original sound source signal to avoid that the energy of a certain original sound source signal is too weak, which may result in the first parameter

The mapping relation (such as formula (13)) with the unmixing matrix W has a denominator of 0, which can help to improve the stability of the signal processing process and improve the signal separationPerformance of ion.

Illustratively, referring to fig. 7, in the method 300, a first mixing matrix H of the enhanced observation signal and the reverberated speech signal may be obtained by the following steps 350 to 370.

350, determining the energy of the signal of the auxiliary virtual sound source according to the observation signal.

As a possible implementation, referring to fig. 8, the energy of the signal of the secondary virtual sound source may be determined by the following

steps

351 and 352.

351, determining the amplitude spectrum of the signal of the auxiliary virtual sound source according to the observation signal.

352, determining the energy of the signal of the secondary virtual sound source based on the energy ratio of the observed signal to the signal of the secondary virtual sound source.

That is, the signal of the auxiliary virtual sound source can be decomposed into two parts, i.e. the amplitude spectrum of the signal of the auxiliary virtual sound source and the energy ratio of the observed signal to the signal of the auxiliary virtual sound source, which can be specifically seen in formula (15):

wherein λ is_dBThe energy ratio of the observed signal to the signal of the auxiliary virtual sound source may be given in advance;

to assist the amplitude spectrum of the virtual sound source.

By way of example, one may define

The following were used:

360, obtaining a second related transfer function corresponding to the auxiliary virtual sound source

For example, a secondary virtual sound source may be introduced to enhance the kth sound source (e.g. the weakest one of the original sound source signals), where the secondary virtual sound source corresponds to a second related transfer function

Can be expressed as

Wherein k can be positive integers and respectively correspond to different sound sources.

Alternatively to this, the first and second parts may,

for the estimated correlation transfer function, the estimation method may change with the change of the usage scenario. In some embodiments, the estimation may be performed by using a method of averaging multiple point measurements in advance (i.e. a method of real measurement), so as to obtain the related transfer function

For example, it may be in a scene where the speaker location is relatively fixed, such as in a car. In some embodiments, the estimation may be performed using an adaptive correlation transfer function estimation algorithm (e.g., a far-field approximation estimation algorithm) to obtain the correlation transfer function

For example, in a scenario where the speaker location is unknown, such as a conference room.

370 according to the original sound source signals of said at least two sources, the energy of the signals of said secondary virtual sound source and said second associated transfer function

Obtaining the first mixing matrix H and the voice signal with reverberation

Wherein the first mixing matrix H comprises the second correlation transfer function

The voice signal with reverberation

Including the energy of the signal of the secondary virtual sound source.

Illustratively, after obtaining equation (6) above, the equation (6) may be further expanded to obtain:

when an auxiliary virtual sound source is introduced to enhance the kth sound source, the enhanced observation signal can be marked as x_k(f, t), which can be expressed as the following equation:

that is, the first mixing matrix H may be updated to

Speech signal with reverberation

Can be updated to

Illustratively, for a dual sound source scene, k takes values of 1 and 2, corresponding to two different sound sources respectively. When a virtual sound source is introduced, the 1 st sound source is enhanced to obtain an observation signal x₁(f, t) is as follows:

when a virtual sound source is introduced, the 2 nd sound source is enhanced to obtain an observation signal x₂(f, t) is as follows:

after the observation signal is enhanced by the auxiliary virtual sound source, the enhanced first mixing matrix H and the reverberated speech signal may be combined

And inputting the signal processing model to obtain an enhanced unmixing matrix W of the observation signal. Accordingly, the enhanced first mixing matrix H and the reverberated speech signal may now be used

And determining the second parameter and the first parameter, and further obtaining a de-mixing matrix W according to the first parameter and the de-mixing matrix W.

Illustratively, based on the enhanced first mixing matrix H and the reverberated speech signal

The determined second parameter may be recorded as

The first parameter may be recorded as

Substitution of equation (18)

It is possible to obtain:

illustratively, for a dual-microphone, dual-source scene, the first parameter is determined

Thereafter, can be

Substituting into the above equations (13) and (14), the unmixing matrix W is obtained. Then, a separation signal can be obtained from the unmixing matrix W and the enhanced observation signal.

As a specific example, in obtaining

Thereafter, can be

Substituting into the equations (13) and (14) to obtain

And the modulus values of the unmixing matrix W. Then, can be based on

Obtaining a target speech signal of the 1 st sound source based on

And obtaining a target voice signal of the 2 nd sound source.

That is, in the case where the auxiliary virtual sound source is introduced, it is possible to first obtain the energy λ (f, t) of the auxiliary virtual sound source from the observed signal and estimate the correlation transfer function between the microphones corresponding to the auxiliary virtual sound source

Can then be based on this energy λ (f, t) and the associated transfer function

Determining an enhanced observed Signal x_k(f, t), and further based on the enhanced observed signal x_k(f, t) determining a second parameter

And a first parameter

Finally, according to the first parameter and

the mapping relation with the unmixing matrix W, such as equations (13) and (14), obtains the unmixing matrix W, and thus obtains the separated signal y (f, t), i.e., the target speech signal.

Therefore, in the embodiment of the application, the auxiliary virtual sound source is introduced to enhance the observation signal, so as to obtain a second mixing matrix corresponding to the enhanced observation signal and a voice signal with reverberation, where the second mixing matrix includes a second correlation transfer function corresponding to the auxiliary virtual sound source, and the voice signal with reverberation includes energy of a signal of the auxiliary virtual sound source. The unmixing matrix W to be solved can be regarded as a special beam forming matrix (that is, the unmixing matrix W is not a matrix designed by direction information, but a matrix designed by sound source independence), the added auxiliary virtual sound source can enhance the original voice signal in the observation signal, and the accuracy of the unmixing matrix W can be increased, so that the stability of the signal processing process can be ensured, and the signal separation performance can be improved.

Fig. 9 is a schematic diagram for comparing the effect of the method of signal processing provided by the embodiment of the present application with the effect of the sound source separation scheme in the prior art. The graph (a) is a comparison graph of the Signal-to-Interference Ratio (SIR) rise value of the separated Signal obtained in each scheme, and the graph (b) is a comparison graph of the Signal-to-Interference Ratio (SDR) rise value of the separated Signal obtained in each scheme, and the X-axis of the graph (a) and the graph (b) represents the reverberation time.

For example, a mixed speech signal may be acquired in a two-microphone, two-source mixing scenario. As a specific example, two microphones can be used to collect the voice signals of two persons speaking simultaneously in a room with a length of 4.45m, a width of 3.55 m and a height of 2.5 m. The two persons may be located 1m from the microphones, respectively, at 45 ° and 135 ° with respect to the directional angle of the microphones, respectively, and the distance between the two microphones may be 0.1 m. The reverberation time is adjusted from 150ms to 300ms, and the adjustment step size of the reverberation time is 10 ms.

The speech signals received by the two microphones can respectively adopt (1) the traditional AuxIVA technology; (2) a geometric Constrained Auxiliary function (GCAV) -IVA with VCD of the reference algorithm; (3) the AuxIS-AuxIVA estimating method and device using the guide vector provided by the embodiment of the application; (4) the AuxIS-AuxIVA estimated by using the pre-measured value is provided by the embodiment of the application. Wherein the steering vector represents a far-field approximate estimation of the related transfer function of the AuxIS, and the pre-measured value is the actually measured related transfer function of the AuxIS.

As can be seen from fig. 9, under different reverberation times, the SIR and SDR of the separated signal obtained by the signal processing method provided by the embodiment of the present application are significantly improved compared with the existing method, so that the signal processing method provided by the embodiment of the present application can help to improve the quality of the front-end signal.

The present invention is not limited to the details of the above embodiments, and various simple modifications can be made to the technical solution of the present invention within the technical concept of the present invention, and the technical solution of the present invention is protected by the present invention. For example, the various features described in the foregoing detailed description may be combined in any suitable manner without contradiction, and various combinations that may be possible are not described in this application in order to avoid unnecessary repetition. For example, various embodiments of the present application may be arbitrarily combined with each other, and the same should be considered as the disclosure of the present application as long as the concept of the present application is not violated.

It should also be understood that, in the various method embodiments of the present application, the sequence numbers of the above-mentioned processes do not imply an execution sequence, and the execution sequence of the processes should be determined by their functions and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application. It is to be understood that the numerical designations are interchangeable under appropriate circumstances such that the embodiments of the application described are capable of operation in sequences other than those illustrated or described herein.

Method embodiments of the present application are described in detail above in conjunction with fig. 3-9, and apparatus embodiments of the present application are described in detail below in conjunction with fig. 10-11.

Fig. 10 is a schematic block diagram of an apparatus 700 for signal processing according to an embodiment of the present application. As shown in fig. 10, the signal processing apparatus 700 may include an obtaining unit 710 and a processing unit 720.

An obtaining unit 710 for obtaining observation signals, wherein the observation signals comprise at least two original sound source signals of at least two sources obtained by at least two microphones;

a processing unit 720 for determining a first mixing matrix H and a reverberated speech signal based on the observation signal

The mapping relationship between the two;

the processing unit 720 is further configured to combine the first mixing matrix H and the reverberated speech signal

And the mapping relation between the unmixing matrix W;

the processing unit 720 is further configured to obtain a separation signal according to the unmixing matrix W and the observation signal.

Optionally, the processing unit 720 is specifically configured to:

the voice signal with reverberation is obtained according to the first mixing matrix H

And the unmixing matrix W, determining a first parameter;

and obtaining the unmixing matrix W according to the mapping relation between the first parameter and the unmixing matrix W.

Optionally, the processing unit 720 is specifically configured to:

Determining a second parameter;

and determining the first parameter according to the second parameter and the unmixing matrix W.

Optionally, the processing unit 720 is further configured to:

and determining the mapping relation between the first parameter and the unmixing matrix W according to the null space of the first parameter.

Optionally, the processing unit 720 is further configured to:

and determining the modulus value of the unmixing matrix W according to the minimum distortion principle.

Optionally, the processing unit 720 is further configured to determine, according to the observation signal, an energy of a signal of an auxiliary virtual sound source;

the obtaining unit 710 is further configured to obtain a second correlation transfer function corresponding to the auxiliary virtual sound source.

Wherein, the processing unit 720 is specifically configured to:

according to the observed signal, the energy of the signal of the auxiliary virtual sound source and the second associated transfer function

Obtaining the first mixing matrix H and the voice signal with reverberation

Wherein the first mixing matrix H comprises the second correlation transfer function, the reverberated speech signal

Including the energy of the signal of the secondary virtual sound source.

Optionally, the processing unit 720 is specifically configured to:

determining a magnitude spectrum of a signal of the auxiliary virtual sound source according to the observation signal;

determining the energy of the signal of the auxiliary virtual sound source according to the energy ratio of the observed signal to the signal of the auxiliary virtual sound source.

Optionally, the obtaining unit 710 is specifically configured to determine the second correlation transfer function by using a way of averaging in advance through multipoint measurement.

Optionally, the obtaining unit 710 is specifically configured to determine the second correlation transfer function by using an adaptive correlation transfer function estimation algorithm.

Optionally, the processing unit 720 is specifically configured to:

determining a mapping relationship between the first mixing matrix H and a second mixing matrix A, wherein the second mixing matrix A is used for representing the mapping relationship between the observation signals and the original sound source signals of the at least two sources;

determining the language with reverberation according to the mapping relation and the original sound source signals of the N sourcesSound signal

Optionally, the second mixing matrix a comprises a room transfer function between a sound source of the observation signal to a microphone.

It is to be understood that apparatus embodiments and method embodiments may correspond to one another and that similar descriptions may refer to method embodiments. To avoid repetition, further description is omitted here. Specifically, the apparatus 700 for signal processing in this embodiment may correspond to a corresponding main body for executing the method 300 in this embodiment, and the foregoing and other operations and/or functions of each module in the apparatus 700 are respectively for implementing each method in fig. 3 to fig. 8 or a corresponding flow in each method, and are not described again here for brevity.

The apparatus and system of embodiments of the present application are described above in connection with the drawings from the perspective of functional modules. It should be understood that the functional modules may be implemented by hardware, by instructions in software, or by a combination of hardware and software modules. Specifically, the steps of the method embodiments in the present application may be implemented by integrated logic circuits of hardware in a processor and/or instructions in the form of software, and the steps of the method disclosed in conjunction with the embodiments in the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. Alternatively, the software modules may be located in random access memory, flash memory, read only memory, programmable read only memory, electrically erasable programmable memory, registers, and the like, as is well known in the art. The storage medium is located in a memory, and a processor reads information in the memory and completes the steps in the above method embodiments in combination with hardware thereof.

Fig. 11 is a schematic block diagram of an electronic device 800 provided in an embodiment of the present application.

As shown in fig. 11, the electronic device 800 may include:

a memory 810 and a processor 820, the memory 810 being configured to store a computer program and to transfer the program code to the processor 820. In other words, the processor 820 may call and execute a computer program from the memory 810 to implement the communication method in the embodiment of the present application.

For example, the processor 820 may be configured to perform the steps of the method 300 according to instructions in the computer program.

In some embodiments of the present application, the processor 820 may include, but is not limited to:

general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, and the like.

In some embodiments of the present application, the memory 810 includes, but is not limited to:

volatile memory and/or non-volatile memory. The non-volatile Memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash Memory. Volatile Memory can be Random Access Memory (RAM), which acts as external cache Memory. By way of example, but not limitation, many forms of RAM are available, such as Static random access memory (Static RAM, SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic random access memory (Synchronous DRAM, SDRAM), Double Data Rate Synchronous Dynamic random access memory (DDR SDRAM), Enhanced Synchronous SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).

In some embodiments of the present application, the computer program may be partitioned into one or more modules, which are stored in the memory 810 and executed by the processor 820 to perform the encoding methods provided herein. The one or more modules may be a series of computer program instruction segments capable of performing certain functions, the instruction segments describing the execution of the computer program in the electronic device 800.

Optionally, the electronic device 800 may further include:

a transceiver 830, the transceiver 830 being connectable to the processor 820 or the memory 810.

The processor 820 may control the transceiver 830 to communicate with other devices, and specifically, may transmit information or data to the other devices or receive information or data transmitted by the other devices. The transceiver 830 may include a transmitter and a receiver. The transceiver 830 may further include one or more antennas.

It should be understood that the various components in the electronic device 800 are connected by a bus system that includes a power bus, a control bus, and a status signal bus in addition to a data bus.

According to an aspect of the present application, there is provided a communication device comprising a processor and a memory, the memory being configured to store a computer program, the processor being configured to call and execute the computer program stored in the memory, so that the encoder performs the method of the above-described method embodiment.

According to an aspect of the present application, there is provided a computer storage medium having a computer program stored thereon, which, when executed by a computer, enables the computer to perform the method of the above-described method embodiments. In other words, the present application also provides a computer program product containing instructions, which when executed by a computer, cause the computer to execute the method of the above method embodiments.

According to another aspect of the application, a computer program product or computer program is provided, comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method of the above-described method embodiment.

In other words, when implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions described in accordance with the embodiments of the present application occur, in whole or in part, when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wire (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that includes one or more of the available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a Digital Video Disk (DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.

Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus, device and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the module is merely a logical division, and other divisions may be realized in practice, for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.

Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. For example, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module.

The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method of signal processing, comprising:

The mapping relationship between the two;

mixing the first mixing matrix H and the reverberated speech signal

And the mapping relation between the unmixing matrix W;

2. The method of claim 1, wherein the mixing the first mixing matrix H and the reverberated speech signal

Inputting a signal processing model to obtain a demixing matrix W of the observation signal, wherein the demixing matrix W comprises the following steps:

And the unmixing matrix W, determining a first parameter;

3. The method of claim 2, wherein the reverberated speech signal is based on the mixing matrix H

And the unmixing matrix W, determining a first parameter, comprising:

Determining a second parameter;

4. The method of claim 2 or 3, further comprising:

5. The method according to any one of claims 2-4, further comprising:

6. The method of any one of claims 1-5, further comprising:

determining the energy of the signal of the auxiliary virtual sound source according to the observation signal;

acquiring a second related transfer function corresponding to the auxiliary virtual sound source;

wherein the determining of the first mixing matrix H and the reverberated speech signal from the observation signal

The method comprises the following steps:

according to the observed signal, the energy of the signal of the auxiliary virtual sound sourceQuantity and said second associated transfer function

Obtaining the first mixing matrix H and the voice signal with reverberation

Including the energy of the signal of the secondary virtual sound source.

7. The method of claim 6, wherein determining the energy of the signal of the secondary virtual sound source from the observed signal comprises:

8. The method according to claim 6 or 7, wherein the obtaining a second correlation transfer function corresponding to the auxiliary virtual sound source comprises:

and determining the second correlation transfer function by using a mode of averaging in advance through multipoint measurement.

9. The method according to claim 6 or 7, wherein the obtaining a second correlation transfer function corresponding to the auxiliary virtual sound source comprises:

determining the second correlation transfer function using an adaptive correlation transfer function estimation algorithm.

10. The method of any one of claims 1-9, wherein the root is a root of a plantDetermining a mixing matrix H and a speech signal with reverberation from the observation signal

The method comprises the following steps:

determining the voice signal with reverberation according to the mapping relation and the original sound source signals of the N sources

11. The method according to claim 10, wherein the second mixing matrix a comprises a room transfer function between a sound source of the observation signal to a microphone.

12. An apparatus for signal processing, comprising:

The mapping relationship between the two;

the processing unit is further configured to sum the first mixing matrix H andthe voice signal with reverberation

And the mapping relation between the unmixing matrix W;

13. An electronic device comprising a processor and a memory, the memory having stored therein instructions that, when executed by the processor, cause the processor to perform the method of any of claims 1-11.

14. A computer storage medium for storing a computer program comprising instructions for performing the method of any one of claims 1-11.

15. A computer program product, comprising computer program code which, when run by an electronic device, causes the electronic device to perform the method of any of claims 1-11.