WO2021193093A1 - Signal processing device, signal processing method, and program - Google Patents

Signal processing device, signal processing method, and program

Info

Publication number
WO2021193093A1
WO2021193093A1 (PCT/JP2021/009764)
Authority
WO
WIPO (PCT)
Prior art keywords
signal
sound
reference signal
extraction
equation
Prior art date
Application number
PCT/JP2021/009764
Other languages
French (fr)
Japanese (ja)
Inventor
Atsuo Hiroe
Original Assignee
Sony Group Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony Group Corporation
Publication of WO2021193093A1 publication Critical patent/WO2021193093A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • G10L21/0308Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals

Definitions

  • the present disclosure relates to signal processing devices, signal processing methods and programs.
  • A technique has been proposed for extracting a target sound from a mixed sound signal in which a sound to be extracted (hereinafter referred to as the target sound) and a sound to be removed (hereinafter referred to as the disturbing sound) are mixed (see, for example, Patent Documents 1 to 3 below).
  • One of the purposes of the present disclosure is to provide a signal processing device, a signal processing method, a program, and a signal processing system with improved accuracy of extracting a target sound.
  • The present disclosure is, for example, a signal processing device to which a mixed sound signal, recorded by microphones placed at different positions and containing a mixture of the target sound and sounds other than the target sound, is input,
  • the device having a reference signal generation unit that generates a reference signal corresponding to the target sound based on the mixed sound signal, and a sound source extraction unit that extracts, from the mixed sound signal, a signal that is similar to the reference signal and in which the target sound is further emphasized.
  • The present disclosure is also, for example, a signal processing method in which a mixed sound signal, recorded by microphones placed at different positions and containing a mixture of the target sound and sounds other than the target sound, is input,
  • a reference signal generation unit generates a reference signal corresponding to the target sound based on the mixed sound signal, and
  • a sound source extraction unit extracts, from the mixed sound signal, a signal that is similar to the reference signal and in which the target sound is further emphasized.
  • The present disclosure is also, for example, a program that causes a computer to execute a signal processing method in which a mixed sound signal, recorded by microphones placed at different positions and containing a mixture of the target sound and sounds other than the target sound, is input,
  • a reference signal generation unit generates a reference signal corresponding to the target sound based on the mixed sound signal, and
  • a sound source extraction unit extracts, from the mixed sound signal, a signal that is similar to the reference signal and in which the target sound is further emphasized.
  • FIG. 1 is a diagram for explaining an example of the sound source separation process of the present disclosure.
  • FIG. 2 is a diagram for explaining an example of a sound source extraction method using a reference signal based on the deflation method.
  • FIG. 3 is a diagram referred to when explaining a process of extracting a sound source after generating a reference signal for each section.
  • FIG. 4 is a block diagram showing a configuration example of the sound source extraction device according to the embodiment.
  • FIG. 5 is a diagram referred to when explaining an example of interval estimation and reference signal generation processing.
  • FIG. 6 is a diagram referred to when explaining other examples of interval estimation and reference signal generation processing.
  • FIG. 7 is a diagram referred to when explaining other examples of interval estimation and reference signal generation processing.
  • FIG. 8 is a diagram referred to when explaining the details of the sound source extraction unit according to the embodiment.
  • FIG. 9 is a flowchart referred to when explaining the flow of the entire processing performed by the sound source extraction device according to the embodiment.
  • FIG. 10 is a diagram referred to when explaining the process performed by the STFT unit according to the embodiment.
  • FIG. 11 is a flowchart referred to when explaining the flow of the sound source extraction process according to the embodiment.
  • "_" represents a subscript. For example, in X_k, "k" is a subscript, and in R_{xx}, "xx" is a subscript.
  • "^" represents a superscript.
  • the present disclosure is sound source extraction using a reference signal (reference).
  • In the present disclosure, in addition to recording, with a plurality of microphones, a signal in which the sound to be extracted (target sound) and the sound to be removed (disturbing sound) are mixed, a "rough" amplitude spectrogram corresponding to the target sound is generated and used as a reference signal.
  • The present disclosure produces an extraction result that is similar to the reference signal but has higher accuracy. That is, one form of the present disclosure is a signal processing device that extracts, from the mixed sound signal, a signal that is similar to the reference signal and in which the target sound is further emphasized.
  • an objective function that reflects both the dependency (similarity) between the reference signal and the extraction result and the independence between the extraction result and other virtual separation results is prepared.
  • the output signal can be limited to one sound source corresponding to the reference signal. Since it can be regarded as a beamformer that considers both dependence and independence, it is appropriately referred to as Similarity-and-Independence-aware Beamformer (SIBF) below.
  • SIBF: Similarity-and-Independence-aware Beamformer
  • In other words, the present disclosure performs sound source extraction using a reference signal: by using a rough amplitude spectrogram of the target sound as the reference signal, it produces extraction results that are similar to, and more accurate than, the reference signal.
  • the usage situation assumed in the present disclosure shall satisfy all of the following conditions (1) to (3), for example.
  • (1) Observation signals are recorded synchronously by a plurality of microphones.
  • each microphone may or may not be fixed, and the position of each microphone and sound source may be unknown in either case.
  • An example of a fixed microphone is a microphone array, and an example of a non-fixed microphone is a case where each speaker wears a pin microphone or the like.
  • (2) The section in which the target sound is present is known; for example, in the case of extracting the voice of a specific speaker, this is the utterance section. (3) It is unknown whether or not the target sound is present outside that section. That is, the assumption that the target sound does not exist outside the section may not hold.
  • The rough target sound spectrogram means a spectrogram in which the spectrogram of the true target sound is degraded because it meets one or more of conditions a) to e) below:
    a) It is real-valued data that does not include phase information.
    b) Although the target sound is predominant, the disturbing sound is also included.
    c) The disturbing sound is almost eliminated, but the sound is distorted as a side effect.
    d) The resolution is lower than that of the true target sound spectrogram in the time direction, the frequency direction, or both.
    e) The amplitude scale of the spectrogram differs from that of the observed signal, so that comparing magnitudes is meaningless.
  • For example, even if the amplitude of the rough target sound spectrogram is half the amplitude of the observed signal spectrogram, this does not mean that the target sound and the disturbing sound are included in the observed signal at the same magnitude.
  • The rough amplitude spectrogram (including one generated from a signal other than sound) is acquired or generated by, for example, the following methods. -Record the sound with a microphone installed near the target sound source (for example, a pin microphone attached to the speaker) and obtain the amplitude spectrogram from it.
  • -A neural network that extracts a specific type of sound in the amplitude spectrogram region is learned in advance, and an observation signal is input to the neural network (NN).
  • -Amplitude spectrogram is obtained from a signal acquired by a sensor other than the normally used air conduction microphone such as a bone conduction microphone.
  • -A spectrogram in the linear frequency domain is generated by applying a predetermined conversion to the spectrogram-equivalent data calculated in the non-linear frequency domain such as the mel frequency.
  • One object of the present disclosure is to use the rough target sound spectrogram acquired or generated in this way as a reference signal and to generate an extraction result whose accuracy exceeds the reference signal (in which the target sound is further emphasized, in other words, which is closer to the true target sound). More specifically, in a sound source extraction process that applies a linear filter to a multi-channel observation signal to generate an extraction result, the object is to estimate a linear filter that generates an extraction result whose accuracy exceeds that of the reference signal (that is closer to the true target sound).
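  • As an informal illustration of this kind of linear filtering (not part of the patent text; array shapes and names are assumptions), the following sketch applies a per-frequency linear filter to a multi-channel observation spectrogram to obtain a single-channel extraction result.
```python
import numpy as np

def apply_extraction_filter(X, W):
    """Apply a per-frequency linear filter to a multi-channel observation.

    X: observation spectrograms, shape (N_mics, F, T), complex
    W: extraction filters, shape (F, N_mics), complex (one filter per frequency bin)
    Returns the extraction result, shape (F, T), complex.
    """
    N, F, T = X.shape
    Y = np.empty((F, T), dtype=complex)
    for f in range(F):
        # y(f, t) = w(f)^H x(f, t) for every frame t
        Y[f, :] = W[f].conj() @ X[:, f, :]
    return Y
```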
  • the reason for estimating the linear filter for the sound source extraction process is to enjoy the following advantages of the linear filter.
  • Advantage 1 The distortion of the extraction result is small compared to the non-linear extraction process. Therefore, when combined with voice recognition or the like, it is possible to avoid a decrease in recognition accuracy due to distortion.
  • Advantage 2 The phase of the extraction result can be appropriately estimated by the rescaling process described later. Therefore, it is possible to avoid a problem caused by an inappropriate phase when combined with a phase-dependent post-stage processing (including a case where the extraction result is reproduced as a sound and a human hears it).
  • Advantage 3 By increasing the number of microphones, it is easy to improve the extraction accuracy.
  • The adaptive beamformer referred to here is a method of adaptively estimating a linear filter for extracting a target sound by using the signals observed by a plurality of microphones together with information indicating which sound source is to be extracted as the target sound.
  • Examples of the adaptive beam former include the methods described in JP-A-2012-234150 and JP-A-2006-072163.
  • SN ratio: Signal-to-Noise Ratio
  • The GEV beamformer is an adaptive beamformer that can be used even when the placement of the microphones and the direction of the target sound are unknown.
  • The SN-ratio-maximizing beamformer is a method for finding a linear filter that maximizes the ratio V_s / V_n of a) to b) below. a) The variance V_s of the result of applying a predetermined linear filter to a section in which only the target sound is present. b) The variance V_n of the result of applying the same linear filter to a section in which only the disturbing sound is present.
  • In this method, a linear filter can be estimated as long as each of these sections can be detected; information on the placement of the microphones or the direction of the target sound is not needed.
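  • As a minimal sketch of this idea (illustrative only, not the method of the present disclosure; it assumes the variance ratio is maximized by solving a generalized eigenvalue problem on the two covariance matrices), one frequency bin could be handled as follows.
```python
import numpy as np
from scipy.linalg import eigh

def max_snr_filter(X_target, X_noise):
    """Estimate a filter maximizing V_s / V_n for one frequency bin.

    X_target: frames where only the target sound is present, shape (N_mics, T_s), complex
    X_noise:  frames where only the disturbing sound is present, shape (N_mics, T_n), complex
    """
    R_s = X_target @ X_target.conj().T / X_target.shape[1]  # target-section covariance
    R_n = X_noise @ X_noise.conj().T / X_noise.shape[1]     # noise-section covariance
    # Generalized eigenvalue problem R_s v = lambda R_n v; the eigenvector with the
    # largest eigenvalue maximizes the variance ratio V_s / V_n.
    eigvals, eigvecs = eigh(R_s, R_n)
    return eigvecs[:, -1]  # filter, applied as w^H x
```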
  • Blind sound source separation is a technology that estimates each sound source from a signal in which a plurality of sound sources are mixed, using only the signals observed by a plurality of microphones (without using information such as the direction of the sound sources or the arrangement of the microphones).
  • An example of such a technique is the technique of Japanese Patent No. 4449871.
  • The technology of Japanese Patent No. 4449871 is an example of a technique called Independent Component Analysis (hereinafter referred to as ICA); ICA decomposes the signals observed by N microphones into N sound sources.
  • the observation signal used at that time may include a section in which the target sound is sounding, and information on a section in which only the target sound or only the disturbing sound is sounding is unnecessary.
  • After converting each separation result into an amplitude spectrogram, the squared error (Euclidean distance) between each amplitude spectrogram and the reference signal is calculated, and the separation result whose amplitude spectrogram minimizes that error may be adopted.
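  • This selection step could be sketched as follows (illustrative only; variable names are assumptions): convert each separation result to an amplitude spectrogram and keep the one closest to the reference.
```python
import numpy as np

def select_by_reference(Y_list, R):
    """Pick the separation result whose amplitude spectrogram is closest to the reference.

    Y_list: list of N complex separation-result spectrograms, each of shape (F, T)
    R: reference amplitude spectrogram, shape (F, T), non-negative real
    """
    errors = [np.sum((np.abs(Y) - R) ** 2) for Y in Y_list]  # squared Euclidean distance
    return Y_list[int(np.argmin(errors))]
```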
  • However, this method of selecting after separation has the following problems. 1) Although only one sound source is desired, N sound sources are generated as an intermediate step, which is disadvantageous in terms of calculation cost and memory usage. 2) The rough target sound spectrogram, which is the reference signal, is used only in the step of selecting one sound source out of the N sound sources and is not used in the step of separating into N sound sources. Therefore, the reference signal does not contribute to improving the extraction accuracy.
  • IDLMA: Independent Deeply Learned Matrix Analysis
  • The feature of IDLMA is that neural networks (NNs) that generate the power spectrogram (the square of the amplitude spectrogram) of each sound source to be separated are learned in advance. For example, when it is desired to separate the part of each musical instrument from music in which a plurality of instruments are played at the same time, NNs that take the music as input and output the sound of each instrument are learned in advance. At separation time, the observation signal is input to each NN, and the output power spectrograms are used as reference signals to perform the separation. Therefore, compared with a completely blind separation process, the separation accuracy can be expected to improve to the extent that the reference signals are used.
  • However, IDLMA requires N different power spectrograms as reference signals in order to generate N separation results. Therefore, even if there is only one sound source of interest and the other sound sources are not needed, reference signals must be prepared for all sound sources, which can be difficult in practice. Furthermore, the above-mentioned Document 1 mentions only the case where the number of microphones and the number of sound sources match, and does not mention how many reference signals should be prepared when the two numbers do not match. In addition, since IDLMA is a sound source separation method, using it for sound source extraction requires first generating N separation results and then keeping only one sound source; the problem of sound source separation being wasteful in terms of calculation cost and memory usage therefore remains.
  • Examples of the sound source extraction using the time envelope as a reference signal include the techniques described in Japanese Patent Application Laid-Open No. 2014-219467 proposed by the present inventor.
  • This method estimates a linear filter using a reference signal and a multi-channel observation signal, as in the present disclosure.
  • However, the reference signal is a time envelope, not a spectrogram. This corresponds to a rough target sound spectrogram that has been flattened by an operation such as averaging in the frequency direction. Therefore, if the target sound has the characteristic that its change over time differs for each frequency, the reference signal cannot express this properly, and the extraction accuracy may decrease as a result.
  • In addition, the reference signal is reflected only as an initial value in the iterative process for obtaining the extraction filter. Since the reference signal imposes no constraint from the second iteration onward, a sound source different from the reference signal may be extracted. For example, if a sound occurs only for a moment within the section, extracting it may be more optimal in terms of the objective function, so an undesired sound may be extracted depending on the number of iterations.
  • As described above, each of the above-mentioned techniques has the problem that it is difficult to use in situations to which the present disclosure applies, or that an extraction result with sufficient accuracy cannot be obtained.
  • Element 1 In the process of separation, prepare an objective function that reflects not only the independence of the separation results but also the dependency between one of the separation results and the reference signal, and optimize it.
  • Element 2 Similarly, in the separation process, a method called the deflation method, which separates sound sources one by one, is introduced. Then, the separation process is terminated when the first sound source is separated.
  • the sound source extraction technology of the present disclosure extracts one desired sound source from multi-channel observation signals observed by a plurality of microphones by applying an extraction filter which is a linear filter. Therefore, it can be regarded as a kind of beam former (BF). In the extraction process, both the similarity between the reference signal and the extraction result and the independence between the extraction result and other separation results are reflected. Therefore, the sound source extraction method of the present disclosure is appropriately referred to as Similarity-and-Independence-aware Beamformer: SIBF.
  • SIBF Similarity-and-Independence-aware Beamformer
  • the separation process of the present disclosure will be described with reference to FIG.
  • The frame marked (1-1) represents the separation process assumed in conventional time-frequency-domain independent component analysis (Japanese Patent No. 4449871 and the like), and (1-5) and (1-6), which exist outside that frame, are elements added in the present disclosure.
  • the conventional time-frequency domain blind sound source separation will be described first using the frame of (1-1), and then the separation process of the present disclosure will be described.
  • X_1 to X_N are observation signal spectrograms (1-2) corresponding to N microphones, respectively. These are complex data and are generated by applying the short-time Fourier transform described later to the waveform of the sound observed by each microphone.
  • the vertical axis represents frequency and the horizontal axis represents time. The time length shall be the same as or longer than the length of the target sound to be extracted.
  • The separation result spectrograms Y_1 to Y_N (1-4) are generated by multiplying this observation signal spectrogram by a predetermined square matrix called the separation matrix (1-3).
  • the number of separation result spectrograms is N, which is the same as the number of microphones.
  • The value of the separation matrix is determined so that Y_1 to Y_N are statistically independent (that is, so that Y_1 to Y_N differ from one another as much as possible). Since such a matrix cannot be obtained in a single step, an objective function that reflects the independence of the separation result spectrograms is prepared, and a separation matrix that makes that function optimal (maximum or minimum, depending on the nature of the objective function) is found iteratively. After the separation matrix and the separation result spectrograms have been obtained, applying the inverse Fourier transform to each separation result spectrogram to generate waveforms yields signals that estimate the individual sound sources before mixing.
  • the reference signal is a rough amplitude spectrogram of the target sound and is generated by the reference signal generation unit marked with (1-5).
  • In the present disclosure, in contrast, the separation matrix is determined by also considering the dependency between Y_1, which is one of the separation result spectrograms, and the reference signal R. That is, a separation matrix that optimizes the objective function is obtained, with the objective function reflecting both of the following: a) the independence among Y_1 to Y_N (solid line L1), and b) the dependency between Y_1 and R (dotted line L2). The specific formula of the objective function will be described later.
  • Advantage 1: In ordinary time-frequency-domain independent component analysis, it is uncertain which original signal appears at which position in the separation result spectrograms; this changes depending on the initial value of the separation matrix, the degree of mixing in the observation signal (corresponding to the mixed sound signal described later), and the algorithm used to obtain the separation matrix. In the present disclosure, by reflecting the dependency on the reference signal, the separation result similar to the reference signal can be made to appear as Y_1.
  • On the other hand, since this is still a separation method, the number of generated signals is N. That is, even if the only desired sound source is Y_1, N-1 unnecessary signals are generated at the same time.
  • the deflation method is a method of estimating the original signals one by one instead of separating all the sound sources at the same time.
  • For the deflation method, refer to, for example, Chapter 8 of Reference 2 below.
  • However, with the deflation method alone, the order of the separation results is indefinite, so the order in which the desired sound source appears is also indefinite.
  • When the deflation method is applied to sound source separation using an objective function that reflects both independence and dependence as described above, the separation result similar to the reference signal can always be made to appear first. That is, the separation process may be terminated once the first sound source has been separated (estimated), and it is not necessary to generate the N-1 unnecessary separation results. In addition, it is not necessary to estimate all the elements of the separation matrix; only the elements needed to generate Y_1 need to be estimated.
  • The deflation method is a separation method (one that estimates all the sound sources before mixing), but if the separation is stopped once one sound source has been estimated, it can be used as an extraction method (one that estimates a single desired sound source). Therefore, in the following description, the operation of estimating only the separation result Y_1 is referred to as "extraction", and Y_1 is appropriately referred to as the "(target sound) extraction result". Furthermore, each separation result is generated from one of the vectors constituting the separation matrix (1-3); this vector is appropriately referred to as an "extraction filter".
  • FIG. 2 shows the details of FIG. 1, and the elements necessary for applying the deflation method are added.
  • The observation signal spectrograms marked (2-1) in FIG. 2 are the same as (1-2) in FIG. 1 and are generated by applying the short-time Fourier transform to the time-domain signals observed by the N microphones.
  • Next, the decorrelated (uncorrelated) observation signal spectrograms (2-3) are generated.
  • Decorrelation, also called whitening, is a transformation that makes the signals observed by the microphones uncorrelated with one another. The specific formulas used in this process will be described later. If decorrelation is performed as preprocessing for separation, efficient algorithms that exploit the properties of decorrelated signals can be applied in the separation.
  • the deflation method is one such algorithm.
  • the number of uncorrelated observation signal spectrograms is the same as the number of microphones, and each is U_1 to U_N.
  • the generation of the uncorrelated observation signal spectrogram need only be performed once as a process before obtaining the extraction filter.
  • In the deflation method, instead of estimating the matrix that generates the separation results Y_1 to Y_N simultaneously, a single filter that generates each separation result is estimated one at a time.
  • In the present disclosure, the only filter to be estimated is w_1, which takes U_1 to U_N as input and generates Y_1; Y_2 to Y_N and w_2 to w_N are not actually generated.
  • the reference signal R with (2-8) is the same as (1-6) in FIG. As described above, in estimating the filter w_1, both the independence of Y_1 to Y_N and the dependency between R and Y_1 are taken into consideration.
  • In the example of FIG. 3, the target sound is human voice, and the number of target sound sources, that is, the number of speakers, is two.
  • In general, however, the target sound may be any kind of sound, and the number of sound sources is not limited to two.
  • Non-voice signals are disturbing sounds; even a voice is treated as a disturbing sound if it is output from a device such as a loudspeaker.
  • Let the two speakers be speaker 1 and speaker 2, respectively.
  • the utterances marked with (3-1) and the utterances marked with (3-2) are the utterances of the speaker 1.
  • the utterances marked with (3-3) and the utterances marked with (3-4) are the utterances of the speaker 2.
  • (3-5) represents a disturbing sound.
  • In FIG. 3, the vertical axis represents the difference in sound source position and the horizontal axis represents time. Part of the utterance section overlaps between utterances (3-1) and (3-3). This corresponds, for example, to the case where speaker 2 starts speaking just before speaker 1 finishes speaking.
  • The relation between utterances (3-2) and (3-4) corresponds, for example, to the case where speaker 2 makes a short utterance, such as a backchannel response, while speaker 1 is making a long utterance. Both are phenomena that frequently occur in conversations between humans.
  • To extract utterance (3-1), the present disclosure uses the reference signal corresponding to utterance (3-1), that is, a rough amplitude spectrogram, together with the observation signal in the time range (3-6) (a mixture of the three sound sources), to generate (estimate) a signal that is as clean as possible (consisting only of the voice of speaker 1 and not containing the other sound sources).
  • Similarly, to extract utterance (3-3), the reference signal corresponding to (3-3) and the observation signal in the time range (3-7) are used to estimate a signal close to the clean voice of speaker 2. In this way, even if utterance sections overlap, different extraction results can be generated in the present disclosure as long as reference signals corresponding to the respective target sounds can be prepared.
  • The time range of speaker 2's utterance (3-4) is completely contained within speaker 1's utterance (3-2), but different extraction results can still be generated by preparing different reference signals for each. That is, to extract utterance (3-2), the reference signal corresponding to utterance (3-2) and the observation signal in the time range (3-8) are used, and to extract utterance (3-4), the reference signal corresponding to utterance (3-4) and the observation signal in the time range (3-9) are used.
  • the observation signal spectrogram X_k corresponding to the k-th microphone is expressed as a matrix having x_k (f, t) as an element as shown in the following equation (1).
  • f is the frequency bin number and t is the frame number, both of which are indexes introduced by the short-time Fourier transform.
  • changing f is referred to as the "frequency direction”
  • changing t is referred to as the "time direction”.
  • the uncorrelated observation signal spectrogram U_k and the separation result spectrogram Y_k are also expressed as matrices with u_k (f, t) and y_k (f, t) as elements, respectively (the notation of mathematical formulas is omitted).
  • The following equation (3) is an equation for obtaining the vector u(f, t) of the decorrelated observation signal. This vector is generated as the product of P(f), called the decorrelation matrix, and the observation signal vector x(f, t).
  • The decorrelation matrix P(f) is calculated by the following equations (4) to (6).
  • The above equation (4) is an equation for obtaining the covariance matrix R_{xx}(f) of the observed signal in the f-th frequency bin.
  • <·>_t on the right side represents the operation of taking the average over a predetermined range of t (frame numbers).
  • the range of t is the time length of the spectrogram, that is, the section in which the target sound is sounding (or the range including the section).
  • the superscript H represents Hermitian transpose (conjugate transpose).
  • In equation (5), V(f) is a matrix consisting of eigenvectors and D(f) is a diagonal matrix consisting of eigenvalues; V(f) is a unitary matrix, so its inverse and its Hermitian transpose are identical.
  • The decorrelation matrix P(f) is calculated by equation (6). Since D(f) is a diagonal matrix, its -1/2 power can be obtained by raising each diagonal element to the -1/2 power.
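  • A minimal sketch of this decorrelation for one frequency bin (illustrative only; it assumes the decomposition R_{xx}(f) = V(f) D(f) V(f)^H and P(f) = D(f)^{-1/2} V(f)^H as read from the surrounding description of equations (3) to (6)):
```python
import numpy as np

def whiten(X_f):
    """Decorrelate (whiten) the multi-channel observation of one frequency bin.

    X_f: observation vectors x(f, t) stacked over frames, shape (N_mics, T), complex
    Returns (U_f, P_f): whitened signal u(f, t) and the decorrelation matrix P(f).
    """
    T = X_f.shape[1]
    R_xx = X_f @ X_f.conj().T / T          # covariance matrix, cf. Eq. (4)
    eigvals, V = np.linalg.eigh(R_xx)      # eigendecomposition R_xx = V D V^H, cf. Eq. (5)
    D_inv_sqrt = np.diag(eigvals ** -0.5)  # D^{-1/2}: -1/2 power of each diagonal element
    P = D_inv_sqrt @ V.conj().T            # decorrelation matrix, cf. Eq. (6)
    U_f = P @ X_f                          # u(f, t) = P(f) x(f, t), cf. Eq. (3)
    return U_f, P
```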
  • The following equation (8) generates the separation results y(f, t) for all channels at f, t, obtained as the product of the separation matrix W(f) and u(f, t). The method for obtaining W(f) will be described later.
  • the reference signal R is represented as a matrix having r (f, t) as an element, as in Eq. (12).
  • the shape itself is the same as the observation signal spectrogram X_k, but the element x_k (f, t) of X_k is a complex number, while the element r (f, t) of R is a non-negative real number.
  • This disclosure estimates only w_1 (f) instead of estimating all the elements of the separation matrix W (f). That is, only the elements used in the generation of the first separation result (target sound extraction result) are estimated.
  • the derivation of the formula for estimating w_1 (f) will be described.
  • the derivation of the equation consists of the following three points, each of which will be explained in order.
  • The objective function used in the present disclosure is a negative log-likelihood and is basically the same as that used in Document 1 and the like. This objective function is minimized when the separation results are mutually independent.
  • the objective function is derived as follows.
  • Equation (13) is a modification of equation (3), the decorrelation equation, and equation (14) is a modification of equation (8), the separation equation.
  • The reference signal r(f, t) is added to the vectors on both sides, and an element 1, which represents "passing the reference signal through", is added to the matrix on the right side.
  • The matrices and vectors to which these elements have been added are denoted by adding a prime to the original symbols.
  • the negative log-likelihood L of the reference signal and the observed signal represented by the following equation (15) is used.
  • p(·) represents the probability density function (hereinafter referred to as pdf) of the signal in parentheses.
  • pdf: probability density function
  • p (R, X_1, ..., X_N) in Eq. (15) is the probability that the reference signal R and the observed signal spectrograms X_1 to X_N occur at the same time.
  • Since the joint probability of independent variables can be decomposed into the product of their individual pdfs, the left side of equation (16) is transformed into the right side by Assumption 1. The inside of the parentheses on the right side is expressed as in equation (17) using x'(f, t) introduced in equation (13).
  • Equation (17) is transformed into equation (18) and equation (19) using the relationship in the lower part of equation (14).
  • det(·) represents the determinant of the matrix in parentheses.
  • Equation (20) is an important transformation in the deflation method. Since the matrix W(f)' is a unitary matrix like the separation matrix W(f), its determinant is 1. Also, since the matrix P'(f) does not change during separation, its determinant is a constant. Therefore, both determinants can be absorbed together into a constant.
  • Equation (21) is a unique variant of this disclosure.
  • The components of y'(f, t) are r(f, t) and y_1(f, t) to y_N(f, t). According to Assumptions 2 and 3, the probability density function that takes these variables as arguments is decomposed into the product of p(r(f, t), y_1(f, t)), the joint pdf of r(f, t) and y_1(f, t), and the individual probability density functions p(y_2(f, t)) to p(y_N(f, t)).
  • Substituting equation (21) into equation (15) gives equation (22).
  • the extraction filter w_1 (f) is a subset of the arguments that minimize equation (22).
  • The sound source model p(r(f, t), y_1(f, t)) is a pdf that takes two variables, the reference signal r(f, t) and the extraction result y_1(f, t), as arguments, and it represents the dependency between these two variables.
  • the sound source model can be formulated based on various concepts. In this disclosure, the following three methods are used.
  • a spherical distribution is a type of multi-variate pdf.
  • a multivariate pdf is constructed by considering multiple arguments of a pdf as a vector and substituting the norm (L2 norm) of the vector into the univariate pdf.
  • Using a spherical distribution in independent component analysis has the effect of making the variables used as its arguments resemble one another.
  • the technique described in Japanese Patent No. 4449871 utilizes this property to solve a problem called a frequency permutation problem, in which "which sound source appears in the kth separation result differs for each frequency bin".
  • In the present disclosure, by using a spherical distribution that takes the reference signal and the extraction result as its arguments, the extraction result can be made similar to the reference signal.
  • the spherical distribution used here can be expressed by the general form of the following equation (24).
  • the function F is any univariate pdf.
  • c_1 and c_2 are positive constants, and the influence of the reference signal on the extraction result can be adjusted by changing these values.
  • Using a Laplace-type distribution as F, the following equation (25) is obtained; hereinafter, this equation is referred to as the bivariate Laplace distribution.
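  • A minimal sketch of such a spherical sound source model (an assumption-laden illustration based on the general form of equation (24) with a Laplace-type univariate pdf; the exact constants and normalization of equation (25) are not reproduced here):
```python
import numpy as np

def bivariate_laplace_logpdf(r, y, c1=1.0, c2=1.0):
    """Unnormalized log-pdf of a spherical, bivariate Laplace-type sound source model.

    r: reference signal value r(f, t), non-negative real
    y: extraction result value y_1(f, t), complex
    c1, c2: positive constants adjusting the influence of the reference signal
    """
    # Spherical distribution: substitute the L2 norm of (sqrt(c1)*r, sqrt(c2)*|y|)
    # into a univariate Laplace-type pdf proportional to exp(-x).
    return -np.sqrt(c1 * r ** 2 + c2 * np.abs(y) ** 2)
```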
  • Divergence-based model: Another type of sound source model is a pdf based on a divergence, which is a broader concept than a distance measure, and is expressed in the form of the following equation (26).
  • The divergence in equation (26) is computed between the reference signal r(f, t) and the amplitude of the extraction result |y_1(f, t)|.
  • Equation (30) below is a pdf based on another divergence. In either case, the pdf takes a larger value when the reference signal and the amplitude of the extraction result are similar.
  • Time-frequency-varying variance model: As another sound source model, a time-frequency-varying variance (TFVV) model is also possible. This is a model in which each point making up the spectrogram has a different variance or standard deviation over time and frequency. The rough amplitude spectrogram serving as the reference signal is then interpreted as representing the standard deviation of each point (or some value depending on the standard deviation).
  • TFVV: time-frequency-varying variance
  • As the distribution, a Laplace distribution whose variance varies with time and frequency (hereinafter referred to as the TFVV Laplace distribution) can be used.
  • This distribution includes a term for adjusting the magnitude of the influence of the reference signal on the extraction result.
  • equation (32) is obtained.
  • When a Student's t distribution with time-frequency-varying variance is used, the sound source model of the following equation (33) can be obtained.
  • ν in equation (33) is a parameter called the degree of freedom, and the shape of the distribution changes with this value.
  • ν = 1 corresponds to the Cauchy distribution, and ν → ∞ corresponds to the Gaussian distribution.
  • Auxiliary function method: A fast and stable algorithm called the auxiliary function method can be applied to equations (25), (31), and (33).
  • To equations (27) to (30), another algorithm called the fixed-point method can be applied.
  • Eig (A) represents a function that takes a matrix A as an argument and performs eigenvalue decomposition on the matrix to obtain all eigenvectors.
  • the eigenvectors of the weighted covariance matrix in equation (34) can be written as in equation (35) below.
  • a_{min}(f), ..., a_{max}(f) on the left side of equation (35) are eigenvectors; a_{min}(f) corresponds to the smallest eigenvalue and a_{max}(f) to the largest eigenvalue.
  • the norm of each eigenvector is 1, and it is assumed that they are orthogonal to each other.
  • w_1(f), which minimizes equation (34), is the Hermitian transpose of the eigenvector corresponding to the smallest eigenvalue, as shown in equation (36) below.
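  • The step of taking the Hermitian transpose of the eigenvector associated with the smallest eigenvalue of a weighted covariance matrix could be sketched as follows (illustrative; the exact per-frame weights are given by the chosen sound source model and are not reproduced here):
```python
import numpy as np

def min_eigvec_filter(U_f, weights):
    """Estimate w_1(f) as the Hermitian transpose of the eigenvector with the smallest eigenvalue.

    U_f: whitened observation u(f, t) stacked over frames, shape (N_mics, T), complex
    weights: non-negative per-frame weights, shape (T,), derived from the sound source model
    """
    T = U_f.shape[1]
    # Weighted covariance matrix of the whitened observation (cf. Eq. (34)/(41))
    C = (U_f * weights) @ U_f.conj().T / T
    eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues returned in ascending order
    a_min = eigvecs[:, 0]                  # eigenvector of the smallest eigenvalue
    return a_min.conj()                    # w_1(f) = a_min(f)^H, used as a row vector
```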
  • the auxiliary function method is one of the methods for efficiently solving an optimization problem, and details thereof are described in Japanese Patent Application Laid-Open No. 2011-175114 and Japanese Patent Application Laid-Open No. 2014-219467.
  • As equation (38), an inequality that bounds the objective function from above is prepared.
  • The right-hand side of equation (38) is called the auxiliary function, and b(f, t) appearing in it is called the auxiliary variable.
  • The minimization problem is solved quickly and stably by repeating the following two steps alternately. 1. As shown in equation (40) below, fix w_1(f) and find b(f, t) that minimizes G. 2. As shown in equation (41) below, fix b(f, t) and find w_1(f) that minimizes G.
  • Equation (40) is minimized when the equality in equation (38) holds. Since the value of y_1(f, t) changes every time w_1(f) changes, it is calculated using equation (9). Since equation (41) is a minimization problem involving a weighted covariance matrix, like equation (34), it can be solved using eigenvalue decomposition.
  • When the eigenvectors of the weighted covariance matrix in equation (41) are calculated by the following equation (42), the solution of equation (41), w_1(f), is the Hermitian transpose of the eigenvector corresponding to the smallest eigenvalue (equation (36)).
  • a) Use normalize(r(f, t)) as the auxiliary variable b(f, t).
  • b) Calculate a tentative value as the separation result y_1 (f, t), and calculate the auxiliary variable from it by equation (40).
  • c) Substitute a temporary value for w_1 (f) to calculate equation (40).
  • Normalize () in a) above is a function defined by the following equation (43), and s (t) in this equation represents an arbitrary time series signal. The function of normalize () is to normalize the root mean square of the absolute value of the signal to 1.
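  • A rough sketch of normalize() and of the alternating updates is shown below (illustrative only; it assumes equation (43) normalizes the root mean square of the absolute value to 1, and the auxiliary-variable and weighting updates are only schematic, the exact forms being given by equations (40) and (41)):
```python
import numpy as np

def normalize(s):
    """Scale a time series so the root mean square of its absolute value is 1 (cf. Eq. (43))."""
    return s / np.sqrt(np.mean(np.abs(s) ** 2))

def estimate_filter(U_f, r_f, n_iter=10, eps=1e-6):
    """Alternating, auxiliary-function-style estimation of w_1(f) for one frequency bin.

    U_f: whitened observation, shape (N_mics, T), complex
    r_f: reference signal for this bin, shape (T,), non-negative real
    """
    T = U_f.shape[1]
    b = normalize(r_f) + eps                 # initialization a): b(f, t) = normalize(r(f, t))
    w = None
    for _ in range(n_iter):
        # Step 2 (cf. Eq. (41)): fix b and minimize a weighted covariance form over w_1(f)
        C = (U_f / b) @ U_f.conj().T / T     # schematic weighting; exact weights per Eq. (41)
        _, eigvecs = np.linalg.eigh(C)
        w = eigvecs[:, 0].conj()             # Hermitian transpose of the min-eigenvalue eigenvector
        # Step 1 (cf. Eq. (40)): fix w_1(f), recompute y_1(f, t), and update the auxiliary variable
        y = w @ U_f                          # y_1(f, t) = w_1(f) u(f, t), cf. Eq. (9)
        b = np.maximum(np.abs(y), eps)       # schematic update; exact form per Eq. (40)
    return w
```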
  • In addition, the value of the extraction filter estimated for the previous target sound section may be saved and used as the initial value of w_1(f) when processing the next target sound section. For example, when sound source extraction is performed for the utterance (3-2) shown in FIG. 3, the extraction filter estimated for the previous utterance (3-1) of the same speaker is used as the tentative value of w_1(f) in this extraction.
  • w_1 (f) may be obtained by using the update formula derived from the TFVV Gaussian distribution only for the first time.
  • the step of obtaining the extraction filter w_1 (f) (corresponding to the equation (41)) can be expressed as the following equation (47).
  • the step for finding the auxiliary variable b (f, t) is as shown in Eq. (49) below.
  • The degree of freedom ν functions as a parameter for adjusting the relative influence of r(f, t), the reference signal, and y_1(f, t), the extraction result obtained during the iterations.
  • When ν = 0, the reference signal is ignored; when ν is 0 or more and less than 2, the influence of the extraction result is larger than that of the reference signal; when ν is greater than 2, the influence of the reference signal is larger; and in the limit ν → ∞, the extraction result is ignored, which is equivalent to the TFVV Gaussian distribution.
  • the step for obtaining the extraction filter w_1 (f) is as shown in the following equation (50). Since the formula (50) is the same as the formula (47) in the case of the bivariate Laplace distribution, the extraction filter can be similarly obtained by the formula (48).
  • The update equation is derived in a fixed-point form with respect to w_1(f).
  • As the condition that holds at convergence, the equation stating that the partial derivative with respect to the parameter is zero is used, and a concrete equation is derived by performing the partial differentiation shown in equation (51).
  • The left side of equation (51) is the partial derivative with respect to conj(w_1(f)). Equation (51) is then transformed to obtain the form of equation (52).
  • Equation (55) is written in two forms: the upper form is intended to be used after calculating y_1(f, t) using equation (9), while the lower form uses w_1(f) and u(f, t) directly without calculating y_1(f, t). The same applies to equations (56) to (60) described later.
  • w_1(f) is calculated by either of the following methods. a) Calculate a tentative value as the separation result y_1(f, t), and then calculate w_1(f) from it using the upper form of equation (55). b) Substitute a tentative value into w_1(f), and calculate w_1(f) from it using the lower form of equation (55). For the tentative value of y_1(f, t) in a), the method of b) in the explanation of equation (40) can be used. Similarly, for the tentative value of w_1(f) in b), the method of c) in the explanation of equation (40) can be used.
  • the update formulas derived from the formula (28), which is a pdf corresponding to Itakura Saito divergence (power spectrogram version), are the following formulas (56) and (57).
  • Equation (57) is as follows.
  • Since equation (52) can be transformed into two forms, there are also two update equations.
  • The second term on the right side of the lower form of equation (56) and the third term on the right side of the lower form of equation (57) are both composed only of u(f, t) and r(f, t) and remain constant during the iterative process. Therefore, these terms need to be calculated only once before the iterations, and the inverse matrix in equation (57) also needs to be calculated only once.
  • Equation (59) is as follows.
  • FIG. 4 is a diagram showing a configuration example of a sound source extraction device (sound source extraction device 100) which is an example of the signal processing device according to the present embodiment.
  • The sound source extraction device 100 includes, for example, a plurality of microphones 11, an AD (Analog to Digital) conversion unit 12, an STFT (Short-Time Fourier Transform) unit 13, an observation signal buffer 14, a section estimation unit 15, a reference signal generation unit 16, a sound source extraction unit 17, and a control unit 18.
  • the sound source extraction device 100 includes a post-stage processing unit 19 and a section / reference signal estimation sensor 20 as needed.
  • The plurality of microphones 11 are installed at different positions. There are several variations of microphone installation, as described later. A mixed sound signal in which the target sound and sounds other than the target sound are mixed is input through the microphones 11.
  • the AD conversion unit 12 converts the multi-channel signal acquired by each microphone 11 into a digital signal for each channel. This signal is appropriately referred to as an observation signal (in the time domain).
  • the STFT unit 13 converts the observed signal into a signal in the time frequency domain by applying a short-time Fourier transform to the observed signal.
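  • A minimal short-time Fourier transform consistent with this description could look as follows (the frame length, shift, and window are hypothetical values, not those of the embodiment):
```python
import numpy as np

def stft(x, frame_len=1024, hop=256):
    """Convert a time-domain observation signal into a time-frequency-domain spectrogram.

    x: single-channel time-domain signal, shape (n_samples,)
    Returns a complex spectrogram of shape (frame_len // 2 + 1, n_frames).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    spec = np.empty((frame_len // 2 + 1, n_frames), dtype=complex)
    for t in range(n_frames):
        frame = x[t * hop : t * hop + frame_len] * window
        spec[:, t] = np.fft.rfft(frame)  # one column per frame, one row per frequency bin
    return spec
```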
  • the observation signal in the time frequency domain is sent to the observation signal buffer 14 and the interval estimation unit 15.
  • the observation signal buffer 14 stores observation signals for a predetermined time (number of frames).
  • The observation signal is stored frame by frame, and when a request specifying a time range of the observation signal is received from another module, the observation signal corresponding to that time range is returned.
  • the signal accumulated here is used in the reference signal generation unit 16 and the sound source extraction unit 17.
  • the section estimation unit 15 detects a section in which the target sound is included in the mixed sound signal. Specifically, the section estimation unit 15 detects the start time (time when the sound starts to sound), the end time (time when the sound ends), and the like of the target sound. The technique used to estimate this section depends on the usage scene of this embodiment and the installation mode of the microphone, and will be described in detail later.
  • the reference signal generation unit 16 generates a reference signal corresponding to the target sound based on the mixed sound signal. For example, the reference signal generation unit 16 estimates a rough amplitude spectrogram of the target sound. Since the processing performed by the reference signal generation unit 16 depends on the usage scene of this embodiment and the installation mode of the microphone, the details will be described later.
  • The sound source extraction unit 17 extracts, from the mixed sound signal, a signal that is similar to the reference signal and in which the target sound is further emphasized. Specifically, the sound source extraction unit 17 estimates the target sound by using the observation signal and the reference signal corresponding to the section in which the target sound is present, or estimates an extraction filter for generating such an estimation result from the observation signal.
  • the output of the sound source extraction unit 17 is sent to the post-stage processing unit 19 as needed.
  • Examples of the post-stage processing performed by the post-stage processing unit 19 include voice recognition and the like.
  • the sound source extraction unit 17 outputs a time domain extraction result, that is, a voice waveform, and the voice recognition unit performs recognition processing on the voice waveform.
  • The voice-section detection function on the voice recognition side can be omitted because the present embodiment includes the section estimation unit 15, which is equivalent to it. Furthermore, voice recognition often includes an STFT for extracting the voice features required for recognition from the waveform, but when combined with the present embodiment, the STFT on the voice recognition side may be omitted.
  • In that case, the sound source extraction unit 17 outputs the extraction result in the time-frequency domain, that is, a spectrogram, and on the voice recognition side the spectrogram is converted into voice features.
  • the control unit 18 comprehensively controls each unit of the sound source extraction device 100.
  • the control unit 18 controls, for example, the operation of each of the above-mentioned units. Although omitted in FIG. 4, the control unit 18 and the above-mentioned functional blocks are connected to each other.
  • The section/reference signal estimation sensor 20 is a sensor, different from the microphones 11, that is intended to be used for section estimation or reference signal generation.
  • The post-stage processing unit 19 and the section/reference signal estimation sensor 20 are shown in parentheses to indicate that they can be omitted from the sound source extraction device 100. That is, if the accuracy of section estimation or reference signal generation can be improved by providing a dedicated sensor different from the microphones 11, such a sensor may be used.
  • an image sensor (camera) can be applied as a sensor.
  • the following sensors used as auxiliary sensors in Japanese Patent Application No. 2019-073542 proposed by the present inventor may be provided, and section estimation or reference signal generation may be performed using the signals acquired thereby.
  • -A type of microphone that is used in close contact with the body, such as a bone conduction microphone and a pharyngeal microphone.
  • -A sensor that can observe the vibration of the skin surface near the speaker's mouth and throat. For example, a combination of a laser pointer and an optical sensor.
  • FIG. 5 is a diagram assuming a situation in which there are N (two or more) speakers in a certain environment and a microphone is assigned to each speaker. Assigning a microphone means that each speaker is wearing a pin microphone, a headset microphone, or the like, or the microphone is installed at a close distance to each speaker. Let N speakers be S1, S2 ... Sn, and microphones assigned to each speaker be M1, M2 ... Mn. Further, there are 0 or more interfering sound sources Ns.
  • a conference is held in a room, and in order to automatically create the minutes of the conference, voice recognition is performed for the voice picked up by each speaker's microphone.
  • the utterances may overlap with each other, and when the utterances overlap, a signal in which the voices are mixed is observed in each microphone.
  • a disturbing sound source there may be a sound of a fan of a projector or an air conditioner, a reproduced sound emitted from a device equipped with a speaker, and the like, and these sounds are also included in the observation signal of each microphone.
  • the section detection method and reference signal generation method that can be used in such a situation will be described below.
  • the voice of the corresponding (target) speaker will be referred to as the main voice or the main utterance, and the voice of another speaker will be appropriately referred to as the wraparound voice or crosstalk.
  • the main utterance detection described in Japanese Patent Application No. 2019-227192 can be used.
  • In this way, a detector that responds to the main voice while ignoring crosstalk is realized. Furthermore, even if utterances overlap, the section and the speaker of each utterance can be estimated as shown in the figure.
  • One reference signal generation method is to generate the reference signal directly from the signal observed by the microphone assigned to the speaker.
  • The signal observed by microphone M1 in FIG. 5 is a mixture of all sound sources, but the sound of speaker S1, the nearest sound source, is picked up loudly, while the other sound sources are picked up more quietly. Therefore, if the observation signal of microphone M1 is cut out according to the utterance section of speaker S1, a short-time Fourier transform is applied to it, and the absolute value is taken to generate an amplitude spectrogram, this is a rough amplitude spectrogram of the target sound and can be used as a reference signal in this embodiment.
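  • A sketch of this reference signal generation (cut out the utterance section of the assigned microphone, apply the STFT, and take absolute values); the function names are illustrative, not part of the patent:
```python
import numpy as np

def reference_from_near_mic(x_mic, start_sample, end_sample, stft_fn):
    """Generate a rough amplitude spectrogram of the target sound from the assigned microphone.

    x_mic: time-domain signal of the microphone assigned to the speaker
    start_sample, end_sample: utterance section obtained from the section estimation step
    stft_fn: a short-time Fourier transform function returning a complex spectrogram
    """
    segment = x_mic[start_sample:end_sample]  # cut out the utterance section
    spec = stft_fn(segment)                   # complex spectrogram of the section
    return np.abs(spec)                       # amplitude spectrogram used as the reference signal
```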
  • Another method is to use the crosstalk reduction technique described in Japanese Patent Application No. 2019-227192 described above.
  • In this technique, a neural network removes (reduces) the crosstalk from the signal in which the main voice and the crosstalk are mixed, leaving the main voice.
  • the output of this neural network is an amplitude spectrogram or a time-frequency mask of the crosstalk reduction result, and the former can be used as a reference signal as it is.
  • Even in the latter case, by applying the time-frequency mask to the amplitude spectrogram of the observation signal, an amplitude spectrogram of the crosstalk-reduction result can be generated, so it can also be used as a reference signal.
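  • A sketch of the mask-based case (illustrative; mask values are assumed to lie in [0, 1]):
```python
import numpy as np

def reference_from_mask(X_obs, mask):
    """Generate a reference amplitude spectrogram by masking the observation.

    X_obs: complex observation spectrogram of the assigned microphone, shape (F, T)
    mask: time-frequency mask output by the crosstalk-reduction network, shape (F, T), values in [0, 1]
    """
    return np.abs(X_obs) * mask  # rough amplitude spectrogram of the main voice
```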
  • FIG. 6 assumes an environment in which there are one or more speakers and one or more interfering sound sources.
  • In the example described above, the focus was on the overlap of utterances rather than on the presence of the disturbing sound source Ns; in the example shown in FIG. 6, the focus is on obtaining a clean voice in a noisy environment in which loud disturbing sounds are present.
  • Of course, the overlap of utterances is also an issue here.
  • There are n speakers, denoted speaker S1 to speaker Sn, where n is 1 or more. In FIG. 6, only one disturbing sound source Ns is shown, but the number of disturbing sound sources is arbitrary.
  • Two types of sensors are used. One is a sensor worn by each speaker or installed in the immediate vicinity of each speaker (corresponding to the section/reference signal estimation sensor 20); these are hereinafter referred to as sensors SE (sensors SE1, SE2, ..., SEn) as appropriate.
  • the other is a microphone array 11A composed of a plurality of microphones 11 having a fixed position.
  • The section/reference signal estimation sensor 20 may be of the same type as the microphones shown in FIG. 5 (so-called air conduction microphones, which pick up sound propagating through the air). As described with reference to FIG. 4, a microphone of the type used in close contact with the body, such as a bone conduction microphone or a pharyngeal microphone, or a sensor that can observe the vibration of the skin surface near the speaker's mouth or throat, may also be used. In any case, since each sensor SE is closer to, or in close contact with, its speaker than the microphone array, the utterance of the speaker corresponding to each sensor can be recorded at a high SN ratio.
• As for the microphone array 11A, in addition to the form in which a plurality of microphones are installed in one device, a form in which microphones are installed at a plurality of places in a space, called distributed microphones, is also possible.
• As distributed microphones, forms in which the microphones are installed on the walls or ceiling of a room, or on the seats, walls, ceiling, dashboard, and the like of an automobile, can be considered.
• The signals acquired by the sensors SE1 to SEn, corresponding to the section / reference signal estimation sensor 20, are used for section estimation and reference signal generation, while the multi-channel observation signals acquired from the microphone array 11A are used for sound source extraction.
• As for the section estimation method and the reference signal generation method when an air conduction microphone is used as the sensor SE, the same methods as those described with reference to FIG. 5 can be used.
• When a close-contact microphone is used, the following methods can also be applied in addition to the same method as described above.
• For section estimation, a method of discrimination by thresholding the power of the input signal can be used, and as the reference signal, the amplitude spectrogram generated from the input signal can be used as it is.
• The sound recorded by a close-contact microphone is not always appropriate as an input for voice recognition and the like, because high frequencies are attenuated and sounds generated inside the body, such as swallowing sounds, may also be recorded; however, it can be used effectively for section estimation and reference signal generation.
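A minimal sketch of the power-threshold section estimation mentioned above is shown below. The frame length, hop size, and threshold value are illustrative assumptions, not values specified in this disclosure.

```python
import numpy as np

def detect_sections(signal, fs, frame=1024, hop=256, thresh_db=-40.0):
    """Return (start_frame, end_frame) pairs where frame power exceeds a threshold."""
    n_frames = 1 + (len(signal) - frame) // hop
    powers = np.array([np.mean(signal[i * hop:i * hop + frame] ** 2)
                       for i in range(n_frames)])
    active = 10 * np.log10(powers + 1e-12) > thresh_db
    # convert the boolean frame activity into (start, end) frame pairs
    sections, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            sections.append((start, i)); start = None
    if start is not None:
        sections.append((start, len(active)))
    return sections
```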
• The relationship from the sound acquired by the air conduction microphone (a mixture of the target sound and the interfering sound) and the signal acquired by the auxiliary sensor (some signal corresponding to the target sound) to a clean target sound is learned in advance by a neural network; at inference time, the signals acquired by the air conduction microphone and the auxiliary sensor are input to the neural network to generate a nearly clean target sound.
• If the output of the neural network is an amplitude spectrogram (or a time-frequency mask), it can be used as a reference signal in this embodiment (or a reference signal can be generated from it).
• Since a method of generating a clean target sound and at the same time estimating the section in which the target sound is sounding has also been reported, such a method can also be used as the section detecting means.
  • Sound source extraction is basically performed using the observation signal acquired by the microphone array 11A.
• For section estimation, a signal derived from the microphone array may be used in addition to the sensors SE. Since the microphone array 11A is far from every speaker, each speaker's utterance is always observed together with crosstalk. By comparing this signal with the signal of the section / reference signal estimation sensor, an improvement in the accuracy of section estimation can be expected, especially when utterances overlap.
• FIG. 7 shows a microphone installation form different from that of FIG. 6. It is the same as FIG. 6 in that it assumes an environment with one or more speakers and one or more interfering sound sources, but the only microphone used is the microphone array 11A; there is no sensor installed in the immediate vicinity of each speaker. As in FIG. 6, the form of the microphone array 11A may be a plurality of microphones installed in one device, a plurality of microphones installed in a space (distributed microphones), or the like.
• The case where the mixing of voices is low is, for example, the case where there is only one speaker in the environment (that is, only the speaker S1) and the interfering sound source Ns can be regarded as non-voice.
• In that case, a voice section detection technique focusing on "voice-likeness", as described in Japanese Patent No. 4182444 and the like, can be applied. That is, in the environment of FIG. 7, when it is considered that the only "voice-like" signal is the utterance of the speaker S1, the non-voice signals are ignored, and the locations (timings) in which a voice-like signal is included are detected as sections of the target sound.
• For reference signal generation, a method called denoising as described in Document 3 can be used, that is, a process of inputting a signal in which voice and non-voice are mixed, removing the non-voice, and leaving the voice.
  • a wide variety of methods can be applied to denoising.
  • the following method uses a neural network, and its output is an amplitude spectrogram, so that the output can be used as a reference signal as it is.
• Reference 3: D. Liu, P. Smaragdis, and M. Kim, "Experiments on deep learning for speech denoising," Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH 2014), pp. 2685-2689.
• In that case, the sound source direction estimation that is the premise of a) can be applied. Further, if an image sensor (camera) is used as the section / reference signal estimation sensor 20 in the example shown in FIG. 4, b) can also be applied. In either method, the direction of the utterance is known when the utterance section is detected (in method b), the utterance direction can be calculated from the position of the lips in the image), so that value can be used for reference signal generation.
  • the sound source direction estimated in the utterance section estimation is appropriately referred to as ⁇ .
  • the reference signal generation method also needs to support mixing of voices, and the following can be applied as such a technique.
• a) Time-frequency masking using the sound source direction: this is the reference signal generation method used in Japanese Patent Application Laid-Open No. 2014-219467. A steering vector corresponding to the sound source direction θ is calculated, and when the cosine similarity between it and the observation signal vector (Eq. (2) described above) is calculated, the result is a mask that leaves the sound arriving from the direction θ and attenuates the sounds arriving from other directions. This mask is applied to the amplitude spectrogram of the observed signal, and the signal thus generated is used as a reference signal (a sketch is given after the references below).
• The selective listening technology is a technology that extracts the voice of a specified speaker from a monaural signal in which multiple voices are mixed.
  • the voice of the designated speaker included in the mixed signal is output.
  • a time-frequency mask is output to generate such a spectrogram. When the mask thus output is applied to the amplitude spectrogram of the observed signal, it can be used as the reference signal of the present embodiment.
  • Speaker Beam and Voice Filter are described in Documents 4 and 5 below, respectively.
• Reference 4: M. Delcroix, K. Zmolikova, K. Kinoshita, A. Ogawa, and T. Nakatani, "Single channel target speaker extraction and recognition with speaker beam," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
• Reference 5: Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. Hershey, R. A. Saurous, R. J. Weiss, Y. Jia, and I. Lopez Moreno, "VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking," INTERSPEECH, 2019.
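The following is a minimal sketch of the direction-based time-frequency masking of method a) above: the cosine similarity between a steering vector for the direction θ and each observation vector is used as the mask. The plane-wave steering-vector model for a linear array (microphone positions mic_pos, sound speed c) is an illustrative assumption; the disclosure only states that a steering vector corresponding to θ is used.

```python
import numpy as np

def direction_mask(obs, theta, mic_pos, fs, n_fft, c=340.0):
    """obs: complex observation, shape (n_mics, n_freq, n_frames).
       Returns a (n_freq, n_frames) mask in [0, 1]."""
    n_mics, n_freq, n_frames = obs.shape
    freqs = np.arange(n_freq) * fs / n_fft                    # Hz per bin
    delays = mic_pos * np.sin(theta) / c                      # shape (n_mics,)
    steer = np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])
    steer /= np.linalg.norm(steer, axis=0, keepdims=True)     # unit-norm columns
    num = np.abs(np.einsum('mf,mft->ft', steer.conj(), obs))  # |steer^H x|
    den = np.linalg.norm(obs, axis=0) + 1e-12                 # |x|
    return num / den        # cosine similarity, applied to the amplitude spectrogram
```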
  • the sound source extraction unit 17 includes, for example, a pre-processing unit 17A, an extraction filter estimation unit 17B, and a post-processing unit 17C.
• The preprocessing unit 17A performs the decorrelation (whitening) processing shown in equations (3) to (7), that is, decorrelation of the time-frequency domain observation signal.
• The extraction filter estimation unit 17B estimates a filter that extracts a signal in which the target sound is emphasized. Specifically, the extraction filter estimation unit 17B estimates the extraction filter for sound source extraction and generates the extraction result. More specifically, the extraction filter estimation unit 17B estimates the extraction filter as a solution that optimizes an objective function reflecting both the dependency between the reference signal and the extraction result by the extraction filter, and the independence between the extraction result and the separation results of other virtual sound sources.
• As the sound source model that represents the dependency between the reference signal and the extraction result included in the objective function, the extraction filter estimation unit 17B uses any of the following: a bivariate spherical distribution of the extraction result and the reference signal; a time-frequency-varying variance model that regards the reference signal as a value corresponding to the variance for each time-frequency bin; or a model that uses the divergence between the absolute value of the extraction result and the reference signal.
  • the bivariate Laplace distribution may be used as the bivariate spherical distribution.
• Any one of the time-frequency-varying variance (TFVV) Gaussian distribution, the TFVV Laplace distribution, and the TFVV Student-t distribution may be used as the time-frequency-varying variance model.
• As the divergence, any of the following may be used: the Euclidean distance (square error) between the absolute value of the extraction result and the reference signal; the Itakura-Saito distance between the power spectrum of the extraction result and the power spectrum of the reference signal; the Itakura-Saito distance between the amplitude spectrum of the extraction result and the reference amplitude spectrum; or the square error between the ratio of the absolute value of the extraction result to the reference signal and 1.
  • the post-processing unit 17C performs at least the processing of applying the extraction filter to the mixed sound signal.
  • the post-processing unit 17C may perform a process of applying an inverse Fourier transform to the extraction result spectrogram to generate an extraction result waveform.
  • step ST11 the AD conversion unit 12 converts the analog observation signal (mixed sound signal) input to the microphone 11 into a digital signal.
  • the observed signal at this point is in the time domain. Then, the process proceeds to step ST12.
  • step ST12 the STFT unit 13 applies a short-time Fourier transform (STFT) to the observation signal in the time domain to obtain the observation signal in the time frequency domain.
• Input may be performed not only from the microphone but also from a file or a network as needed. Details of the specific processing performed by the STFT unit 13 will be described later.
• AD conversion and STFT are performed for each of the channels. Then, the process proceeds to step ST13.
• In step ST13, a process (buffering) is performed in which the observation signals converted into the time-frequency domain by the STFT are accumulated for a predetermined time (a predetermined number of frames). Then, the process proceeds to step ST14.
• The section estimation unit 15 estimates the start time (the time when the sound starts to sound) and the end time (the time when the sound ends) of the target sound. Furthermore, when used in an environment where utterances may overlap with each other, information that can identify which speaker made the utterance is also estimated. For example, in the usage patterns shown in FIGS. 5 and 6, the microphone (sensor) number assigned to each speaker is also estimated, and in the usage pattern shown in FIG. 7, the direction of the utterance is also estimated.
• Sound source extraction and the associated processing are performed for each section of the target sound. Therefore, the process proceeds to step ST16 only when a section is detected; if no section is detected, steps ST16 to ST19 are skipped and the process proceeds to step ST20.
  • step ST16 the reference signal generation unit 16 generates a rough amplitude spectrogram of the target sound sounding in that section.
  • the methods that can be used to generate the reference signal are as described with reference to FIGS. 5 to 7. Then, the process proceeds to step ST17.
  • step ST17 the sound source extraction unit 17 generates the extraction result of the target sound by using the reference signal obtained in step ST16 and the observation signal corresponding to the time range of the target sound section. The details of the process will be described later.
  • step ST18 it is determined whether or not the processing related to step ST16 and step ST17 is repeated a predetermined number of times.
  • the meaning of this iteration is that if the sound source extraction process generates an extraction result with higher accuracy than the observed signal or reference signal, then the reference signal is regenerated from the extraction result, and the sound source extraction process is executed again using it. This means that the extraction result can be obtained with higher accuracy than the previous time.
  • the present embodiment is characterized in that the extraction process is repeated instead of the separation process. It should be noted that this iteration is different from the iteration used when estimating the filter by the auxiliary function method or the fixed point method inside the sound source extraction process according to step ST17. After the process according to step ST18, the process proceeds to step ST19.
  • step ST19 the post-processing is performed by the post-processing unit 17C using the extraction result generated in step ST17.
  • voice recognition and response generation for voice dialogue using the recognition result can be considered. Then, the process proceeds to step ST20.
  • step ST20 it is determined whether or not to continue the process. If it continues, the process returns to step ST11, and if it does not continue, the process ends.
  • the short-time Fourier transform performed by the STFT unit 13 will be described with reference to FIG.
• the microphone observation signal is a multi-channel signal observed by a plurality of microphones
• the STFT is performed for each channel. The following is a description of the STFT for the k-th channel.
• a certain length is cut out from the waveform of the microphone recording signal obtained by the AD conversion process according to step ST11, and a window function such as a Hanning window or a Hamming window is applied to it (see FIG. 10A).
  • This cut out unit is called a frame.
  • x_k (1, t) to x_k (F, t) are obtained as observation signals in the time frequency domain.
  • t represents the frame number
  • F represents the total number of frequency bins (see FIG. 10C).
  • x_k (1, t) to x_k (F, t) is collectively described as one vector x_k (t) (see FIG. 10C difference).
  • x_k (t) is called a spectrum, and a data structure in which multiple spectra are arranged in the time direction is called a spectrogram.
  • the horizontal axis represents the frame number and the vertical axis represents the frequency bin number, and three spectra 51A, 52A, and 53A are generated from each of the cut out observation signals 51, 52, and 53, respectively.
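A minimal sketch of the per-channel STFT described above is given below; the frame length and hop size are illustrative assumptions.

```python
import numpy as np

def stft_one_channel(x, frame_len=1024, hop=256):
    """Cut out frames with a hop, apply a Hanning window, and FFT each frame
    to obtain the spectra x_k(1..F, t); side by side they form a spectrogram."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    spec = np.empty((frame_len // 2 + 1, n_frames), dtype=complex)
    for t in range(n_frames):
        frame = x[t * hop:t * hop + frame_len] * window   # windowed frame
        spec[:, t] = np.fft.rfft(frame)                   # F frequency bins
    return spec
```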
  • preprocessing is performed by the preprocessing unit 17A.
• As the preprocessing, there is the decorrelation (whitening) represented by equations (3) to (6).
• Some update formulas used in filter estimation perform special processing only on the first iteration, and such processing is also performed as preprocessing. Then, the process proceeds to step ST32.
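The decorrelation (whitening) preprocessing could look like the following sketch, in which the observation vectors are transformed so that their spatial covariance becomes the identity matrix for each frequency bin. Since equations (3) to (6) are not reproduced in this text, this standard eigenvalue-decomposition whitening is an assumption about their typical form.

```python
import numpy as np

def decorrelate(obs):
    """obs: complex observation, shape (n_mics, n_freq, n_frames).
       Returns the decorrelated (whitened) observation of the same shape."""
    n_mics, n_freq, n_frames = obs.shape
    u = np.empty_like(obs)
    for f in range(n_freq):
        x = obs[:, f, :]                                   # (n_mics, n_frames)
        cov = x @ x.conj().T / n_frames                    # spatial covariance
        eigval, eigvec = np.linalg.eigh(cov)
        whiten = eigvec @ np.diag(1.0 / np.sqrt(eigval + 1e-12)) @ eigvec.conj().T
        u[:, f, :] = whiten @ x                            # decorrelated signal
    return u
```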
• In step ST32, a process of estimating the extraction filter is performed. Then, the process proceeds to step ST33. Steps ST32 and ST33 represent the iteration for estimating the extraction filter. Except when the TFVV Gaussian distribution of equation (32) is used as the sound source model, the extraction filter cannot be obtained in closed form; therefore, the processing of step ST32 is repeated until the extraction filter and the extraction result converge, or a predetermined number of times.
  • the extraction filter estimation process according to step ST32 is a process for obtaining the extraction filter w_1 (f), and the specific formula differs for each sound source model.
• When the TFVV Gaussian distribution is used as the sound source model, the reference signal r(f, t) and the decorrelated observation signal u(f, t) are used on the right side of equation (35), and the extraction filter is obtained by applying the Hermitian transpose to the eigenvector corresponding to the smallest eigenvalue, as in Eq. (36).
• When the TFVV Laplace distribution of equation (31) is used as the sound source model, first, the auxiliary variable b(f, t) is calculated according to equation (40) using the reference signal r(f, t) and the decorrelated observation signal u(f, t). Next, the weighted covariance matrix on the right side of equation (42) is calculated, and eigenvalue decomposition is applied to it to obtain the eigenvectors. Finally, the extraction filter w_1(f) is obtained by equation (36). Since w_1(f) at this point has not yet converged, the process returns to equation (40) and the auxiliary variable is calculated again. These steps are executed until w_1(f) converges, or a predetermined number of times.
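The iterative loop just described (auxiliary variable, weighted covariance matrix, eigenvector of the smallest eigenvalue, Hermitian transpose) could be sketched as follows. Because equations (31), (36), (40), and (42) are not reproduced here, the concrete weight formula below is an illustrative assumption rather than the exact update rule of this disclosure.

```python
import numpy as np

def estimate_filter_tfvv_laplace(u_f, r_f, n_iter=20):
    """u_f: decorrelated observations for one bin, shape (n_mics, n_frames).
       r_f: reference amplitudes for the same bin, shape (n_frames,)."""
    n_mics, n_frames = u_f.shape
    w = np.zeros(n_mics, dtype=complex)
    w[0] = 1.0                                              # initial filter
    for _ in range(n_iter):
        y = w.conj() @ u_f                                  # current extraction
        # auxiliary variable and weights: illustrative assumption only
        b = np.abs(y) / (r_f + 1e-12) + 1e-12
        weights = 1.0 / b
        cov = (u_f * weights) @ u_f.conj().T / n_frames     # weighted covariance
        eigval, eigvec = np.linalg.eigh(cov)
        w = eigvec[:, 0]                                    # smallest eigenvalue
    return w.conj()   # the extraction filter is the Hermitian transpose (row form)
```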
• The process proceeds to step ST34 after the extraction filter converges or the iteration has been performed a predetermined number of times.
  • step ST34 post-processing is performed by the post-processing unit 17C.
  • the extraction result is rescaled.
  • the inverse Fourier transform is performed as necessary to generate a waveform in the time domain.
  • Rescaling is a process of adjusting the scale of each frequency bin of the extraction result.
• In estimating the extraction filter, the norm of the filter is constrained to 1 in order to apply an efficient algorithm, but the extraction result generated by applying the extraction filter under this constraint has a scale different from that of the ideal target sound. Therefore, the scale of the extraction result is adjusted using the observation signal before decorrelation.
• The rescaling coefficient can be obtained as the value that minimizes the following equation (61), and its specific form is as shown in equation (62). x_i(f, t) in these equations is the observation signal (before decorrelation) that is the target of rescaling; how to select x_i(f, t) will be described later. The coefficient thus obtained is multiplied by the extraction result as shown in the following equation (63).
  • the extraction result y_1 (f, t) after rescaling corresponds to the component derived from the target sound in the observation signal of the i-th microphone. That is, it is almost equal to the signal observed by the i-th microphone when there is no sound source other than the target sound. Further, if necessary, the waveform of the extraction result is obtained by applying the inverse Fourier transform to the rescaled extraction result. As described above, the inverse Fourier transform can be omitted depending on the post-stage processing.
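A minimal sketch of the rescaling step is shown below; the closed-form coefficient is the standard least-squares solution that matches the scaled extraction result to the observation x_i(f, t), used here as an illustrative stand-in for equations (61) to (63).

```python
import numpy as np

def rescale(y_f, x_i_f):
    """y_f: extraction result for one bin, shape (n_frames,).
       x_i_f: observation of microphone i for the same bin, shape (n_frames,)."""
    coef = np.vdot(y_f, x_i_f) / (np.vdot(y_f, y_f) + 1e-12)  # least-squares fit
    return coef * y_f                                          # rescaled result
```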
• Next, the selection of the observation signal x_i(f, t) that is the target of rescaling is described. This depends on how the microphones are installed. Depending on the microphone installation form, there may be a microphone that picks up the target sound strongly. For example, in the installation form shown in FIG. 5, since a microphone is assigned to each speaker, the utterance of speaker i is picked up most strongly by microphone i. Therefore, the observation signal x_i(f, t) of microphone i can be used as the target of rescaling.
  • the rescaling target needs to be found by another method.
• In the following, a case where the microphones constituting the microphone array are fixed to one device and a case where the microphones are installed in a space (distributed microphones) are described.
• When the microphones are fixed to one device, the SN ratio (the power ratio between the target sound and the other signals) is considered to be almost the same for every microphone, so the observation signal of any microphone may be selected as the rescaling target x_i(f, t).
  • rescaling using delay and sum which is used in the technique described in Japanese Patent Application Laid-Open No. 2014-219467, can also be applied.
  • the utterance direction ⁇ is estimated at the same time in addition to the utterance section.
• Using the signal observed by the microphone array and the utterance direction θ, it is possible to generate, by delay-and-sum, a signal in which the sound coming from that direction is emphasized to some extent. If the result of performing delay-and-sum with respect to the direction θ is written as z(f, t, θ), the rescaling coefficient is calculated by the following equation (64).
• When the microphone array consists of distributed microphones, another method is used.
  • the signal-to-noise ratio of the observed signal differs from microphone to microphone, and it is expected that the signal-to-noise ratio will be high for microphones close to the speaker and low for microphones far away. Therefore, it is desirable to select a microphone that is close to the speaker as the observation signal that is the target of rescaling. Therefore, the observation signal of each microphone is rescaled, and the one that maximizes the power of the rescaling result is adopted.
• The magnitude of the power of the rescaling result is determined only by the magnitude of the absolute value of the rescaling coefficient. Therefore, the rescaling coefficient is calculated for each microphone number i by the following equation (65), the coefficient with the maximum absolute value is selected, and the rescaling is performed with it by the following equation (66).
  • ⁇ _ ⁇ max ⁇ it is also known which microphone picks up the speaker's utterance most. If the position of each microphone is known, it is possible to know about where the speaker is located in the space, and that information can be used in the subsequent processing.
• For example, the voice of the response from the dialogue system may be output from the loudspeaker presumed to be closest to the speaker.
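For the distributed-microphone case described above, the selection of the rescaling coefficient with the maximum absolute value could be sketched as follows; the least-squares form of the coefficient is an illustrative assumption standing in for equations (65) and (66).

```python
import numpy as np

def rescale_distributed(y_f, obs_f):
    """y_f: extraction result, shape (n_frames,).
       obs_f: observations of all microphones, shape (n_mics, n_frames)."""
    coefs = np.array([np.vdot(y_f, obs_f[i]) / (np.vdot(y_f, y_f) + 1e-12)
                      for i in range(obs_f.shape[0])])
    i_max = int(np.argmax(np.abs(coefs)))       # microphone closest to the speaker
    return coefs[i_max] * y_f, i_max
```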
  • the following effects can be obtained.
  • the multi-channel observation signal of the section where the target sound is sounding and the rough amplitude spectrogram of the target sound in the section are input, and the rough amplitude spectrogram is used as the reference signal.
  • the extraction result with higher accuracy than the reference signal, that is, closer to the true target sound is estimated.
• An objective function that reflects both the dependency between the reference signal and the extraction result and the independence between the extraction result and other virtual separation results is prepared, and the extraction filter is obtained as a solution that optimizes it.
  • the output signal can be limited to one sound source corresponding to the reference signal.
• The reference signal in the technique described in JP-A-2014-219467 and the like is a time envelope of the target sound, and it was assumed that the change in the time direction was common to all frequency bins. In contrast, the reference signal of this embodiment is an amplitude spectrogram; therefore, improvement in extraction accuracy can be expected when the change of the target sound in the time direction differs greatly for each frequency bin. Also, since the reference signal in the technique described in the above document was used only as the initial value of the iteration, there was a possibility that a sound source different from the reference signal would be extracted as a result of the iteration.
• In this embodiment, since the reference signal is used throughout the iteration as part of the sound source model, it is unlikely that a sound source different from the reference signal will be extracted.
• Compared with IDLMA (independent deeply learned matrix analysis): IDLMA cannot be applied when there is an unknown sound source, because it is necessary to prepare a different reference signal for each sound source. Moreover, it can be applied only when the number of microphones and the number of sound sources match.
• In contrast, the present embodiment is applicable as long as the reference signal of the single sound source to be extracted can be prepared.
• As a modification, the decorrelation and the filter estimation can be combined into one formula by using generalized eigenvalue decomposition. In that case, the process corresponding to decorrelation can be skipped.
• Let q_1(f) be a filter that directly generates the extraction result from the observation signal before decorrelation (without going through the decorrelated observation signal).
• Starting from equation (34), which represents the optimization problem corresponding to the TFVV Gaussian distribution, and using equation (67) together with equations (3) to (6), the optimization problem for q_1(f) is obtained as equation (68).
  • This equation is a constrained minimization problem different from equation (34), but it can be solved by using Lagrange's undetermined multiplier method. If the Lagrange undetermined multiplier is ⁇ and the equations to be optimized in Eq. (68) and the equations representing the constraints are put together to create an objective function, it can be written as Eq. (69) below.
  • Equation (70) represents a generalized eigenvalue problem, where ⁇ is one of the eigenvalues. Further, by multiplying both sides of the equation (70) by q_1 (f) from the left, the following equation (71) is obtained.
• The right side of equation (71) is the very function to be minimized in equation (68). Therefore, the minimum value of equation (71) is the smallest of the eigenvalues satisfying equation (70), and the extraction filter q_1(f) to be obtained is the Hermitian transpose of the eigenvector corresponding to that minimum eigenvalue.
• A function that takes two matrices A and B as arguments, solves the generalized eigenvalue problem for the two matrices, and returns all the eigenvectors is written as gev(A, B). Using this function, the eigenvectors of equation (70) can be written as in equation (72) below.
• v_{min}(f), ..., v_{max}(f) in equation (72) are the eigenvectors, and v_{min}(f) is the eigenvector corresponding to the minimum eigenvalue.
  • the extraction filter q_1 (f) is the Hermitian transpose of v_ ⁇ min ⁇ (f) as in equation (73).
• The extraction filter q_1(f) is the Hermitian transpose of the eigenvector v_{min}(f) corresponding to the minimum eigenvalue (equation (73)). Since q_1(f) does not converge in a single pass, equations (74) to (75) and (73) are executed repeatedly until convergence, or a predetermined number of times.
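A minimal sketch of this modification is shown below: gev(A, B) is realized with a generalized eigenvalue solver, and the filter is taken from the eigenvector of the smallest eigenvalue. The choice of A as a reference-weighted covariance matrix and B as the plain covariance matrix is an illustrative assumption consistent with the TFVV Gaussian case, not a literal transcription of equations (67) to (73).

```python
import numpy as np
from scipy.linalg import eigh

def estimate_filter_gev(x_f, r_f):
    """x_f: raw (not decorrelated) observations for one bin, (n_mics, n_frames).
       r_f: reference amplitudes for the same bin, (n_frames,)."""
    n_frames = x_f.shape[1]
    a = (x_f / (r_f ** 2 + 1e-12)) @ x_f.conj().T / n_frames  # weighted covariance
    b = x_f @ x_f.conj().T / n_frames                          # plain covariance
    eigvals, eigvecs = eigh(a, b)              # generalized eigenvalue problem
    q = eigvecs[:, 0]                          # eigenvector of the smallest eigenvalue
    return q.conj()                            # extraction filter q_1(f)^H
```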
  • the present disclosure may also adopt the following configuration.
• (1) A signal processing device including: a reference signal generation unit to which a mixed sound signal, recorded by microphones placed at different positions and containing a mixture of the target sound and sounds other than the target sound, is input, and which generates a reference signal corresponding to the target sound based on the mixed sound signal; and a sound source extraction unit that extracts, from the mixed sound signal, a signal that is similar to the reference signal and in which the target sound is more emphasized.
  • the signal processing device according to (1) which has a section detection unit that detects a section in which the target sound is included in the mixed sound signal.
• The signal processing device according to (1) or (2), wherein the sound source extraction unit has an extraction filter estimation unit that estimates a filter that extracts a signal in which the target sound is more emphasized.
• The signal processing device according to (3), wherein the extraction filter estimation unit estimates the filter as a solution that optimizes an objective function reflecting the dependency between the reference signal and the extraction result by the filter and the independence between the extraction result and the separation results of other virtual sound sources.
• (5) As the sound source model that represents the dependency between the reference signal and the extraction result included in the objective function, any of the following is used: a bivariate spherical distribution of the extraction result and the reference signal; a time-frequency-varying variance model that regards the reference signal as a value corresponding to the variance for each time-frequency bin; or a model that uses the divergence between the absolute value of the extraction result and the reference signal.
  • the sound source extraction unit A pre-processing unit that performs uncorrelated processing on the time-frequency domain observation signal as pre-processing for processing by the extraction filter estimation unit, and a pre-processing unit.
  • the signal processing apparatus according to any one of (3) to (8), which has at least a post-processing unit for applying the filter to the mixed sound signal.
• The signal processing device according to any one of (1) to (9), wherein the reference signal generation unit is provided with a neural network that extracts a speaker's voice given, as inputs, a signal in which voices are mixed and a clean voice of a predetermined speaker acquired at a timing different from that signal; the mixed sound signal and the clean voice are input to the neural network, and an amplitude spectrogram generated from the output of the neural network is generated as the reference signal.
• The signal processing apparatus according to any one of (1) to (9), wherein the reference signal generation unit estimates the arrival direction of the target sound, generates a time-frequency mask having the effect of leaving the sound arriving from a predetermined direction and reducing the sounds arriving from other directions, and generates, as the reference signal, the amplitude spectrogram obtained by applying the time-frequency mask to the amplitude spectrogram of the mixed sound signal.
  • the reference signal generator The signal processing device according to any one of (1) to (11), which generates the reference signal by using a sensor different from the microphone.
  • the reference signal generator The signal processing apparatus according to any one of (1) to (12), which generates a reference signal by inputting an extraction result by a filter estimated by the extraction filter estimation unit into a neural network.
  • the signal processing device according to any one of (1) to (13), wherein the microphone is a microphone assigned to each speaker.
  • the microphone is a microphone worn by a speaker.
  • a mixed sound signal that is recorded by microphones placed at different positions and is a mixture of the target sound and sounds other than the target sound is input.
• the reference signal generation unit generates a reference signal corresponding to the target sound based on the mixed sound signal, and
  • a signal processing method in which a sound source extraction unit extracts a signal similar to the reference signal from the mixed sound signal and in which the target sound is more emphasized.
  • a mixed sound signal that is recorded by microphones placed at different positions and is a mixture of the target sound and sounds other than the target sound is input.
• the reference signal generation unit generates a reference signal corresponding to the target sound based on the mixed sound signal, and the sound source extraction unit extracts, from the mixed sound signal, a signal that is similar to the reference signal and in which the target sound is more emphasized; a program that causes a computer to execute this signal processing method.
• ... Section estimation unit, 16 ... Reference signal estimation unit, 17 ... Sound source extraction unit, 17A ... Pre-processing unit, 17B ... Extraction filter estimation unit, 17C ... Post-processing unit, 20 ... Control unit, 100 ... Sound source extraction device


Abstract

Provided is a signal processing device comprising: a reference signal generation unit to which a mixed sound signal picked up by microphones disposed at different positions and obtained by mixing a target sound and sounds other than the target sound is inputted, and which generates a reference signal corresponding to the target sound on the basis of the mixed sound signal; and a sound source extraction unit which extracts, from the mixed sound signal, a signal which is similar to the reference signal and in which the target sound is further enhanced.

Description

Signal processing device, signal processing method, and program
The present disclosure relates to a signal processing device, a signal processing method, and a program.
Techniques have been proposed for extracting a target sound from a mixed sound signal in which a sound to be extracted (hereinafter referred to as the target sound as appropriate) and a sound to be removed (hereinafter referred to as the interfering sound as appropriate) are mixed (see, for example, Patent Documents 1 to 3 below).
Japanese Unexamined Patent Application Publication No. 2006-72163
Japanese Patent No. 4449871
Japanese Unexamined Patent Application Publication No. 2014-219467
In such fields, it is desired to improve the accuracy of extracting the target sound.
One of the purposes of the present disclosure is to provide a signal processing device, a signal processing method, a program, and a signal processing system with improved accuracy of extracting the target sound.
The present disclosure is, for example,
A mixed sound signal that is recorded by microphones placed at different positions and is a mixture of the target sound and sounds other than the target sound is input.
A reference signal generator that generates a reference signal corresponding to the target sound based on the mixed sound signal,
It is a signal processing device having a sound source extraction unit that extracts a signal that is similar to a reference signal from a mixed sound signal and has a more emphasized target sound.
The present disclosure is, for example,
A mixed sound signal that is recorded by microphones placed at different positions and is a mixture of the target sound and sounds other than the target sound is input.
The reference signal generator generates a reference signal corresponding to the target sound based on the mixed sound signal.
This is a signal processing method in which the sound source extraction unit extracts a signal that is similar to the reference signal and has a more emphasized target sound from the mixed sound signal.
The present disclosure is, for example,
A mixed sound signal that is recorded by microphones placed at different positions and is a mixture of the target sound and sounds other than the target sound is input.
The reference signal generator generates a reference signal corresponding to the target sound based on the mixed sound signal.
This is a program that causes a computer to execute a signal processing method in which the sound source extraction unit extracts, from the mixed sound signal, a signal that is similar to the reference signal and in which the target sound is more emphasized.
FIG. 1 is a diagram for explaining an example of the sound source separation process of the present disclosure.
FIG. 2 is a diagram for explaining an example of a sound source extraction method using a reference signal based on the deflation method.
FIG. 3 is a diagram referred to when explaining the process of generating a reference signal for each section and then performing sound source extraction.
FIG. 4 is a block diagram showing a configuration example of the sound source extraction device according to an embodiment.
FIG. 5 is a diagram referred to when explaining an example of section estimation and reference signal generation processing.
FIG. 6 is a diagram referred to when explaining another example of section estimation and reference signal generation processing.
FIG. 7 is a diagram referred to when explaining another example of section estimation and reference signal generation processing.
FIG. 8 is a diagram referred to when explaining the details of the sound source extraction unit according to the embodiment.
FIG. 9 is a flowchart referred to when explaining the overall processing flow performed by the sound source extraction device according to the embodiment.
FIG. 10 is a diagram referred to when explaining the processing performed by the STFT unit according to the embodiment.
FIG. 11 is a flowchart referred to when explaining the flow of the sound source extraction process according to the embodiment.
Hereinafter, embodiments and the like of the present disclosure will be described with reference to the drawings. The explanation will be given in the following order.
<Outline of this disclosure, background, and issues to be considered>
<Technology used in this disclosure>
<One Embodiment>
<Modification example>
The embodiments and the like described below are suitable specific examples of the present disclosure, and the contents of the present disclosure are not limited to these embodiments and the like.
[Notation in this specification]
(Formula notation)
In the following, the mathematical formula will be described according to the following notation.
・ "_" Represents a subscript character.
(Example) X_k ・ ・ ・ "k" is a subscript character.
-If there are multiple subscript characters, enclose them in "{...}".
(Example) R_ {xx} ・ ・ ・ "xx" is a subscript character.
・ "^" Represents a superscript.
(Example) W^H ... the Hermitian transpose (complex conjugate transpose) of the matrix W; y_k(f, t)^H ... the Hermitian transpose (conjugate and transpose) of the vector y_k(f, t); A^{-1} ... the inverse of the matrix A.
・conj(X) represents the complex conjugate of the complex number X. In the equations, the complex conjugate of X is represented by an overline on X.
・hat(x) means putting "^" on top of x.
・Assignment of a value is represented by "=" or "←". In particular, operations for which the equal sign does not hold on both sides (for example, "x ← x + 1") are always represented by "←".
・Matrices are shown in uppercase, and vectors and scalars in lowercase. Matrices and vectors are shown in bold, and scalars in italics.
(Definition of terms)
In this specification, "sound (signal)" and "voice (signal)" are used properly. "Sound" is used in a general sense such as sound and audio, and "voice" is used as a term for voice and speech.
In addition, "separation" and "extraction" are used properly as follows. "Separation" is the opposite of mixing, and is used as a term meaning that a signal obtained by mixing a plurality of original signals is divided into each original signal (there are multiple inputs and outputs). "Extraction" is used as a term meaning to extract one original signal from a signal in which a plurality of original signals are mixed. (There are multiple inputs, but one output.)
"Applying a filter" and "performing filtering" have the same meaning, and similarly, "applying a mask" and "performing a masking" have the same meaning.
<Outline of this disclosure, background, and issues to be considered>
First, in order to facilitate the understanding of the present disclosure, the outline, background, and issues to be considered in the present disclosure will be described.
(Summary of this disclosure)
The present disclosure concerns sound source extraction using a reference signal. In addition to recording, with a plurality of microphones, a signal in which the sound to be extracted (target sound) and the sound to be erased (interfering sound) are mixed, a "rough" amplitude spectrogram corresponding to the target sound is generated, and by using that amplitude spectrogram as a reference signal, the signal processing device produces an extraction result that is similar to the reference signal and more accurate than it. That is, one form of the present disclosure is a signal processing device that extracts, from a mixed sound signal, a signal that is similar to the reference signal and in which the target sound is more emphasized.
In the processing performed by the signal processing device, an objective function is prepared that reflects both the dependency (similarity) between the reference signal and the extraction result and the independence between the extraction result and other virtual separation results, and the extraction filter is obtained as a solution that optimizes it. By using the deflation method employed in blind sound source separation, the output signal can be limited to the single sound source corresponding to the reference signal. Since it can be regarded as a beamformer that considers both dependence and independence, it is hereinafter referred to as the Similarity-and-Independence-aware Beamformer (SIBF) as appropriate.
(background)
The present disclosure concerns sound source extraction using a reference signal. In addition to recording, with a plurality of microphones, a signal in which the sound to be extracted (target sound) and the sound to be erased (interfering sound) are mixed, a "rough" amplitude spectrogram corresponding to the target sound is acquired or generated, and by using that amplitude spectrogram as a reference signal, an extraction result that is similar to the reference signal and more accurate than it is produced.
The usage situation assumed in the present disclosure shall satisfy all of the following conditions (1) to (3), for example.
(1) Observation signals are recorded synchronously by a plurality of microphones.
(2) It is assumed that the section in which the target sound is sounding, that is, the time range is known, and the above-mentioned observation signal includes at least that section.
(3) As a reference signal, it is assumed that a rough amplitude spectrogram corresponding to the target sound (rough target sound spectrogram) has been acquired, or can be generated from the above-mentioned observation signal.
Supplement each of the above conditions.
Under the condition (1) above, each microphone may or may not be fixed, and the position of each microphone and sound source may be unknown in either case. An example of a fixed microphone is a microphone array, and an example of a non-fixed microphone is a case where each speaker wears a pin microphone or the like.
Under condition (2) above, the section in which the target sound is sounding is, for example, the utterance section in the case of extracting the voice of a specific speaker. While the section is known, it is unknown whether or not the target sound is sounding outside the section. That is, the assumption that no target sound exists outside the section may not hold.
In (3) above, a rough target sound spectrogram means a spectrogram that is degraded compared with the spectrogram of the true target sound in that it meets one or more of the following conditions a) to f).
a) Real number data that does not include phase information.
b) Although the target sound is predominant, the disturbing sound is also included.
c) The disturbing sound is almost eliminated, but the sound is distorted as a side effect.
d) The resolution is lower than that of the true target sound spectrogram in either or both of the time direction and the frequency direction.
e) The spectrogram amplitude scale is different from the observed signal, and the size comparison is meaningless. For example, even if the amplitude of the rough target sound spectrogram is half the amplitude of the observed signal spectrogram, it does not mean that the target sound and the disturbing sound are included in the observed signal with the same magnitude.
f) Amplitude spectrogram generated from a signal other than sound.
The rough target sound spectrogram as described above is acquired or generated by, for example, the following method.
-Record the sound with a microphone installed near the target sound (for example, a pin microphone attached to the speaker), and obtain the amplitude spectrogram from it. (Corresponds to the example of b above)
-A neural network (NN) that extracts a specific type of sound in the amplitude spectrogram region is learned in advance, and an observation signal is input to the neural network (NN). (Equivalent to a, c, e above)
-Amplitude spectrogram is obtained from a signal acquired by a sensor other than the normally used air conduction microphone such as a bone conduction microphone. (Equivalent to c above)
-A spectrogram in the linear frequency domain is generated by applying a predetermined conversion to the spectrogram-equivalent data calculated in the non-linear frequency domain such as the mel frequency. (Equivalent to a, d, e above)
-Instead of a microphone, use a sensor that can observe the vibration of the skin surface near the speaker's mouth and throat, and obtain the amplitude spectrogram from the signal acquired by that sensor. (Equivalent to d, e, f above)
One object of the present disclosure is to use the rough target sound spectrogram acquired or generated in this way as a reference signal, and to generate an extraction result whose accuracy exceeds that of the reference signal (in which the target sound is further emphasized, in other words, which is closer to the true target sound). More specifically, in the sound source extraction process in which a linear filter is applied to the multi-channel observation signal to generate the extraction result, a linear filter that generates an extraction result whose accuracy exceeds that of the reference signal (closer to the true target sound) is estimated.
In the present disclosure, the reason for estimating the linear filter for the sound source extraction process is to enjoy the following advantages of the linear filter.
Advantage 1: The distortion of the extraction result is small compared to the non-linear extraction process. Therefore, when combined with voice recognition or the like, it is possible to avoid a decrease in recognition accuracy due to distortion.
Advantage 2: The phase of the extraction result can be appropriately estimated by the rescaling process described later. Therefore, it is possible to avoid a problem caused by an inappropriate phase when combined with a phase-dependent post-stage processing (including a case where the extraction result is reproduced as a sound and a human hears it).
Advantage 3: By increasing the number of microphones, it is easy to improve the extraction accuracy.
(Issues to be considered in this disclosure)
One of the purposes of the present disclosure will be described again as follows.
Purpose: Estimate a linear filter to generate extraction results with higher accuracy than the signal of c), assuming that the following conditions a) to c) are met.
a) There is a signal recorded by a multi-channel microphone. The arrangement of microphones and the position of each sound source may be unknown.
b) The section in which the target sound (the sound to be retained) is sounding is known. However, it is unknown whether the target sound exists outside the section.
c) A rough amplitude spectrogram (or similar data) of the target sound can be acquired or generated. The amplitude spectrogram is real and the phase is unknown.
However, a linear filtering method that satisfies all of the above three conditions has not existed in the past. The following three types are mainly known as general linear filtering methods.
- Adaptive beamformer
- Blind sound source separation
- Existing linear filtering processes using a reference signal
The problems of each method are described below.
(Problems of adaptive beam former)
The adaptive beamformer referred to here is a method that adaptively estimates a linear filter for extracting the target sound, using the signals observed by a plurality of microphones and information indicating which sound source is to be extracted as the target sound. Examples of the adaptive beamformer include the methods described in JP-A-2012-234150 and JP-A-2006-072163.
Below, the SN ratio (signal-to-noise ratio) maximizing beamformer (also known as the GEV beamformer) is described as an adaptive beamformer that can be used even when the placement of the microphones and the direction of the target sound are unknown.
The SN ratio maximizing beamformer (maximum SNR beamformer) is a method for finding a linear filter that maximizes the ratio V_s / V_n between a) and b) below.
a) The variance V_s of the result of applying a given linear filter to a section in which only the target sound is sounding
b) The variance V_n of the result of applying the same linear filter to a section in which only the interfering sound is sounding
With this method, a linear filter can be estimated as long as each of these two kinds of sections can be detected, and neither the placement of the microphones nor the direction of the target sound is required.
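A minimal sketch of the maximum-SNR (GEV) beamformer described above is given below for one frequency bin: the filter maximizing V_s / V_n is the eigenvector belonging to the largest generalized eigenvalue of the pair of covariance matrices computed from a target-only section and an interference-only section. Variable names are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def max_snr_beamformer(x_target_only, x_noise_only):
    """Both inputs: complex observations for one frequency bin,
       shape (n_mics, n_frames)."""
    r_s = x_target_only @ x_target_only.conj().T / x_target_only.shape[1]
    r_n = x_noise_only @ x_noise_only.conj().T / x_noise_only.shape[1]
    eigvals, eigvecs = eigh(r_s, r_n)          # generalized eigenvalue problem
    return eigvecs[:, -1]                      # filter for the largest eigenvalue
```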
However, in the situations assumed by the present disclosure, the only known section is the timing at which the target sound is sounding. Since both the target sound and the interfering sound are present in that section, it cannot be used as either section a) or section b) above. Other adaptive beamformer methods are also difficult to use in the situations to which the present disclosure applies, because the section of b) above is required separately, or because the direction of the target sound must be known.
(Problems of blind separation)
Blind sound source separation is a technique for estimating each sound source from a signal in which a plurality of sound sources are mixed, using only the signals observed by a plurality of microphones (without using information such as the directions of the sound sources or the arrangement of the microphones). An example of such a technique is the technique of Japanese Patent No. 4449871, which is an example of a technique called Independent Component Analysis (hereinafter referred to as ICA as appropriate); ICA decomposes the signals observed by N microphones into N sound sources. The observation signal used at that time only needs to include the section in which the target sound is sounding, and information on sections in which only the target sound or only the interfering sound is sounding is unnecessary.
Therefore, by applying ICA to the observation signal of the section in which the target sound is sounding, decomposing it into N components, and then selecting the single component that is most similar to the rough target sound spectrogram serving as the reference signal, ICA can be used in the situations to which the present disclosure applies. As a method of judging similarity, each separation result is converted into an amplitude spectrogram, the squared error (Euclidean distance) between each amplitude spectrogram and the reference signal is calculated, and the separation result corresponding to the amplitude spectrogram with the smallest error is adopted.
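The selection step just described could be sketched as follows: each separation result is converted to an amplitude spectrogram, and the one with the smallest squared error to the reference is chosen.

```python
import numpy as np

def select_by_reference(separated_specs, reference):
    """separated_specs: complex spectrograms, shape (n_sources, n_freq, n_frames).
       reference: rough target amplitude spectrogram, shape (n_freq, n_frames)."""
    errors = [np.sum((np.abs(s) - reference) ** 2) for s in separated_specs]
    return int(np.argmin(errors))              # index of the most similar source
```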
However, this select-after-separation approach has the following problems.
1) Even though only one sound source is desired, N sound sources are generated in the intermediate step, which is disadvantageous in terms of computational cost and memory usage.
2) The rough target-sound spectrogram used as the reference signal is used only in the step of selecting one sound source out of the N sound sources, not in the step of separating the mixture into N sound sources. Consequently, the reference signal does not contribute to improving extraction accuracy.
(Problems of existing linear filtering processes that use a reference signal)
Several methods that estimate a linear filter using a reference signal already exist. Here, the following two are discussed as such techniques:
a) independent deeply learned matrix analysis;
b) sound source extraction using a time envelope as the reference signal.
Independent Deeply Learned Matrix Analysis (hereinafter referred to as IDLMA as appropriate) is an extension of independent component analysis. For details, see Reference 1 below.
(Reference 1)
N. Makishima et al., "Independent Deeply Learned Matrix Analysis for Determined Audio Source Separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 10, pp. 1601-1615, Oct. 2019. doi: 10.1109/TASLP.2019.2925450
A distinguishing feature of IDLMA is that a neural network (NN) that generates the power spectrogram (the square of the amplitude spectrogram) of each sound source to be separated is trained in advance. For example, to separate the individual instrument parts from a piece of music in which multiple instruments are played simultaneously, NNs that take the piece as input and output each instrument's sound are trained in advance. At separation time, the observation signal is input to each NN, and the output power spectrograms are used as reference signals for the separation. Compared with fully blind separation, an improvement in separation accuracy corresponding to the use of the reference signals can therefore be expected. Furthermore, it has been reported that by feeding a once-generated separation result back into each NN, power spectrograms more accurate than the initial ones are obtained, and performing separation with those as reference signals yields separation results more accurate than the first pass.
However, it is difficult to use IDLMA in situations where the present disclosure is applicable, for the following reasons.
IDLMA requires N different power spectrograms as reference signals in order to generate N separation results. Therefore, even if only one sound source is of interest and the others are unnecessary, reference signals must be prepared for all sound sources, which may be difficult in practice. In addition, Reference 1 above addresses only the case in which the number of microphones equals the number of sound sources, and does not mention how many reference signals should be prepared when the two numbers differ. Moreover, since IDLMA is a sound source separation method, using it for sound source extraction requires a step that first generates N separation results and then keeps only one. The sound source separation drawback of wasted computational cost and memory usage therefore remains.
Sound source extraction using a time envelope as the reference signal includes, for example, the technique described in Japanese Patent Application Laid-Open No. 2014-219467, proposed by the present inventor. Like the present disclosure, this method estimates a linear filter using a reference signal and a multi-channel observation signal. However, it differs in the following respects.
- The reference signal is a time envelope, not a spectrogram. It corresponds to a rough target-sound spectrogram flattened by applying an operation such as averaging in the frequency direction. Therefore, when the target sound has the characteristic that its temporal variation differs per frequency, the reference signal cannot represent this adequately, and as a result the extraction accuracy may degrade.
- The reference signal is reflected only as an initial value in the iterative process for obtaining the extraction filter. Since the second and subsequent iterations are not constrained by the reference signal, a sound source different from the reference signal may be extracted. For example, when a sound that occurs only momentarily exists within the section, extracting that sound is more optimal in terms of the objective function, so an unintended sound may be extracted depending on the number of iterations.
As described above, the techniques discussed so far are either difficult to use in situations where the present disclosure is applicable, or cannot provide extraction results of sufficient accuracy.
[Technology used in the present disclosure]
Next, the technology used in the present disclosure is described. By introducing both of the following elements into a blind sound source separation method based on independent component analysis, a sound source extraction technique suited to the purpose of the present disclosure can be realized.
Element 1: In the separation process, prepare an objective function that reflects not only the mutual independence of the separation results but also the dependence between one of the separation results and the reference signal, and optimize it.
Element 2: Also in the separation process, introduce a method called the deflation method, which separates sound sources one at a time, and terminate the separation process as soon as the first sound source has been separated.
The sound source extraction technology of the present disclosure extracts one desired sound source from multi-channel observation signals observed by a plurality of microphones by applying an extraction filter, which is a linear filter. It can therefore be regarded as a kind of beamformer (BF). In the extraction process, both the similarity between the reference signal and the extraction result and the independence between the extraction result and the other separation results are reflected. Accordingly, the sound source extraction method of the present disclosure is referred to as a Similarity-and-Independence-aware Beamformer (SIBF) as appropriate.
The separation process of the present disclosure is described with reference to FIG. 1. The area enclosed by the frame labeled (1-1) is the separation process assumed in conventional time-frequency-domain independent component analysis (Japanese Patent No. 4449871 and the like), while the elements (1-5) and (1-6) outside that frame are added by the present disclosure. In the following, conventional time-frequency-domain blind sound source separation is first explained using the contents of frame (1-1), and then the separation process of the present disclosure is explained.
In FIG. 1, X_1 to X_N are the observation signal spectrograms (1-2) corresponding to the N microphones. They consist of complex-valued data and are generated by applying the short-time Fourier transform, described later, to the sound waveform observed by each microphone. In each spectrogram, the vertical axis represents frequency and the horizontal axis represents time. The time length is equal to or longer than the duration during which the target sound to be extracted is active.
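As an illustration of how such observation signal spectrograms can be produced, the following is a minimal sketch using scipy's STFT; the frame length, hop size, and array layout are assumptions, and any short-time Fourier transform with equivalent parameters could be used.

```python
import numpy as np
from scipy.signal import stft

def observation_spectrograms(waveforms, fs, frame_len=1024, hop=256):
    """Compute X_1..X_N from multichannel waveforms.

    waveforms: array of shape (N, num_samples), one row per microphone (assumed layout)
    Returns a complex array of shape (N, F, T): the observation signal spectrograms.
    """
    _, _, X = stft(waveforms, fs=fs, nperseg=frame_len,
                   noverlap=frame_len - hop, axis=-1)
    return X  # X[k] is the spectrogram of microphone k (frequency x frame)
```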
In independent component analysis, the separation result spectrograms Y_1 to Y_N are generated by multiplying these observation signal spectrograms by a predetermined square matrix called the separation matrix, labeled (1-3) (1-4). The number of separation result spectrograms is N, the same as the number of microphones. In the separation, the values of the separation matrix are determined so that Y_1 to Y_N become statistically independent (that is, so that the differences among Y_1 to Y_N become as large as possible). Since such a matrix cannot be obtained in a single step, an objective function reflecting the mutual independence of the separation result spectrograms is prepared, and a separation matrix that makes this function optimal (maximum or minimum, depending on the nature of the objective function) is obtained iteratively. After the separation matrix and the separation result spectrograms have been obtained, applying the inverse Fourier transform to each separation result spectrogram yields waveforms that are estimates of the individual sound sources before mixing.
The above is the separation process of conventional independent component analysis in the time-frequency domain. The present disclosure adds the two elements described above to it.
One of the added elements is the dependence on the reference signal. The reference signal is a rough amplitude spectrogram of the target sound and is generated by the reference signal generation unit labeled (1-5). In the separation process, in addition to the mutual independence of the separation result spectrograms, the dependence between Y_1, one of the separation result spectrograms, and the reference signal R is also taken into account when determining the separation matrix. That is, both of the following are reflected in the objective function, and a separation matrix that optimizes that function is obtained:
a) the independence among Y_1 to Y_N (solid line L1);
b) the dependence between Y_1 and R (dotted line L2).
The concrete formula of the objective function is described later.
Reflecting both independence and dependence in the objective function provides the following advantages.
Advantage 1: In ordinary time-frequency-domain independent component analysis, it is indeterminate which original signal appears at which position among the separation result spectrograms; this varies with the initial value of the separation matrix, the degree of mixing in the observation signal (the signal corresponding to the mixed sound signal described later), and the algorithm used to obtain the separation matrix. In contrast, because the present disclosure also takes into account the dependence between the separation result Y_1 and the reference signal R in addition to independence, a spectrogram similar to R can always be made to appear in Y_1.
Advantage 2: Merely solving the problem of making Y_1, one of the separation results, similar to the reference signal R can bring Y_1 closer to R, but cannot exceed the reference signal R in terms of extraction accuracy (that is, get even closer to the target sound). In the present disclosure, because the mutual independence of the separation results is also taken into account, the extraction accuracy of the separation result Y_1 can exceed that of the reference signal.
However, even when the dependence on the reference signal is introduced into time-frequency-domain independent component analysis, it is still a separation method, so N signals are generated. That is, even if the only desired sound source is Y_1, N-1 signals are generated at the same time even though they are unnecessary.
Therefore, the deflation method is introduced as the other added element. The deflation method is a scheme that estimates the original signals one at a time instead of separating all sound sources simultaneously. For a general explanation of the deflation method, see, for example, Chapter 8 of Reference 2 below.
(Reference 2)
Aapo Hyvaerinen, Juha Karhunen, and Erkki Oja, "Independent Component Analysis" (Japanese edition: "詳解 独立成分分析―信号解析の新しい世界", translated by Iku Nemoto and Maki Kawakatsu)
In general, even with the deflation method the order of the separation results is indeterminate, so it is indeterminate in which position the desired sound source appears. However, when the deflation method is applied to sound source separation using an objective function that reflects both independence and dependence as described above, a separation result similar to the reference signal can always be made to appear first. That is, the separation process can be terminated as soon as the first sound source has been separated (estimated), and there is no need to generate the unnecessary N-1 separation results. Moreover, it is not necessary to estimate all elements of the separation matrix; only the elements needed to generate Y_1 need be estimated.
In the deflation method that estimates only one sound source, among the separation results labeled (1-4) in FIG. 1, those other than Y_1 (that is, Y_2 to Y_N) are virtual and are not actually generated. Nevertheless, the computation concerning independence is equivalent to what would be done using all the separation results Y_1 to Y_N. Therefore, the advantage of sound source separation that Y_1 can be made more accurate than R by taking independence into account is retained, while the waste of generating the unnecessary separation results Y_2 to Y_N is avoided.
The deflation method is a separation scheme (estimating all sound sources before mixing), but when the separation is stopped after one sound source has been estimated, it can be used as an extraction scheme (estimating one desired sound source). In the following description, the operation of estimating only the separation result Y_1 is therefore called "extraction", and Y_1 is referred to as the "(target sound) extraction result" as appropriate. Furthermore, each separation result is generated from a vector constituting the separation matrix labeled (1-3); this vector is referred to as the "extraction filter" as appropriate.
A sound source extraction scheme using a reference signal based on the deflation method is described with reference to FIG. 2. FIG. 2 shows the details of FIG. 1, with the elements required to apply the deflation method added.
The observation signal spectrograms labeled (2-1) in FIG. 2 are identical to (1-2) in FIG. 1 and are generated by applying the short-time Fourier transform to the time-domain signals observed by the N microphones. By applying to these observation signal spectrograms the process labeled (2-2), called decorrelation, the decorrelated observation signal spectrograms labeled (2-3) are generated. Decorrelation is also called whitening and is a transformation that makes the signals observed by the microphones uncorrelated with one another. The concrete formulas used in this process are described later. When decorrelation is performed as preprocessing for separation, efficient algorithms that exploit the properties of uncorrelated signals become applicable to the separation; the deflation method is one such algorithm.
The number of decorrelated observation signal spectrograms is the same as the number of microphones; they are denoted U_1 to U_N. The decorrelated observation signal spectrograms need to be generated only once, as a process prior to obtaining the extraction filter. As explained with reference to FIG. 1, in the deflation method, instead of estimating a matrix that generates the separation results Y_1 to Y_N simultaneously, the filters that generate the individual separation results are estimated one at a time. Since only Y_1 is generated in the present disclosure, the only filter to be estimated is w_1, which takes U_1 to U_N as input and generates Y_1; Y_2 to Y_N and w_2 to w_N are virtual and are not actually generated.
The reference signal R labeled (2-8) is identical to (1-6) in FIG. 1. As described above, in estimating the filter w_1, both the independence among Y_1 to Y_N and the dependence between R and Y_1 are taken into account.
In the sound source extraction method of the present disclosure, only one sound source is estimated (extracted) per section. Therefore, when there are multiple sound sources to be extracted, that is, multiple target sounds, and the sections in which they are active overlap, each overlapping section is detected, a reference signal is generated for each section, and sound source extraction is then performed per section. This point is described with reference to FIG. 3.
In the example shown in FIG. 3, the target sounds are human speech, and the number of target-sound sources, that is, the number of speakers, is two. Of course, the target sound may be any kind of sound, and the number of sound sources is not limited to two. It is also assumed that zero or more interfering sounds, which are not targets of extraction, are present. Non-speech signals are interfering sounds, and even speech is treated as an interfering sound if it is output from a device such as a loudspeaker.
The two speakers are referred to as speaker 1 and speaker 2. In FIG. 3, the utterances labeled (3-1) and (3-2) are utterances of speaker 1, and the utterances labeled (3-3) and (3-4) are utterances of speaker 2. (3-5) represents an interfering sound. In FIG. 3, the vertical axis represents the difference in sound source position and the horizontal axis represents time. The utterance sections of (3-1) and (3-3) partially overlap; this corresponds, for example, to speaker 2 starting to speak just before speaker 1 finishes speaking. Utterances (3-2) and (3-4) also overlap; this corresponds, for example, to speaker 2 making a short utterance such as a back-channel response while speaker 1 is speaking at length. Both are phenomena that frequently occur in conversations between people.
First, consider the extraction of utterance (3-1). Within the time range (3-6) in which utterance (3-1) is made, a total of three sound sources are present: speaker 1's utterance (3-1), part of speaker 2's utterance (3-3), and part of the interfering sound (3-5). In the present disclosure, extracting utterance (3-1) means generating (estimating), from the reference signal corresponding to utterance (3-1), that is, a rough amplitude spectrogram, and the observation signal of the time range (3-6) (a mixture of the three sound sources), a signal that is as close to clean as possible (consisting only of speaker 1's voice and containing none of the other sound sources).
Similarly, in extracting speaker 2's utterance (3-3), a signal close to speaker 2's clean speech is estimated using the reference signal corresponding to (3-3) and the observation signal of the time range (3-7). Thus, even when utterance sections overlap, the present disclosure can generate distinct extraction results as long as a reference signal corresponding to each target sound can be prepared.
Likewise, although the time range of speaker 2's utterance (3-4) is completely contained within speaker 1's utterance (3-2), distinct extraction results can be generated by preparing a separate reference signal for each. That is, to extract utterance (3-2), the reference signal corresponding to utterance (3-2) and the observation signal of the time range (3-8) are used; to extract utterance (3-4), the reference signal corresponding to utterance (3-4) and the observation signal of the time range (3-9) are used.
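To illustrate how this per-section processing might be organized, the following is a minimal sketch; the helper callables (make_reference, extract) and the section format are assumptions, since the present disclosure does not prescribe this orchestration.

```python
def extract_all_utterances(waveforms, fs, sections, make_reference, extract):
    """Per-section extraction as described for FIG. 3 (sketch; helpers are assumptions).

    waveforms:      multichannel observation, shape (N, num_samples)
    sections:       list of (start_sample, end_sample) for each detected target utterance
    make_reference: callable returning the rough amplitude spectrogram R for a section
    extract:        callable implementing the extraction for (observation slice, fs, R)
    """
    results = []
    for start, end in sections:
        segment = waveforms[:, start:end]        # observation signal of this section
        R = make_reference(segment, fs)          # reference signal for this target sound
        results.append(extract(segment, fs, R))  # one extraction result per section
    return results
```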
Next, the objective function used in estimating the filter and the algorithm for optimizing it are described using mathematical formulas.
The observation signal spectrogram X_k corresponding to the k-th microphone is expressed as a matrix whose elements are x_k(f, t), as shown in Equation (1) below.

  X_k = [ x_k(f, t) ]  (f = 1, ..., F; t = 1, ..., T)   ... (1)

In Equation (1), f is the frequency bin number and t is the frame number, both of which are indices produced by the short-time Fourier transform. In the following, varying f is referred to as the "frequency direction" and varying t as the "time direction".
The decorrelated observation signal spectrogram U_k and the separation result spectrogram Y_k are likewise expressed as matrices whose elements are u_k(f, t) and y_k(f, t), respectively (the formulas are omitted).
In addition, the vector x(f, t) whose elements are the observation signals of all microphones (all channels) at a specific f and t is expressed as in Equation (2) below.

  x(f, t) = [ x_1(f, t), ..., x_N(f, t) ]^T   ... (2)
For the decorrelated observation signal and the separation result, vectors u(f, t) and y(f, t) of the same shape are likewise prepared (the formulas are omitted).
Equation (3) below is the equation for obtaining the vector u(f, t) of the decorrelated observation signal.

  u(f, t) = P(f) x(f, t)   ... (3)

This vector is generated as the product of P(f), called the decorrelation matrix, and the observation signal vector x(f, t). The decorrelation matrix P(f) is computed by Equations (4) to (6) below.

  R_{xx}(f) = < x(f, t) x(f, t)^H >_t   ... (4)

  R_{xx}(f) = V(f) D(f) V(f)^H   ... (5)

  P(f) = D(f)^(-1/2) V(f)^H   ... (6)
Equation (4) above is the equation for obtaining the covariance matrix R_{xx}(f) of the observation signal in the f-th frequency bin. On the right-hand side, <.>_t denotes the operation of computing the average over a predetermined range of t (frame numbers). In the present disclosure, the range of t is the time length of the spectrogram, that is, the section in which the target sound is active (or a range containing that section). The superscript H denotes the Hermitian transpose (conjugate transpose).
Eigenvalue decomposition is applied to the covariance matrix R_{xx}(f) to decompose it into the product of three terms as on the right-hand side of Equation (5). V(f) is a matrix of eigenvectors, and D(f) is a diagonal matrix of eigenvalues. V(f) is a unitary matrix, so the inverse of V(f) and the Hermitian transpose of V(f) are identical.
The decorrelation matrix P(f) is computed by Equation (6). Since D(f) is a diagonal matrix, its (-1/2)-th power is obtained by raising each diagonal element to the power of -1/2.
Since the elements of the decorrelated observation signal u(f, t) obtained in this way are mutually uncorrelated, the covariance matrix computed by Equation (7) below is the identity matrix I.

  < u(f, t) u(f, t)^H >_t = I   ... (7)
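A minimal numpy sketch of the decorrelation of Equations (3) to (7) might look as follows; the array shapes are assumptions, and the whitening matrix follows Equation (6) as reconstructed above.

```python
import numpy as np

def decorrelate(X):
    """Whitening (decorrelation) per frequency bin, following Eqs. (3)-(6).

    X: complex array of shape (N, F, T), observation signal spectrograms
    Returns U (same shape) and the decorrelation matrices P of shape (F, N, N).
    """
    N, F, T = X.shape
    U = np.empty_like(X)
    P = np.empty((F, N, N), dtype=complex)
    for f in range(F):
        x_f = X[:, f, :]                               # (N, T)
        R_xx = (x_f @ x_f.conj().T) / T                # Eq. (4): time-averaged covariance
        eigval, V = np.linalg.eigh(R_xx)               # Eq. (5): eigendecomposition
        D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(eigval, 1e-12)))
        P[f] = D_inv_sqrt @ V.conj().T                 # Eq. (6): P(f) = D(f)^(-1/2) V(f)^H
        U[:, f, :] = P[f] @ x_f                        # Eq. (3): u(f,t) = P(f) x(f,t)
    return U, P
```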
Equation (8) below generates the separation results y(f, t) for all channels at f, t, and is given by the product of the separation matrix W(f) and u(f, t). The method for obtaining W(f) is described later.

  y(f, t) = W(f) u(f, t)   ... (8)
Equation (9) generates only the k-th separation result, where w_k(f) is the k-th row vector of the separation matrix W(f). Since the present disclosure generates only Y_1 as the extraction result, Equation (9) is basically used with k = 1 only.

  y_k(f, t) = w_k(f) u(f, t)   ... (9)
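Applying an estimated extraction filter as in Equation (9) with k = 1 can be sketched as follows; the array shapes are assumptions.

```python
import numpy as np

def apply_extraction_filter(w1, U):
    """Eq. (9) with k = 1: y_1(f,t) = w_1(f) u(f,t).

    w1: complex array of shape (F, N), one row vector per frequency bin (assumed layout)
    U:  complex array of shape (N, F, T), decorrelated observation signal spectrograms
    Returns Y1 of shape (F, T).
    """
    return np.einsum('fn,nft->ft', w1, U)
```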
When decorrelation has been performed as preprocessing for separation, it has been proved that it is sufficient to search for the separation matrix W(f) among unitary matrices. When the separation matrix W(f) is unitary, Equation (10) below holds, and the row vectors w_k(f) constituting W(f) satisfy Equation (11) below. Exploiting this property makes separation by the deflation method possible. (Like Equation (9), Equation (11) is basically used with k = 1 only.)

  W(f) W(f)^H = I   ... (10)

  w_k(f) w_k(f)^H = 1   ... (11)
The reference signal R is expressed as a matrix whose elements are r(f, t), as in Equation (12). Its shape is the same as that of the observation signal spectrogram X_k, but whereas the elements x_k(f, t) of X_k are complex-valued, the elements r(f, t) of R are non-negative real numbers.

  R = [ r(f, t) ]   ... (12)
The present disclosure estimates only w_1(f) instead of estimating all the elements of the separation matrix W(f). That is, only the elements used for generating the first separation result (the target sound extraction result) are estimated. In the following, the derivation of the formula for estimating w_1(f) is described. The derivation consists of the following three parts, each of which is explained in turn.
(1) Objective function
(2) Sound source model
(3) Update rules
(1) Objective function
The objective function used in the present disclosure is the negative log-likelihood, and it is basically the same as that used in Reference 1 and elsewhere. This objective function attains its minimum when the separation results are mutually independent. In the present disclosure, however, the dependence between the extraction result and the reference signal is also to be reflected in the objective function, so the objective function is derived as follows.
To reflect the above dependence in the objective function, the decorrelation and separation (extraction) equations are slightly modified. Equation (13) is a modification of Equation (3), the decorrelation equation, and Equation (14) is a modification of Equation (8), the separation equation. In both, the reference signal r(f, t) is appended to the vectors on both sides, and an element of 1, representing "passing the reference signal through unchanged", is appended to the matrix on the right-hand side. The matrices and vectors with these elements appended are denoted by adding a prime to the original symbols.

  u'(f, t) = P'(f) x'(f, t),  where  u'(f, t) = [ r(f, t); u(f, t) ],  x'(f, t) = [ r(f, t); x(f, t) ],  P'(f) = [ 1, 0; 0, P(f) ]   ... (13)

  y'(f, t) = W'(f) u'(f, t) = W'(f) P'(f) x'(f, t),  where  y'(f, t) = [ r(f, t); y(f, t) ],  W'(f) = [ 1, 0; 0, W(f) ]   ... (14)
As the objective function, the negative log-likelihood L of the reference signal and the observation signals, expressed by Equation (15) below, is used. In this equation, p(.) denotes the probability density function (hereinafter referred to as pdf as appropriate) of the signal in parentheses. When multiple elements are written inside the parentheses of a pdf (multiple variables, or a matrix or vector), it denotes the probability that those elements occur jointly. For example, p(R, X_1, ..., X_N) in Equation (15) is the joint probability of the reference signal R and the observation signal spectrograms X_1 to X_N.

  L = -log p(R, X_1, ..., X_N)   ... (15)
Even where the same letter p is used, different variables in the parentheses denote different probability distributions; for example, p(R) and p(Y_1) are different functions. Most of the probability density functions appearing in the equations below are virtual; the only one to which a concrete expression needs to be assigned is p(r(f,t), y_1(f,t)), which appears at the end of the derivation.
To optimize (in this case, minimize) with respect to the extraction filter w_1(f), the negative log-likelihood L must be transformed so that it contains w_1(f). To that end, the following assumptions are made about the observation signals and the separation results.
Assumption 1: The observation signal spectrograms have dependence in the channel direction (in other words, the spectrograms corresponding to the microphones resemble one another), but are independent in the time direction and the frequency direction. That is, within one spectrogram, the components constituting each point occur independently of one another and are not affected by other times or frequencies.
Assumption 2: The separation result spectrograms are independent in the channel direction as well as in the time and frequency directions. That is, the separation result spectrograms do not resemble one another.
Assumption 3: The separation result spectrogram Y_1 and the reference signal have a dependence. That is, their spectrograms resemble each other.
The process of transforming p(R, X_1, ..., X_N) is shown in Equations (16) to (21).

  p(R, X_1, ..., X_N) = prod_{f,t} p( r(f,t), x_1(f,t), ..., x_N(f,t) )   ... (16)

  = prod_{f,t} p( x'(f,t) )   ... (17)

  = prod_{f,t} |det( W'(f) P'(f) )| p( y'(f,t) )   ... (18)

  = prod_{f,t} |det( W'(f) )| |det( P'(f) )| p( y'(f,t) )   ... (19)

  = prod_{f,t} const * p( y'(f,t) )   ... (20)

  = prod_{f,t} const * p( r(f,t), y_1(f,t) ) * prod_{k=2}^{N} p( y_k(f,t) )   ... (21)
Since the joint probability of mutually independent variables can be decomposed into the product of their individual pdfs, Assumption 1 transforms the left-hand side of Equation (16) into the right-hand side. The expression inside the parentheses on the right-hand side is written as in Equation (17) using x'(f,t) introduced in Equation (13).
Equation (17) is transformed into Equations (18) and (19) using the relationship in the lower part of Equation (14). In these equations, det(.) denotes the determinant of the matrix in parentheses.
Equation (20) is an important transformation for the deflation method. The matrix W'(f) is a unitary matrix, like the separation matrix W(f), so its determinant is 1. Also, since the matrix P'(f) does not change during the separation, its determinant is a constant. Therefore, the two determinants can together be written as const (a constant).
Equation (21) is a transformation unique to the present disclosure. The components of y'(f,t) are r(f,t) and y_1(f,t) to y_N(f,t); by Assumptions 2 and 3, the probability density function taking these variables as arguments is decomposed into the product of p(r(f,t), y_1(f,t)), the joint probability of r(f,t) and y_1(f,t), and the individual probability density functions p(y_2(f,t)) to p(y_N(f,t)) of y_2(f,t) to y_N(f,t).
Substituting Equation (21) into Equation (15) gives Equation (22).

  L = - sum_{f,t} [ log p( r(f,t), y_1(f,t) ) + sum_{k=2}^{N} log p( y_k(f,t) ) ] + const   ... (22)

The extraction filter w_1(f) is a subset of the arguments that minimize Equation (22). Among the terms of Equation (22), w_1(f) appears only in y_1(f,t) for the specific f, so w_1(f) is obtained as the minimizing solution of Equation (23) below. However, to exclude the trivial solution w_1(f) = 0, the constraint that the norm of the vector is 1, expressed by Equation (11), is imposed.

  w_1(f) = argmin_{w_1(f)} ( - sum_t log p( r(f,t), y_1(f,t) ) )   subject to   w_1(f) w_1(f)^H = 1   ... (23)
When an extraction filter constrained to have a norm of 1 is applied to the decorrelated observation signal, the scale of each frequency bin of the resulting extraction result differs from the scale of the true target sound. Therefore, after the filter has been estimated, the extraction filter and the extraction result are corrected for each frequency bin. This post-processing is called rescaling. The concrete formula for rescaling is described later.
To solve the minimization problem of Equation (23), the following two points need to be made concrete.
- What expression to assign to p(r(f,t), y_1(f,t)), the joint probability of r(f,t) and y_1(f,t). This probability density function is called the sound source model.
- What algorithm to use to obtain the minimizing solution w_1(f). In general, w_1(f) cannot be obtained in one step and must be updated iteratively. The formula for updating w_1(f) is called the update rule.
Each of these is described below.
(2) Sound source model
The sound source model p(r(f,t), y_1(f,t)) is a pdf that takes the two variables, the reference signal r(f,t) and the extraction result y_1(f,t), as arguments, and represents the dependence between the two variables. The sound source model can be formulated based on various concepts. The present disclosure uses the following three:
a) bivariate spherical distributions;
b) divergence-based models;
c) time-frequency-varying variance models.
Each is described below.
a) Bivariate spherical distribution
A spherical distribution is a type of multivariate pdf. It is constructed by regarding the multiple arguments of the pdf as a vector and substituting the norm (L2 norm) of that vector into a univariate pdf. Using a spherical distribution in independent component analysis has the effect of making the variables used as its arguments resemble one another. For example, the technique described in Japanese Patent No. 4449871 exploits this property to solve the frequency permutation problem, namely that "which sound source appears in the k-th separation result differs from frequency bin to frequency bin".
When a spherical distribution taking the reference signal and the extraction result as arguments is used as the sound source model of the present disclosure, the two can be made similar. The spherical distribution used here can be expressed in the general form of Equation (24) below. In this equation, the function F is an arbitrary univariate pdf, and c_1 and c_2 are positive constants; by changing these values, the influence of the reference signal on the extraction result can be adjusted. Using the Laplace distribution as the univariate pdf, as in Japanese Patent No. 4449871, yields Equation (25) below, hereinafter called the bivariate Laplace distribution.

  p( r(f,t), y_1(f,t) ) = F( sqrt( c_1 r(f,t)^2 + c_2 |y_1(f,t)|^2 ) )   ... (24)

  p( r(f,t), y_1(f,t) ) ∝ exp( - sqrt( c_1 r(f,t)^2 + c_2 |y_1(f,t)|^2 ) )   ... (25)
b) Divergence-based models
Another type of sound source model is a pdf based on a divergence, a generalization of distance measures, expressed in the form of Equation (26) below. In this equation, divergence(r(f,t), |y_1(f,t)|) denotes an arbitrary divergence between the reference signal r(f,t) and the amplitude |y_1(f,t)| of the extraction result.

  p( r(f,t), y_1(f,t) ) = α exp( - divergence( r(f,t), |y_1(f,t)| ) )   ... (26)
Here, α is a positive constant, a correction term for making the right-hand side of Equation (26) satisfy the conditions of a pdf; since the value of α is irrelevant to the minimization problem of Equation (23), α = 1 may be used. Substituting this pdf into Equation (23) becomes equivalent to the problem of minimizing the divergence between r(f,t) and |y_1(f,t)|, so the two inevitably become similar.
When the Euclidean distance is used as the divergence, Equation (27) below is obtained. When the Itakura-Saito divergence is used, Equation (28) below is obtained; since the Itakura-Saito divergence is a distance measure between power spectra, the squared values of both r(f,t) and |y_1(f,t)| are used. Alternatively, a distance measure analogous to the Itakura-Saito divergence may be computed on the amplitude spectra, in which case Equation (29) below is obtained.

  p( r(f,t), y_1(f,t) ) = α exp( - ( r(f,t) - |y_1(f,t)| )^2 )   ... (27)

  p( r(f,t), y_1(f,t) ) = α exp( - ( r(f,t)^2 / |y_1(f,t)|^2 - log( r(f,t)^2 / |y_1(f,t)|^2 ) - 1 ) )   ... (28)

  p( r(f,t), y_1(f,t) ) = α exp( - ( r(f,t) / |y_1(f,t)| - log( r(f,t) / |y_1(f,t)| ) - 1 ) )   ... (29)
Equation (30) below is a pdf based on yet another divergence. The more similar r(f,t) and |y_1(f,t)| are, the closer their ratio is to 1, so the squared error between that ratio and 1 acts as a divergence.

  p( r(f,t), y_1(f,t) ) = α exp( - ( r(f,t) / |y_1(f,t)| - 1 )^2 )   ... (30)
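The divergence terms of Equations (27) to (30) could be evaluated as in the following sketch; the argument order inside the Itakura-Saito-style terms and the small constant eps are assumptions for illustration.

```python
import numpy as np

def divergence_objective(Y1, R, kind="euclid", eps=1e-12):
    """Evaluate sum of divergence(r(f,t), |y_1(f,t)|) over the spectrogram,
    corresponding to the models of Eqs. (27)-(30)."""
    A = np.abs(Y1)
    if kind == "euclid":                       # Eq. (27)
        d = (R - A) ** 2
    elif kind == "itakura_saito_power":        # Eq. (28)
        ratio = (R ** 2 + eps) / (A ** 2 + eps)
        d = ratio - np.log(ratio) - 1.0
    elif kind == "itakura_saito_amplitude":    # Eq. (29)
        ratio = (R + eps) / (A + eps)
        d = ratio - np.log(ratio) - 1.0
    else:                                      # Eq. (30): squared error between the ratio and 1
        d = ((R + eps) / (A + eps) - 1.0) ** 2
    return np.sum(d)
```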
c) Time-frequency-varying variance models
As another sound source model, a time-frequency-varying variance (TFVV) model is also possible. This is a model in which each point of the spectrogram has a variance or standard deviation that differs per time and per frequency. The rough amplitude spectrogram serving as the reference signal is then interpreted as representing the standard deviation of each point (or some value dependent on the standard deviation).
Assuming as the distribution a Laplace distribution with time-frequency-varying variance (hereinafter, TFVV Laplace distribution) yields Equation (31) below. In this equation, α is, as in Equation (26), a correction term for making the right-hand side satisfy the conditions of a pdf, and α = 1 may be used. β is a term for adjusting the magnitude of the influence of the reference signal on the extraction result. The true TFVV Laplace distribution corresponds to β = 1, but other values such as 1/2 or 2 may also be used.

  p( r(f,t), y_1(f,t) ) = α exp( - |y_1(f,t)| / r(f,t)^β )   ... (31)
Similarly, assuming a TFVV Gaussian distribution yields Equation (32) below, while assuming a TFVV Student-t distribution yields the sound source model of Equation (33) below.

  p( r(f,t), y_1(f,t) ) = α exp( - |y_1(f,t)|^2 / r(f,t)^β )   ... (32)

  p( r(f,t), y_1(f,t) ) = α ( 1 + |y_1(f,t)|^2 / ( ν r(f,t)^β ) )^( -(ν+2)/2 )   ... (33)

In Equation (33), ν (nu) is a parameter called the degrees of freedom, and the shape of the distribution can be changed by changing this value. For example, ν = 1 corresponds to the Cauchy distribution and ν → ∞ to the Gaussian distribution.
The sound source models of Equations (32) and (33) are also used in Reference 1, but in the present disclosure they are used for extraction rather than for separation.
(3) Update rules
In most cases there is no closed-form solution (a solution without iteration) for w_1(f), the solution of the minimization problem of Equation (23), and an iterative algorithm must be used. (However, when the TFVV Gaussian distribution of Equation (32) is used as the sound source model, a closed-form solution exists, as described later.)
For Equations (25), (31), and (33), a fast and stable algorithm called the auxiliary function method can be applied. For Equations (27) to (30), another algorithm called the fixed-point method can be applied.
In the following, the update rule for the case of Equation (32) is described first, and then the update rules based on the auxiliary function method and the fixed-point method are described.
Substituting the TFVV Gaussian distribution of Equation (32) into Equation (23) and ignoring the terms irrelevant to the minimization yields Equation (34) below.

  w_1(f) = argmin_{w_1(f)} w_1(f) { sum_t u(f,t) u(f,t)^H / r(f,t)^β } w_1(f)^H   subject to   w_1(f) w_1(f)^H = 1   ... (34)

This equation can be interpreted as a minimization problem involving a weighted covariance matrix of u(f,t) and can be solved using eigenvalue decomposition.
(Strictly speaking, the expression inside the braces on the right-hand side of Equation (34) is not the weighted covariance matrix itself but T times it; since this difference does not affect the solution of the minimization problem of Equation (34), the expression inside the braces, including the summation, is hereinafter also called the weighted covariance matrix.)
Let eig(A) denote a function that takes a matrix A as its argument, performs eigenvalue decomposition on that matrix, and returns all the eigenvectors. Using this function, the eigenvectors of the weighted covariance matrix of Equation (34) can be written as in Equation (35) below.

  [ a_min(f), ..., a_max(f) ] = eig( sum_t u(f,t) u(f,t)^H / r(f,t)^β )   ... (35)

On the left-hand side of Equation (35), a_min(f), ..., a_max(f) are the eigenvectors, where a_min(f) corresponds to the smallest eigenvalue and a_max(f) to the largest. The norm of each eigenvector is 1, and the eigenvectors are assumed to be mutually orthogonal. The w_1(f) that minimizes Equation (34) is the Hermitian transpose of the eigenvector corresponding to the smallest eigenvalue, as shown in Equation (36) below.

  w_1(f) = a_min(f)^H   ... (36)
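A minimal numpy sketch of this closed-form solution under the TFVV Gaussian model (Equations (34) to (36)) might look as follows; the array shapes and the exponent beta applied to the reference are assumptions carried over from the reconstruction of Equation (32).

```python
import numpy as np

def estimate_w1_tfvv_gauss(U, R, beta=2.0):
    """Closed-form extraction filter under the TFVV Gaussian model (Eqs. (34)-(36)).

    U:    complex array (N, F, T), decorrelated observation signal spectrograms
    R:    non-negative array (F, T), reference signal (rough target amplitude spectrogram)
    beta: exponent applied to the reference (assumption)
    Returns w1 of shape (F, N).
    """
    N, F, T = U.shape
    w1 = np.empty((F, N), dtype=complex)
    eps = 1e-12
    for f in range(F):
        u_f = U[:, f, :]                                   # (N, T)
        weights = 1.0 / np.maximum(R[f, :] ** beta, eps)   # 1 / r(f,t)^beta
        C = (u_f * weights) @ u_f.conj().T                 # weighted covariance, Eq. (34)
        eigval, eigvec = np.linalg.eigh(C)                 # Eq. (35), eigenvalues ascending
        a_min = eigvec[:, 0]                               # eigenvector of the smallest eigenvalue
        w1[f, :] = a_min.conj()                            # Eq. (36): w_1(f) = a_min(f)^H
    return w1
```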
Next, the method of deriving update rules by applying the auxiliary function method to Equations (25), (31), and (33) is described.
The auxiliary function method is one way of solving optimization problems efficiently; details are described in Japanese Patent Application Laid-Open No. 2011-175114 and Japanese Patent Application Laid-Open No. 2014-219467.
Substituting the TFVV Laplace distribution of Equation (31) into Equation (23) and ignoring the terms irrelevant to the minimization yields Equation (37) below.

  w_1(f) = argmin_{w_1(f)} sum_t |y_1(f,t)| / r(f,t)^β   subject to   w_1(f) w_1(f)^H = 1   ... (37)

The solution of this minimization problem cannot be obtained in closed form.
Therefore, an inequality that bounds the expression from above, as in Equation (38), is prepared.

  |y_1(f,t)| <= ( |y_1(f,t)|^2 / b(f,t) + b(f,t) ) / 2   ... (38)
The right-hand side of Equation (38) is called the auxiliary function, and b(f,t) in it is called the auxiliary variable. The inequality holds with equality when b(f,t) = |y_1(f,t)|. Applying this inequality to Equation (37) yields Equation (39) below; hereinafter, the right-hand side of this inequality is denoted G.

  sum_t |y_1(f,t)| / r(f,t)^β  <=  sum_t ( |y_1(f,t)|^2 / b(f,t) + b(f,t) ) / ( 2 r(f,t)^β )  =  G   ... (39)
In the auxiliary function method, the minimization problem is solved quickly and stably by alternately repeating the following two steps.
1. As shown in Equation (40) below, fix w_1(f) and find the b(f,t) that minimizes G.

  b(f,t) = |y_1(f,t)|   ... (40)

2. As shown in Equation (41) below, fix b(f,t) and find the w_1(f) that minimizes G.

  w_1(f) = argmin_{w_1(f)} w_1(f) { sum_t u(f,t) u(f,t)^H / ( b(f,t) r(f,t)^β ) } w_1(f)^H   subject to   w_1(f) w_1(f)^H = 1   ... (41)
Equation (40) is minimized when equality holds in equation (38). Since the value of y_1(f,t) changes every time w_1(f) changes, it is recomputed using equation (9). Since equation (41) is a weighted covariance matrix minimization problem like equation (34), it can be solved by eigenvalue decomposition.
When the eigenvectors of the weighted covariance matrix in equation (41) are computed by equation (42) below, the solution w_1(f) of equation (41) is the Hermitian transpose of the eigenvector corresponding to the smallest eigenvalue (equation (36)).
Figure JPOXMLDOC01-appb-I000042
Note that at the first iteration neither w_1(f) nor y_1(f,t) is known, so equation (40) cannot be applied. The initial value of the auxiliary variable b(f,t) is therefore computed by one of the following methods.
a) Use a normalized version of the reference signal as the auxiliary variable, that is, b(f,t) = normalize(r(f,t)).
b) Compute a provisional value for the extraction result y_1(f,t), and then compute the auxiliary variable from it with equation (40).
c) Substitute a provisional value for w_1(f) and compute equation (40).
The function normalize() in a) is defined by equation (43) below, in which s(t) denotes an arbitrary time-series signal. The role of normalize() is to normalize the mean squared absolute value of the signal to 1.
Figure JPOXMLDOC01-appb-I000043
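By way of illustration, the following Python/NumPy sketch shows one possible reading of the normalize() function of equation (43) and of initialization method a); the function names, array shapes, and the choice of normalizing each frequency bin independently are assumptions made here for the example, not requirements of the present disclosure.

import numpy as np

def normalize(s):
    # Equation (43): scale the time series s(t) so that the mean of |s(t)|^2 becomes 1.
    return s / np.sqrt(np.mean(np.abs(s) ** 2))

def init_auxiliary_variable(r):
    # Method a): b(f, t) = normalize(r(f, t)).
    # r is assumed to be the reference amplitude spectrogram of shape (F, T), and
    # normalization is assumed to be applied independently in each frequency bin.
    return np.stack([normalize(r[f, :]) for f in range(r.shape[0])])

# Hypothetical usage with a random reference spectrogram (F = 4 bins, T = 10 frames).
rng = np.random.default_rng(0)
r = np.abs(rng.standard_normal((4, 10)))
b = init_auxiliary_variable(r)
print(np.mean(np.abs(b) ** 2, axis=1))  # approximately 1.0 in every bin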
As an example of y_1(f,t) in b) above, operations such as selecting one channel of the observation signal or averaging the observation signals of all channels are conceivable. For example, when the microphone arrangement of FIG. 5 described later is used, there is always a microphone assigned to the speaker who is speaking, so it is preferable to use the observation signal of that microphone as the provisional extraction result. If the number of that microphone is k, then y_1(f,t) = normalize(x_k(f,t)).
As for the provisional value in c) above, besides a simple choice such as a vector whose elements are all the same value, the extraction filter estimated in the previous target sound section may be saved and used as the initial value of w_1(f) when processing the next target sound section. For example, when sound source extraction is performed for utterance (3-2) shown in FIG. 3, the extraction filter estimated for the previous utterance (3-1) of the same speaker is used as the provisional value of w_1(f) for the current extraction. Alternatively, as another variant of c), w_1(f) may be obtained at the first iteration only by using the update equation derived from the TFVV Gaussian distribution.
The bivariate Laplace distribution represented by equation (25) can be handled in the same way using an auxiliary function. Substituting equation (25) into equation (23) yields equation (44) below.
Figure JPOXMLDOC01-appb-I000044
Here, an auxiliary function such as equation (45) below is prepared.
Figure JPOXMLDOC01-appb-I000045
Then, the step of finding the auxiliary variable b(f,t) (corresponding to equation (40)) can be expressed as equation (46).
Figure JPOXMLDOC01-appb-I000046
The step of finding the extraction filter w_1(f) (corresponding to equation (41)) can be expressed as equation (47) below.
Figure JPOXMLDOC01-appb-I000047
This minimization problem can be solved by the eigenvalue decomposition of equation (48) below.
Figure JPOXMLDOC01-appb-I000048
Next, the case of the TFVV Student-t distribution represented by equation (33) is described. Since an example of applying the auxiliary function method to the TFVV Student-t distribution is described in Reference 1, only the update equations are given here.
The step of finding the auxiliary variable b(f,t) is given by equation (49) below.
Figure JPOXMLDOC01-appb-I000049
The degree of freedom ν functions as a parameter that adjusts the relative influence of the reference signal r(f,t) and of the extraction result y_1(f,t) obtained during the iterations. When ν = 0 the reference signal is ignored; when ν is at least 0 and less than 2, the extraction result has a larger influence than the reference signal; when ν is greater than 2, the reference signal has the larger influence; and in the limit ν → ∞ the extraction result is ignored, which is equivalent to the TFVV Gaussian distribution.
The step of finding the extraction filter w_1(f) is given by equation (50) below.
Figure JPOXMLDOC01-appb-I000050
Since equation (50) is identical to equation (47) for the bivariate Laplace distribution, the extraction filter can likewise be obtained by equation (48).
Next, a method of deriving update equations from equations (27) to (30), the divergence-based sound source models, is described. Substituting these pdfs into equation (23) yields, in each case, an expression that minimizes the sum of the divergence in the f-th frequency bin, but no suitable auxiliary function has been found for these divergences. Therefore, another optimization algorithm, the fixed-point method, is applied.
The fixed-point algorithm expresses, as an equation, the condition that holds when the parameter to be optimized (in the present disclosure, the extraction filter w_1(f)) has converged, and derives an update equation by rearranging that condition into the fixed-point form w_1(f) = J(w_1(f)). In the present disclosure, the condition used is that the partial derivative with respect to the parameter is zero, and a concrete equation is derived by performing the partial differentiation shown in equation (51) below.
Figure JPOXMLDOC01-appb-I000051
The left-hand side of equation (51) is the partial derivative with respect to conj(w_1(f)). Equation (51) is then rearranged to obtain the form of equation (52).
Figure JPOXMLDOC01-appb-I000052
In the fixed-point algorithm, equation (53) below, obtained by replacing the equal sign of equation (52) with an assignment, is executed repeatedly. However, since w_1(f) must satisfy the constraint of equation (11) in the present disclosure, norm normalization by equation (54) is also performed after equation (53).
Figure JPOXMLDOC01-appb-I000053
Figure JPOXMLDOC01-appb-I000054
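As a rough illustration of how equations (53) and (54) are used together, the following sketch alternates an abstract update map J(.) with norm normalization; the unit-norm reading of the constraint of equation (11) and the toy update map in the usage example are assumptions made here for the example, since the concrete update differs for each source model (equations (55) to (60)).

import numpy as np

def fixed_point_iteration(w, update_map, n_iter=20, tol=1e-6):
    # Repeat: w <- J(w) (equation (53)), then normalize the norm of w (equation (54),
    # assumed here to mean w <- w / ||w||), until w stops changing.
    for _ in range(n_iter):
        w_new = update_map(w)
        w_new = w_new / np.linalg.norm(w_new)
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w

# Hypothetical usage: with this toy map the fixed point is the dominant eigenvector of A.
A = np.array([[2.0, 0.3], [0.3, 1.0]])
w = fixed_point_iteration(np.array([1.0, 0.0]), lambda w: A @ w)
print(w)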
The update equations corresponding to equations (27) to (30) are described below. In each case only the equation corresponding to equation (53) is given, but in the actual extraction processing, the norm normalization of equation (54) is also performed after the assignment.
The update equation derived from equation (27), the pdf corresponding to the Euclidean distance, is equation (55) below.
Figure JPOXMLDOC01-appb-I000055
Equation (55) is written in two rows: the upper row is intended to be used after y_1(f,t) has been computed with equation (9), whereas the lower row is intended to be used with w_1(f) and u(f,t) directly, without computing y_1(f,t). The same applies to equations (56) to (60) described later.
Only at the first iteration, both the extraction filter w_1(f) and the extraction result y_1(f,t) are unknown, so w_1(f) is computed by one of the following methods.
a) Compute a provisional value for the extraction result y_1(f,t), and then compute w_1(f) from it with the upper row of equation (55).
b) Substitute a provisional value for w_1(f), and then compute w_1(f) from it with the lower row of equation (55).
For the provisional value of y_1(f,t) in a), the method of b) in the description of equation (40) can be used. Similarly, for the provisional value of w_1(f) in b), the method of c) in the description of equation (40) can be used.
The update equations derived from equation (28), the pdf corresponding to the Itakura-Saito divergence (power spectrogram version), are equations (56) and (57) below.
Figure JPOXMLDOC01-appb-I000056
Equation (57) is as follows.
Figure JPOXMLDOC01-appb-I000057
Since the transformation into the form of equation (52) is possible in two ways, there are also two update equations.
The second term on the right-hand side of the lower row of equation (56) and the third term on the right-hand side of the lower row of equation (57) are both composed only of u(f,t) and r(f,t) and are constant during the iterations. Therefore, these terms need to be computed only once before the iterations, and the inverse matrix in equation (57) likewise needs to be computed only once.
The update equations derived from equation (29), the pdf corresponding to the Itakura-Saito divergence (amplitude spectrogram version), are equations (58) and (59) below. Two forms are possible here as well.
Figure JPOXMLDOC01-appb-I000058
Equation (59) is as follows.
Figure JPOXMLDOC01-appb-I000059
The update equation derived from equation (30) is equation (60) below. Here too, the last term on the right-hand side needs to be computed only once before the iterations.
Figure JPOXMLDOC01-appb-I000060
The processing described above is applied to the embodiment of the present disclosure described next.
<One Embodiment>
[Configuration example of the sound source extraction device]
FIG. 4 is a diagram showing a configuration example of a sound source extraction device (sound source extraction device 100) which is an example of the signal processing device according to the present embodiment. The sound source extraction device 100 includes, for example, a plurality of microphones 11, an AD (Analog to Digital) conversion unit 12, an STFT (Short-Time Fourier Transform) unit 13, an observation signal buffer 14, a section estimation unit 15, a reference signal generation unit 16, a sound source extraction unit 17, and a control unit 18. The sound source extraction device 100 also includes a post-stage processing unit 19 and a section/reference signal estimation sensor 20 as needed.
The plurality of microphones 11 are installed at mutually different positions. There are several variations in how the microphones are arranged, as described later. A mixed sound signal in which the target sound and sounds other than the target sound are mixed is input through the microphones 11.
The AD conversion unit 12 converts the multi-channel signals acquired by the respective microphones 11 into digital signals channel by channel. These signals are referred to, as appropriate, as (time-domain) observation signals.
The STFT unit 13 converts the observation signals into signals in the time-frequency domain by applying the short-time Fourier transform to them. The time-frequency domain observation signals are sent to the observation signal buffer 14 and the section estimation unit 15.
The observation signal buffer 14 accumulates observation signals for a predetermined time (number of frames). The observation signals are stored frame by frame, and when a request for observation signals of a certain time range is received from another module, the observation signals corresponding to that time range are returned. The signals accumulated here are used in the reference signal generation unit 16 and the sound source extraction unit 17.
The section estimation unit 15 detects sections in which the target sound is contained in the mixed sound signal. Specifically, the section estimation unit 15 detects, for example, the start time (the time when the sound begins) and the end time (the time when the sound ends) of the target sound. Which technique is used for this section estimation depends on the usage scene of the present embodiment and on the microphone arrangement, so details are described later.
The reference signal generation unit 16 generates a reference signal corresponding to the target sound based on the mixed sound signal. For example, the reference signal generation unit 16 estimates a rough amplitude spectrogram of the target sound. Since the processing performed by the reference signal generation unit 16 depends on the usage scene of the present embodiment and on the microphone arrangement, details are described later.
The sound source extraction unit 17 extracts, from the mixed sound signal, a signal that is similar to the reference signal and in which the target sound is further emphasized. Specifically, the sound source extraction unit 17 estimates the target sound using the observation signal corresponding to the section in which the target sound is present and the reference signal, or it estimates an extraction filter for generating such an estimation result from the observation signal.
The output of the sound source extraction unit 17 is sent to the post-stage processing unit 19 as needed. An example of the post-stage processing performed by the post-stage processing unit 19 is speech recognition. When combined with speech recognition, the sound source extraction unit 17 outputs the time-domain extraction result, that is, a speech waveform, and the speech recognition unit performs recognition processing on that waveform.
Some speech recognizers have a speech section detection function, but since the present embodiment includes the equivalent section estimation unit 15, the speech section detection function on the speech recognition side can be omitted. Speech recognizers also often include an STFT for extracting, from the waveform, the speech features required for recognition; when combined with the present embodiment, the STFT on the speech recognition side may be omitted. When the STFT on the speech recognition side is omitted, the sound source extraction unit 17 outputs the time-frequency domain extraction result, that is, a spectrogram, and the speech recognition side converts that spectrogram into speech features.
The control unit 18 comprehensively controls each unit of the sound source extraction device 100, for example, the operation of each of the units described above. Although omitted in FIG. 4, the control unit 18 and the above-mentioned functional blocks are connected to one another.
The section/reference signal estimation sensor 20 is a sensor, separate from the microphones 11, intended to be used for section estimation or reference signal generation. In FIG. 4, the post-stage processing unit 19 and the section/reference signal estimation sensor 20 are shown in parentheses to indicate that they can be omitted from the sound source extraction device 100. That is, if the accuracy of section estimation or reference signal generation can be improved by providing a dedicated sensor different from the microphones 11, such a sensor may be used.
For example, when a method using lip images, as described in JP H10-51889 A and the like, is used to detect utterance sections, an image sensor (camera) can be employed as the sensor. Alternatively, the following sensors, which are used as auxiliary sensors in Japanese Patent Application No. 2019-073542 proposed by the present inventor, may be provided, and section estimation or reference signal generation may be performed using the signals acquired by them.
- A microphone of the type used in close contact with the body, such as a bone-conduction microphone or a throat microphone.
- A sensor capable of observing vibrations of the skin surface near the speaker's mouth or throat, for example a combination of a laser pointer and an optical sensor.
[Section estimation and reference signal generation]
Several variations of the usage scene of the present embodiment and of the arrangement of the microphones 11 are conceivable, and the techniques that can be applied for section estimation and reference signal generation differ for each. To explain each variation, it is necessary to clarify whether sections of the target sound can overlap with one another and, if so, how such overlap is handled. Three typical usage scenes and arrangements are described below with reference to FIGS. 5 to 7, respectively.
FIG. 5 assumes a situation in which N (two or more) speakers are present in an environment and a microphone is assigned to each speaker. "A microphone is assigned" means that each speaker wears a pin microphone, a headset microphone, or the like, or that a microphone is installed at close range to each speaker. The N speakers are denoted S1, S2, ..., Sn, and the microphones assigned to them M1, M2, ..., Mn. In addition, zero or more interfering sound sources Ns are present.
Such a situation corresponds, for example, to a meeting held in a room where speech recognition is applied to the audio picked up by each speaker's microphone in order to produce the minutes of the meeting automatically. In this case, utterances may overlap, and when they do, a signal in which the voices are mixed is observed at each microphone. Interfering sound sources may include the fan noise of a projector or an air conditioner, or playback sound emitted from a device equipped with a loudspeaker, and these sounds are also included in the observation signal of each microphone. All of these cause recognition errors, but with the sound source extraction technique of the present embodiment, only the voice of the speaker corresponding to each microphone is retained while the other sound sources (other speakers and interfering sources) are removed (suppressed), so speech recognition accuracy can be improved.
Section detection methods and reference signal generation methods usable in such a situation are described below. In the following, among the sounds observed at each microphone, the voice of the corresponding (target) speaker is referred to as the main voice or main utterance, and the voice of another speaker as wraparound voice or crosstalk, as appropriate.
As the section detection method, the main-utterance detection described in Japanese Patent Application No. 2019-227192 can be used. In that application, training a neural network realizes a detector that responds to the main voice while ignoring crosstalk. Since it also handles overlapping utterances, the section and speaker of each utterance can be estimated even when utterances overlap, as shown in FIG. 3.
At least two reference signal generation methods are possible. One is to generate the reference signal directly from the signal observed by the microphone assigned to the speaker. For example, the signal observed by microphone M1 in FIG. 5 is a mixture of all sound sources, but the voice of speaker S1, the nearest source, is picked up loudly while the other sources are picked up comparatively quietly. Therefore, if the observation signal of microphone M1 is cut out according to the utterance section of speaker S1, the short-time Fourier transform is applied, and the absolute value is taken to generate an amplitude spectrogram, the result is a rough amplitude spectrogram of the target sound and can be used as the reference signal in the present embodiment.
The other method is to use the crosstalk reduction technique described in the aforementioned Japanese Patent Application No. 2019-227192. In that application, a neural network is trained to remove (reduce) crosstalk from a signal in which the main voice and crosstalk are mixed, leaving the main voice. The output of this neural network is either the amplitude spectrogram of the crosstalk reduction result or a time-frequency mask; the former can be used directly as the reference signal. Even in the latter case, applying the time-frequency mask to the amplitude spectrogram of the observation signal yields the amplitude spectrogram of the crosstalk removal result, which can then be used as the reference signal.
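To make the latter case concrete, the following sketch applies a time-frequency mask (standing in for the output of the crosstalk-reduction network) to the amplitude spectrogram of the observation signal to obtain a reference signal; the array shapes and the random placeholder data are assumptions made here for the example.

import numpy as np

def reference_from_mask(obs_spectrogram, tf_mask):
    # Element-wise masking of the observation amplitude spectrogram; the result is
    # a rough amplitude spectrogram of the main voice, usable as r(f, t).
    return tf_mask * np.abs(obs_spectrogram)

# Hypothetical usage with random data (F = 257 bins, T = 100 frames).
rng = np.random.default_rng(0)
x = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))
mask = rng.uniform(0.0, 1.0, size=(257, 100))  # stands in for the network output
r = reference_from_mask(x, mask)
print(r.shape)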
Next, reference signal generation and related processing in a usage scene different from that of FIG. 5 are described with reference to FIG. 6. FIG. 6 assumes an environment in which there are one or more speakers and one or more interfering sound sources. Whereas FIG. 5 focused on the overlap of utterances rather than on the interfering source Ns, the example shown in FIG. 6 focuses on obtaining clean speech in a noisy environment containing loud interfering sounds. However, when there are two or more speakers, overlapping utterances are also an issue.
There are n speakers, denoted speaker S1 to speaker Sn, with n being 1 or more. Although only one interfering sound source Ns is shown in FIG. 6, the number is arbitrary.
Two types of sensors are used. One is a sensor worn by each speaker or installed in the immediate vicinity of each speaker (a sensor corresponding to the section/reference signal estimation sensor 20), hereafter referred to as sensors SE (SE1, SE2, ..., SEn) as appropriate. The other is a microphone array 11A composed of a plurality of microphones 11 whose positions are fixed.
The section/reference signal estimation sensor 20 may be of the same type as the microphones in FIG. 5 (so-called air-conduction microphones, which pick up sound propagating through the air), but, as described with reference to FIG. 4, a microphone of the type used in close contact with the body, such as a bone-conduction microphone or a throat microphone, or a sensor capable of observing vibrations of the skin surface near the speaker's mouth or throat, may also be used. In any case, since the sensors SE are closer to or in contact with each speaker compared with the microphone array, the utterance of the speaker corresponding to each sensor can be recorded with a high signal-to-noise ratio.
As the microphone array 11A, besides the form in which a plurality of microphones are mounted on a single device, a form in which microphones are installed at multiple locations in a space, called distributed microphones, is also possible. Examples of distributed microphones include installing microphones on the walls or ceiling of a room, or on the seats, walls, ceiling, dashboard, and so on inside an automobile.
In this example, the signals acquired by the sensors SE1 to SEn corresponding to the section/reference signal estimation sensor 20 are used for section estimation and reference signal generation, and the multi-channel observation signal acquired from the microphone array 11A is used for sound source extraction. When air-conduction microphones are used as the sensors SE, the same section estimation and reference signal generation methods as those described with reference to FIG. 5 can be used.
On the other hand, when a close-contact microphone is used, besides the methods described for FIG. 5, methods that exploit the fact that a signal with little contamination from interfering sounds and other speakers' utterances can be obtained are also usable. For example, for section estimation, a method that thresholds the power of the input signal can be used, and as the reference signal, the amplitude spectrogram generated from the input signal can be used as it is. The sound recorded by a close-contact microphone has attenuated high frequencies and may also contain body-generated sounds such as swallowing, so it is not necessarily suitable as an input to speech recognition and the like, but it can be used effectively for section estimation and reference signal generation.
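The power-threshold approach mentioned above can be sketched as follows; the frame length, hop size, and threshold value are assumptions made here for the example.

import numpy as np

def detect_sections(signal, frame_len=512, hop=256, threshold_db=-40.0):
    # Mark frames whose power exceeds a threshold (in dB relative to the loudest frame)
    # as belonging to the target sound section.  Returns one boolean per frame.
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    power = np.array([np.mean(signal[i * hop:i * hop + frame_len] ** 2)
                      for i in range(n_frames)])
    power_db = 10.0 * np.log10(power / (np.max(power) + 1e-12) + 1e-12)
    return power_db > threshold_db

# Hypothetical usage: near-silence followed by a burst of signal.
rng = np.random.default_rng(0)
sig = np.concatenate([0.001 * rng.standard_normal(8000), 0.5 * rng.standard_normal(8000)])
print(detect_sections(sig))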
When a sensor other than a microphone, such as an optical sensor, is used as the sensor SE, the method described in Japanese Patent Application No. 2019-227192 can be used. In that application, a neural network is trained in advance on the correspondence from the sound acquired by an air-conduction microphone (a mixture of the target sound and interfering sound) and the signal acquired by the auxiliary sensor (some signal corresponding to the target sound) to the clean target sound; at inference time, the signals acquired by the air-conduction microphone and the auxiliary sensor are input to the neural network to generate a nearly clean target sound. Since the output of that neural network is an amplitude spectrogram (or a time-frequency mask), it can be used as the reference signal of the present embodiment (or used to generate the reference signal). Moreover, since a modification that estimates the section in which the target sound is present while generating the clean target sound is also mentioned, the method can also be used as a section detection means.
Sound source extraction is basically performed using the observation signal acquired by the microphone array 11A. However, when air-conduction microphones are used as the sensors SE, the observation signals acquired by them can also be added. That is, if the microphone array 11A consists of N microphones, sound source extraction may be performed using the (N+m)-channel observation signal obtained by combining them with the m section/reference signal estimation sensors. In that case, since multiple air-conduction microphones exist even when N = 1, a single microphone may be used instead of the microphone array 11A.
Similarly, in section estimation and reference signal generation, signals derived from the microphone array may be used in addition to the sensors SE. Since the microphone array 11A is distant from every speaker, each speaker's utterance is always observed there as crosstalk. By comparing that signal with the signal of the section/reference signal estimation microphone, the accuracy of section estimation, in particular when utterances overlap, can be expected to improve.
FIG. 7 shows a microphone arrangement different from that of FIG. 6. It is the same as FIG. 6 in assuming an environment with one or more speakers and one or more interfering sound sources, but only the microphone array 11A is used, and there are no sensors installed close to each speaker. As in FIG. 6, the microphone array 11A may take the form of a plurality of microphones mounted on a single device, a plurality of microphones installed in a space (distributed microphones), and so on.
In such a situation, the issue is how to perform the utterance section estimation and reference signal estimation that the sound source extraction of the present disclosure presupposes, and the applicable techniques differ depending on whether mixtures of voices occur infrequently or frequently. Each case is described below.
The case in which mixtures of voices occur infrequently is the case in which only one speaker (that is, only speaker S1) is present in the environment and the interfering sound source Ns can be regarded as non-speech. In that case, as the section estimation method, a voice activity detection technique focusing on "speech-likeness", as described in Japanese Patent No. 4182444 and the like, can be applied. That is, in the environment of FIG. 7, when the only "speech-like" signal is considered to be the utterance of speaker S1, non-speech signals are ignored and the portions (timings) containing speech-like signals are detected as target sound sections.
As the reference signal generation method, a technique called denoising, as described in Reference 3, is applicable: a signal in which speech and non-speech are mixed is input, the non-speech is removed, and the speech is kept. A great variety of denoising methods can be applied; for example, the method below uses a neural network whose output is an amplitude spectrogram, so that output can be used directly as the reference signal.
"Reference 3:
Liu, D., Smaragdis, P. & Kim, M., "Experiments on deep learning for speech denoising," Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2014, pp. 2685-2689."
On the other hand, the case in which mixtures of voices occur frequently is, for example, the case in which multiple speakers are conversing in an environment and their utterances overlap, or the case in which, even with a single speaker, the interfering sound source is speech. An example of the latter is speech output from a loudspeaker such as a television or radio. In such cases, a method that is also applicable to mixtures of voices must be used for utterance section detection. For example, the following techniques are applicable.
a) Voice activity detection using sound source direction estimation (for example, the methods described in JP 2010-121975 A and JP 2012-150237 A).
b) Voice activity detection using face images (lip images) (for example, the methods described in JP H10-51889 A and JP 2011-191423 A).
Since the microphone arrangement shown in FIG. 7 includes a microphone array, the sound source direction estimation that a) presupposes can be applied. In addition, if an image sensor (camera) is used as the section/reference signal estimation sensor 20 of the example shown in FIG. 4, b) can also be applied. With either method, the direction of the utterance is also known at the time the utterance section is detected (in method b), the utterance direction can be computed from the position of the lips in the image), so that value can be used for reference signal generation. Hereafter, the sound source direction estimated in the utterance section estimation is referred to as θ, as appropriate.
The reference signal generation method also needs to handle mixtures of voices, and the following techniques are applicable.
a) Time-frequency masking using the sound source direction. This is the reference signal generation method used in JP 2014-219467 A. A steering vector corresponding to the sound source direction θ is computed, and the cosine similarity between it and the observation signal vector (equation (2) above) is computed; this yields a mask that keeps sound arriving from direction θ and attenuates sound arriving from other directions. The mask is applied to the amplitude spectrogram of the observation signal, and the signal thus generated is used as the reference signal.
b) Neural-network-based selective listening techniques such as Speaker Beam and Voice Filter. Selective listening here refers to a technique that extracts the voice of one designated speaker from a monaural signal in which multiple voices are mixed. Clean speech of the speaker to be extracted, not mixed with other speakers (the utterance content may differ from that of the mixed speech), is recorded in advance; when the mixed signal and the clean speech are both input to the neural network, the voice of the designated speaker contained in the mixed signal is output. More precisely, a time-frequency mask for generating such a spectrogram is output. When the mask thus output is applied to the amplitude spectrogram of the observation signal, the result can be used as the reference signal of the present embodiment.
Details of Speaker Beam and Voice Filter are described in References 4 and 5 below, respectively.
"Reference 4:
M. Delcroix, K. Zmolikova, K. Kinoshita, A. Ogawa and T. Nakatani, "Single channel target speaker extraction and recognition with speaker beam," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018."
"Reference 5:
Quan Wang, Hannah Muckenhirn, Kevin Wilson, Prashant Sridhar, Zelin Wu, John Hershey, Rif A. Saurous, Ron J. Weiss, Ye Jia, Ignacio Lopez Moreno, "VOICEFILTER: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking," arXiv:1810.04826v3 [eess.AS], 27 Oct 2018. https://arxiv.org/abs/1810.04826"
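To illustrate technique a) above, the following sketch computes a plane-wave steering vector for direction θ, forms a mask from the cosine similarity between that vector and the observation vector at each time-frequency point, and applies the mask to an observation amplitude spectrogram; the uniform linear array geometry, the sound speed, and the use of the plain similarity magnitude as the mask are assumptions made here for the example, not the exact formulation of the cited publication.

import numpy as np

def steering_vector(theta, freq_hz, mic_positions, c=340.0):
    # Plane-wave steering vector (unit norm) for a linear array; positions in metres.
    delays = mic_positions * np.sin(theta) / c
    return np.exp(-2j * np.pi * freq_hz * delays) / np.sqrt(len(mic_positions))

def direction_mask(obs, theta, freqs_hz, mic_positions):
    # obs: observation spectrogram of shape (n_mics, F, T).
    # The mask at (f, t) is |a(f)^H x(f, t)| / ||x(f, t)||, i.e. the cosine similarity
    # between the steering vector for theta and the observation vector.
    n_mics, F, T = obs.shape
    mask = np.zeros((F, T))
    for f in range(F):
        a = steering_vector(theta, freqs_hz[f], mic_positions)
        x = obs[:, f, :]
        mask[f, :] = np.abs(np.conj(a) @ x) / (np.linalg.norm(x, axis=0) + 1e-12)
    return mask

# Hypothetical usage: 4 microphones spaced 5 cm apart, 257 bins up to 4 kHz, 50 frames.
rng = np.random.default_rng(0)
obs = rng.standard_normal((4, 257, 50)) + 1j * rng.standard_normal((4, 257, 50))
mask = direction_mask(obs, np.deg2rad(30.0), np.linspace(0.0, 4000.0, 257), np.arange(4) * 0.05)
reference = mask * np.abs(obs[0])  # mask applied to one channel's amplitude spectrogram
print(reference.shape)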
(Details of the sound source extraction unit)
Next, details of the sound source extraction unit 17 are described with reference to FIG. 8. The sound source extraction unit 17 includes, for example, a pre-processing unit 17A, an extraction filter estimation unit 17B, and a post-processing unit 17C.
The pre-processing unit 17A performs the decorrelation processing shown in equations (3) to (7), that is, decorrelation and related processing on the time-frequency domain observation signal.
The extraction filter estimation unit 17B estimates a filter that extracts a signal in which the target sound is further emphasized. Specifically, the extraction filter estimation unit 17B estimates the extraction filter for sound source extraction and generates the extraction result. More specifically, the extraction filter estimation unit 17B estimates the extraction filter as the solution that optimizes an objective function reflecting the dependence between the reference signal and the extraction result produced by the extraction filter, and the independence between the extraction result and the separation results of the other virtual sound sources.
As described above, the extraction filter estimation unit 17B uses, as the sound source model representing the dependence between the reference signal and the extraction result included in the objective function, one of the following:
- a bivariate spherical distribution of the extraction result and the reference signal;
- a time-frequency-varying variance model that regards the reference signal as a value corresponding to the variance at each time-frequency point;
- a model using a divergence between the absolute value of the extraction result and the reference signal.
A bivariate Laplace distribution may be used as the bivariate spherical distribution. As the time-frequency-varying variance model, any of the time-frequency-varying variance Gaussian distribution, the time-frequency-varying variance Laplace distribution, and the time-frequency-varying variance Student-t distribution may be used. As the divergence of the divergence-based model, any of the following may be used: the Euclidean distance or squared error between the absolute value of the extraction result and the reference signal; the Itakura-Saito distance between the power spectrum of the extraction result and that of the reference signal; the Itakura-Saito distance between the amplitude spectrum of the extraction result and that of the reference signal; or the squared error between 1 and the ratio of the absolute value of the extraction result to the reference signal.
The post-processing unit 17C performs at least the application of the extraction filter to the mixed sound signal. In addition to the rescaling processing described later, the post-processing unit 17C may also perform processing that applies the inverse Fourier transform to the extraction result spectrogram to generate the extraction result waveform.
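As a rough sketch of the filter application performed here, the following applies an already-estimated extraction filter to the decorrelated observation for every frequency bin; the form y_1(f,t) = w_1(f) u(f,t) is assumed from the earlier references to equation (9), and rescaling and the inverse transform, whose concrete forms are described later, are omitted.

import numpy as np

def apply_extraction_filter(w, u):
    # w: (F, n_mics) complex array, one row vector w_1(f) per frequency bin.
    # u: (n_mics, F, T) complex decorrelated observation.  Returns y_1 of shape (F, T).
    F, n_mics = w.shape
    y = np.empty((F, u.shape[2]), dtype=complex)
    for f in range(F):
        y[f, :] = w[f, :] @ u[:, f, :]   # assumed form of equation (9)
    return y

# Hypothetical usage (3 microphones, 129 bins, 40 frames); rescaling and the inverse
# STFT would follow to obtain the extraction result waveform.
rng = np.random.default_rng(0)
u = rng.standard_normal((3, 129, 40)) + 1j * rng.standard_normal((3, 129, 40))
w = rng.standard_normal((129, 3)) + 1j * rng.standard_normal((129, 3))
print(apply_extraction_filter(w, u).shape)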
[Flow of processing performed by the sound source extraction device]
(Overall flow)
Next, the flow of processing (overall flow) performed by the sound source extraction device 100 is described with reference to the flowchart shown in FIG. 9. Unless otherwise noted, the processing described below is performed by the control unit 18.
In step ST11, the AD conversion unit 12 converts the analog observation signal (mixed sound signal) input to the microphones 11 into a digital signal. The observation signal at this point is in the time domain. The process then proceeds to step ST12.
In step ST12, the STFT unit 13 applies the short-time Fourier transform (STFT) to the time-domain observation signal to obtain the observation signal in the time-frequency domain. Input may be performed not only from the microphones but also from a file, a network, or the like as needed. Details of the specific processing performed by the STFT unit 13 are described later. In the present embodiment, since there are a plurality of input channels (one per microphone), AD conversion and STFT are also performed for each channel. The process then proceeds to step ST13.
In step ST13, processing (buffering) is performed in which the observation signal converted into the time-frequency domain by the STFT is accumulated for a predetermined time (a predetermined number of frames). The process then proceeds to step ST14.
In step ST14, the section estimation unit 15 estimates the start time (the time when the sound begins) and the end time (the time when the sound ends) of the target sound. Furthermore, when the device is used in an environment where utterances may overlap, information that identifies which speaker produced the utterance is also estimated. For example, in the usage forms shown in FIGS. 5 and 6, the number of the microphone (sensor) assigned to each speaker is also estimated, and in the usage form shown in FIG. 7, the direction of the utterance is also estimated.
Sound source extraction and the accompanying processing are performed for each section of the target sound. Therefore, the process proceeds to step ST16 only when a section is detected; when no section is detected, steps ST16 to ST19 are skipped and the process proceeds to step ST20.
When a section is detected, in step ST16 the reference signal generation unit 16 generates a rough amplitude spectrogram of the target sound present in that section. The methods usable for generating the reference signal are as described with reference to FIGS. 5 to 7. The process then proceeds to step ST17.
In step ST17, the sound source extraction unit 17 generates the extraction result of the target sound using the reference signal obtained in step ST16 and the observation signal corresponding to the time range of the target sound section. Details of this processing are described later.
In step ST18, it is determined whether the processing of steps ST16 and ST17 is to be repeated a predetermined number of times. The meaning of this iteration is as follows: once the sound source extraction processing has produced an extraction result more accurate than the observation signal or the reference signal, a reference signal is generated again from that extraction result, and performing the sound source extraction processing again with it yields an extraction result that is even more accurate than the previous one.
For example, when the observation signal is input to a neural network to generate the reference signal, if the first extraction result is input to the neural network instead of the observation signal, its output is likely to be more accurate than the first output of the neural network. Therefore, when that output is used as the reference signal to generate a second extraction result, the result is likely to be more accurate than the first, and by further iterating, an even more accurate extraction result can be obtained. Unlike Reference 1, the present embodiment is characterized in that this iteration is performed in the extraction processing rather than in separation processing. Note that this iteration is distinct from the iteration used when estimating the filter by the auxiliary function method or the fixed-point method inside the sound source extraction processing of step ST17. After the processing of step ST18, the process proceeds to step ST19.
In step ST19, post-stage processing is performed by the post-stage processing unit 19 using the extraction result generated in step ST17. Examples of the post-stage processing include speech recognition and, further, response generation for spoken dialogue using the recognition result. The process then proceeds to step ST20.
In step ST20, it is determined whether to continue processing. If so, the process returns to step ST11; if not, the process ends.
(About the STFT)
Next, the short-time Fourier transform performed by the STFT unit 13 is described with reference to FIG. 10. In the present embodiment, the microphone observation signal is a multi-channel signal observed with a plurality of microphones, so the STFT is performed for each channel. The following describes the STFT for the k-th channel.
A fixed length is cut out from the waveform of the microphone recording signal obtained by the AD conversion processing of step ST11, and a window function such as a Hanning window or a Hamming window is applied to it (see FIG. 10A). This cut-out unit is called a frame. By applying the short-time Fourier transform to the data of one frame (see FIG. 10B), x_k(1,t) to x_k(F,t) are obtained as observation signals in the time-frequency domain, where t denotes the frame number and F the total number of frequency bins (see FIG. 10C).
The cut-out frames may overlap one another, which makes the time-frequency-domain signal change smoothly between consecutive frames. In FIG. 10, the data of one frame, x_k(1,t) to x_k(F,t), is written collectively as a single vector x_k(t) (see FIG. 10C). x_k(t) is called a spectrum, and a data structure in which multiple spectra are arranged along the time direction is called a spectrogram.
In FIG. 10C, the horizontal axis represents the frame number and the vertical axis represents the frequency bin number; three spectra 51A, 52A, and 53A are generated from the cut-out observation signals 51, 52, and 53, respectively.
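The framing, windowing, and transform described above can be sketched in Python as follows; the frame length and frame shift are arbitrary example values, and only a single channel k is processed.

```python
import numpy as np

def stft_single_channel(waveform, frame_len=1024, frame_shift=256):
    """Frame the waveform, apply a Hanning window, and take the FFT of each
    frame, producing the spectrogram x_k(f, t) described above.
    The frame length and shift are illustrative values."""
    window = np.hanning(frame_len)
    num_frames = 1 + (len(waveform) - frame_len) // frame_shift
    # F frequency bins per frame (one-sided spectrum), T frames.
    spectrogram = np.empty((frame_len // 2 + 1, num_frames), dtype=complex)
    for t in range(num_frames):
        frame = waveform[t * frame_shift : t * frame_shift + frame_len]
        spectrogram[:, t] = np.fft.rfft(window * frame)   # x_k(1..F, t)
    return spectrogram
```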
(Sound source extraction process)
Next, the sound source extraction process according to this embodiment will be described with reference to the flowchart shown in FIG. 11.
In step ST31, pre-processing is performed by the pre-processing unit 17A. An example of pre-processing is the decorrelation expressed by equations (3) to (6). In addition, some of the update rules used in filter estimation require special handling only in the first iteration; such handling is also performed as pre-processing. The process then proceeds to step ST32.
In step ST32, the extraction filter is estimated, after which the process proceeds to step ST33. Steps ST32 and ST33 form the iteration for estimating the extraction filter. Except when the TFVV Gaussian distribution of equation (32) is used as the source model, the extraction filter cannot be obtained in closed form, so the processing of step ST32 is repeated until the extraction filter and the extraction result converge, or for a predetermined number of iterations.
The extraction filter estimation of step ST32 is the process of obtaining the extraction filter w_1(f); the specific equations differ for each source model.
For example, when the TFVV Gaussian distribution of equation (32) is used as the source model, the weighted covariance matrix on the right-hand side of equation (35) is computed from the reference signal r(f,t) and the decorrelated observation signal u(f,t), and its eigenvectors are then obtained by eigenvalue decomposition. As in equation (36), applying the Hermitian transpose to the eigenvector corresponding to the smallest eigenvalue yields the desired extraction filter w_1(f). This processing is performed for all frequency bins, i.e., f = 1 to F.
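The following is a minimal Python sketch of this update for one frequency bin. Since equation (35) itself is not reproduced in this section, the weighting of the covariance by 1/r(f,t)^2 is an assumption; the eigenvector selection and the Hermitian transpose follow equation (36).

```python
import numpy as np

def estimate_filter_tfvv_gauss(u, r, eps=1e-9):
    """Extraction filter for the TFVV Gaussian model (around eqs. (35)-(36)).

    u : decorrelated observation for one frequency bin f, shape (N, T)
    r : reference signal r(f, t), shape (T,)

    The exact weighting of eq. (35) is not reproduced here; weighting the
    covariance by 1 / r(f,t)^2 is assumed.  The filter is the Hermitian
    transpose of the eigenvector of the smallest eigenvalue, as in eq. (36)."""
    weights = 1.0 / np.maximum(r, eps) ** 2                # assumed weighting
    cov = (u * weights) @ u.conj().T / u.shape[1]          # weighted covariance
    eigvals, eigvecs = np.linalg.eigh(cov)                 # eigenvalues ascending
    v_min = eigvecs[:, 0]                                  # smallest eigenvalue
    return v_min.conj()                                    # w_1(f) = v_min^H
```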
Similarly, when the TFVV Laplace distribution of equation (31) is used as the source model, the auxiliary variable b(f,t) is first computed from the reference signal r(f,t) and the decorrelated observation signal u(f,t) according to equation (40). Next, the weighted covariance matrix on the right-hand side of equation (42) is computed, and eigenvalue decomposition is applied to it to obtain the eigenvectors. Finally, the extraction filter w_1(f) is obtained by equation (36). Since w_1(f) has not yet converged at this point, the process returns to equation (40) and the auxiliary variable is computed again. These steps are repeated until w_1(f) converges, or for a predetermined number of iterations.
Likewise, when the bivariate Laplace distribution of equation (25) is used as the source model, the computation of the auxiliary variable b(f,t) (equation (46)) and the computation of the extraction filter (equations (48) and (36)) are performed alternately.
On the other hand, when a divergence-based model expressed by equation (26) is used as the source model, the update equations corresponding to each model (equations (55) to (60)) and the equation that normalizes the norm to 1 (equation (54)) are performed alternately.
When the extraction filter has converged, or after a predetermined number of iterations, the process proceeds to step ST34.
In step ST34, post-processing is performed by the post-processing unit 17C. In the post-processing, the extraction result is rescaled and, if necessary, an inverse Fourier transform is applied to generate a time-domain waveform. Rescaling is the process of adjusting the scale of the extraction result for each frequency bin. In the extraction filter estimation, the norm of the filter is constrained to 1 so that efficient algorithms can be applied, but the extraction result generated by a filter with this constraint differs in scale from the ideal target sound. The scale of the extraction result is therefore adjusted using the observation signal before decorrelation.
The rescaling procedure is as follows.
First, setting k = 1 in equation (9), the extraction result before rescaling, y_1(f,t), is computed from the converged extraction filter w_1(f). The rescaling coefficient γ(f) can be obtained as the value that minimizes equation (61) below; the specific solution is given by equation (62).
[Equation (61)]
[Equation (62)]
In these equations, x_i(f,t) is the observation signal (before decorrelation) that serves as the rescaling target. How to choose x_i(f,t) is described later. The coefficient γ(f) obtained in this way is multiplied into the extraction result as in equation (63) below. The rescaled extraction result y_1(f,t) corresponds to the component derived from the target sound in the observation signal of the i-th microphone; that is, it is almost equal to the signal that would be observed by the i-th microphone if no sound source other than the target sound existed.
[Equation (63)]
Further, if necessary, the waveform of the extraction result is obtained by applying the inverse Fourier transform to the rescaled result. As described above, the inverse Fourier transform can be omitted depending on the downstream processing.
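A minimal Python sketch of this rescaling step follows. Equations (61) and (62) are not reproduced in this section, so the sketch assumes the usual least-squares form, i.e., γ(f) minimizes the squared error between x_i(f,t) and γ(f)·y_1(f,t); equation (63) is the final multiplication.

```python
import numpy as np

def rescale(y, x_target, eps=1e-12):
    """Rescaling of the extraction result (eqs. (61)-(63)).

    y        : extraction result y_1(f, t) before rescaling, shape (F, T)
    x_target : observation signal x_i(f, t) used as the rescaling target, shape (F, T)

    The least-squares form of eqs. (61)-(62) is an assumption:
    gamma(f) is taken to minimize sum_t |x_i(f,t) - gamma(f) * y_1(f,t)|^2."""
    # Closed-form minimizer of the assumed squared error, per frequency bin.
    gamma = np.sum(np.conj(y) * x_target, axis=1) / (
        np.sum(np.abs(y) ** 2, axis=1) + eps)
    # Eq. (63): multiply the coefficient into the extraction result.
    return gamma[:, None] * y
```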
Here, how to choose the observation signal x_i(f,t) that serves as the rescaling target is explained. The choice depends on how the microphones are installed. In some installations there is a microphone that picks up the target sound strongly. For example, in the installation of FIG. 5, a microphone is assigned to each speaker, so the utterance of speaker i is picked up most strongly by microphone i. The observation signal x_i(f,t) of microphone i can therefore be used as the rescaling target.
In the installation of FIG. 6, the same method is applicable when an air-conduction microphone such as a lavalier (pin) microphone is used as the sensor SE. On the other hand, when a contact microphone such as a bone-conduction microphone is used as the sensor SE, or when a sensor other than a microphone, such as an optical sensor, is used, the signals picked up by those sensors are unsuitable as rescaling targets, so the same method as for FIG. 7, described next, is used.
In the installation of FIG. 7, there is no microphone assigned to each speaker, so the rescaling target has to be found in another way. Below, the case in which the microphones constituting the microphone array are fixed to a single device and the case in which they are installed throughout the space (distributed microphones) are described in turn.
When the microphones are fixed to a single device, the SNR (the power ratio between the target sound and the other signals) of each microphone can be regarded as almost identical. The observation signal of any microphone may therefore be chosen as the rescaling target x_i(f,t).
Alternatively, rescaling using a delay-and-sum beamformer, as used in the technique described in Japanese Patent Application Laid-Open No. 2014-219467, is also applicable. As explained with reference to FIG. 7, when the section detection handles overlapping utterances, the utterance direction θ is estimated at the same time as the utterance section. Using the signals observed by the microphone array and the utterance direction θ, a delay-and-sum operation can generate a signal in which the sound arriving from that direction is enhanced to some extent. Writing the result of the delay-and-sum toward direction θ as z(f, t, θ), the rescaling coefficient is computed by equation (64) below.
[Equation (64)]
When the microphone array consists of distributed microphones, a different method is used. With distributed microphones, the SNR of the observation signal differs from microphone to microphone: it is expected to be high for microphones close to the speaker and low for distant ones. It is therefore desirable to choose, as the rescaling target, the observation signal of a microphone close to the speaker. To this end, rescaling is performed with respect to each microphone's observation signal, and the result with the largest power is adopted.
The power of a rescaled result is determined solely by the absolute value of its rescaling coefficient. The rescaling coefficient is therefore computed for each microphone number i by equation (65) below, the coefficient with the largest absolute value is taken as γ_{max}, and rescaling is performed by equation (66) below.
[Equation (65)]
[Equation (66)]
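The selection among distributed microphones can be sketched as follows, reusing the least-squares form assumed above for the rescaling coefficients; selecting the largest |γ_i(f)| per frequency bin is one possible reading of equations (65) and (66), which are not reproduced in this section.

```python
import numpy as np

def rescale_distributed(y, x_all, eps=1e-12):
    """Rescaling for distributed microphones (around eqs. (65)-(66)).

    y     : extraction result, shape (F, T)
    x_all : observation signals of all microphones, shape (N, F, T)

    The candidate coefficients gamma_i(f) use the same assumed least-squares
    form as in the earlier sketch; the coefficient with the largest absolute
    value is selected per frequency bin (one possible interpretation)."""
    # gamma_i(f) for every microphone i and frequency bin f, shape (N, F).
    gammas = np.sum(np.conj(y)[None] * x_all, axis=2) / (
        np.sum(np.abs(y) ** 2, axis=1)[None] + eps)
    best_mic = np.argmax(np.abs(gammas), axis=0)             # shape (F,)
    gamma_max = gammas[best_mic, np.arange(gammas.shape[1])]
    # Eq. (66): apply the selected coefficient; also report which microphone won.
    return gamma_max[:, None] * y, best_mic
```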
When γ_{max} is determined, it also becomes known which microphone picks up the speaker's utterance most strongly. If the position of each microphone is known, this reveals approximately where in the space the speaker is located, and that information can also be exploited in the downstream processing.
For example, when the downstream processing is spoken dialogue, that is, when the technology of the present disclosure is used in a spoken dialogue system, the system's spoken response can be output from the loudspeaker presumed to be closest to the speaker, or the system's response can be varied according to the speaker's position.
[Effects obtained in this embodiment]
According to this embodiment, for example, the following effects can be obtained.
In the reference-signal-based sound source extraction of this embodiment, the multichannel observation signal of the section in which the target sound is present and a rough amplitude spectrogram of the target sound in that section are input, and the rough amplitude spectrogram is used as the reference signal, so that an extraction result that is more accurate than the reference signal, i.e., closer to the true target sound, is estimated.
In the processing, an objective function is prepared that reflects both the dependence between the reference signal and the extraction result and the independence between the extraction result and the other, virtual separation results, and the extraction filter is obtained as the solution that optimizes it. By using the deflation method employed in blind source separation, the output can be limited to the single source corresponding to the reference signal.
Owing to these features, the method has the following advantages over the prior art.
(1) Compared with blind source separation
Compared with the approach of applying blind source separation to the observation signal to generate multiple separation results and then selecting the one most similar to the reference signal, the method has the following advantages.
・There is no need to generate multiple separation results.
・In principle, in blind source separation the reference signal is used only for selection and does not contribute to the separation accuracy, whereas in the sound source extraction of the present disclosure the reference signal also contributes to improving the extraction accuracy.
(2) Compared with conventional adaptive beamformers
Extraction can be performed even when no observation signal outside the target section is available; that is, it is not necessary to separately prepare an observation signal acquired at a time when only the interfering sounds are present.
(3) Compared with reference-signal-based sound source extraction (e.g., the technique described in JP 2014-219467 A)
・The reference signal in the technique described in JP 2014-219467 A and the like is a time envelope, and the temporal change of the target sound was assumed to be common to all frequency bins. In contrast, the reference signal of this embodiment is an amplitude spectrogram, so an improvement in extraction accuracy can be expected when the temporal change of the target sound differs greatly from one frequency bin to another.
・In the technique described in that document, the reference signal was used only as the initial value of the iteration, so a sound source different from the reference signal could be extracted as a result of the iteration. In this embodiment, by contrast, the reference signal is used throughout the iteration as part of the source model, so the possibility of extracting a source different from the reference signal is small.
(4) Compared with independent deeply learned matrix analysis (IDLMA)
・IDLMA requires a different reference signal for each source, so it could not be applied when an unknown source was present, and it was applicable only when the number of microphones matched the number of sources. In contrast, this embodiment is applicable as long as a reference signal for the single source to be extracted can be prepared.
<Modifications>
Although one embodiment of the present disclosure has been described specifically above, the content of the present disclosure is not limited to the above-described embodiment, and various modifications based on the technical idea of the present disclosure are possible. In the description of the modifications, configurations identical or equivalent to those described above are given the same reference signs, and duplicated descriptions are omitted as appropriate.
(Integration of decorrelation and filter estimation)
For those update rules of the extraction filter that use eigenvalue decomposition, decorrelation and filter estimation can be combined into a single formula by using generalized eigenvalue decomposition. In that case, the processing corresponding to decorrelation can be skipped.
In the following, the derivation of the combined formula is explained using the TFVV Gaussian distribution of equation (32) as an example.
Setting k = 1 in equation (9) and rewriting it gives equation (67) below.
[Equation (67)]
q_1(f) is a filter that generates the extraction result directly from the observation signal before decorrelation (without going through the decorrelated observation signal). Transforming equation (34), which expresses the optimization problem corresponding to the TFVV Gaussian distribution, using equation (67) and equations (3) to (6) gives equation (68), an optimization problem in terms of q_1(f).
[Equation (68)]
This is a constrained minimization problem different from equation (34), but it can be solved with the method of Lagrange multipliers. Writing the Lagrange multiplier as λ and combining the expression to be optimized in equation (68) and the expression representing the constraint into a single objective function gives equation (69) below.
[Equation (69)]
Taking the partial derivative of equation (69) with respect to conj(q_1(f)), setting the result equal to 0, and rearranging gives equation (70).
[Equation (70)]
Equation (70) expresses a generalized eigenvalue problem, and λ is one of its eigenvalues. Further, multiplying both sides of equation (70) by q_1(f) from the left gives equation (71) below.
[Equation (71)]
The right-hand side of equation (71) is exactly the function to be minimized in equation (68). Therefore, the minimum of equation (71) is the smallest of the eigenvalues satisfying equation (70), and the desired extraction filter q_1(f) is the Hermitian transpose of the eigenvector corresponding to that smallest eigenvalue.
Let gev(A, B) denote a function that takes two matrices A and B as arguments, solves the generalized eigenvalue problem for them, and returns all the eigenvectors. Using this function, the eigenvectors of equation (70) can be written as in equation (72) below.
[Equation (72)]
As in equation (36), v_{min}(f), ..., v_{max}(f) in equation (72) are the eigenvectors, and v_{min}(f) is the eigenvector corresponding to the smallest eigenvalue. The extraction filter q_1(f) is the Hermitian transpose of v_{min}(f), as in equation (73).
[Equation (73)]
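A minimal Python sketch of this combined step for one frequency bin follows. The two matrices passed to gev(., .) in equation (72) are not reproduced in this section, so the pair (reference-weighted covariance, plain covariance) of the raw observation signal is an assumption consistent with folding decorrelation into the TFVV Gaussian update; the final step follows equation (73).

```python
import numpy as np
from scipy.linalg import eigh

def estimate_filter_gev_tfvv_gauss(x, r, eps=1e-9):
    """Direct filter q_1(f) via a generalized eigenvalue problem
    (around eqs. (70), (72), (73)) for one frequency bin.

    x : observation signal before decorrelation, shape (N, T)
    r : reference signal r(f, t), shape (T,)

    The matrix pair is assumed: a covariance weighted by 1/r(f,t)^2 and the
    plain observation covariance."""
    T = x.shape[1]
    weights = 1.0 / np.maximum(r, eps) ** 2
    A = (x * weights) @ x.conj().T / T      # assumed weighted covariance
    B = x @ x.conj().T / T                  # assumed observation covariance
    # Generalized Hermitian eigenproblem A v = lambda B v, eigenvalues ascending.
    eigvals, eigvecs = eigh(A, B)
    v_min = eigvecs[:, 0]                   # eigenvector of the smallest eigenvalue
    return v_min.conj()                     # q_1(f) = v_min^H, eq. (73)
```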
Similarly, when the TFVV Laplace distribution of equation (31) is used as the source model, equations (74) and (75) are obtained.
[Equation (74)]
[Equation (75)]
That is, the auxiliary variable b(f,t) is computed by equation (74), and the eigenvectors corresponding to the two matrices are then obtained by equation (75); the extraction filter q_1(f) is the Hermitian transpose of the eigenvector v_{min}(f) corresponding to the smallest eigenvalue (equation (73)). Since q_1(f) does not converge in a single pass, equations (74), (75), and (73) are executed until convergence or for a predetermined number of iterations.
The case in which the TFVV Student-t distribution of equation (33) is used as the source model and the case in which the bivariate Laplace distribution of equation (25) is used share some of the derived equations, so they are explained together. The equation for computing the auxiliary variable b(f,t) differs between the two: equation (76) below is used for the TFVV Student-t distribution and equation (77) below for the bivariate Laplace distribution.
[Equation (76)]
[Equation (77)]
On the other hand, both cases use equations (78) and (73) below to obtain the extraction filter q_1(f). As with the other models, q_1(f) does not converge in a single pass, so the iteration is repeated a predetermined number of times.
[Equation (78)]
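The common loop structure shared by these models can be sketched as follows. The auxiliary-variable formulas (74), (76), and (77) are not reproduced in this section, so they appear as a placeholder callable aux_variable, and the weighting of the covariance by 1/b(f,t) is likewise an assumption; the generalized eigenvalue decomposition and the Hermitian transpose of v_{min} follow equations (75)/(78) and (73).

```python
import numpy as np
from scipy.linalg import eigh

def estimate_filter_gev_iterative(x, r, aux_variable, num_iters=10, eps=1e-9):
    """Iterative generalized-eigenvalue filter update used for the TFVV Laplace,
    TFVV Student-t and bivariate Laplace models (one frequency bin).

    `aux_variable(x, r, q)` is a placeholder for eq. (74), (76) or (77) and must
    return b(f, t) with shape (T,); on the first pass q is None, so any needed
    initialization is left to this placeholder."""
    T = x.shape[1]
    B = x @ x.conj().T / T                       # assumed observation covariance
    q = None
    for _ in range(num_iters):
        b = aux_variable(x, r, q)                # eq. (74)/(76)/(77), placeholder
        weights = 1.0 / np.maximum(b, eps)       # assumed weighting by b(f, t)
        A = (x * weights) @ x.conj().T / T       # weighted covariance (eq. (75)/(78))
        _, eigvecs = eigh(A, B)                  # generalized EVD, ascending order
        q = eigvecs[:, 0].conj()                 # q_1(f) = v_min^H, eq. (73)
    return q
```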
[Other modifications]
The configurations, methods, steps, shapes, materials, numerical values, and the like given in the embodiment and modifications described above are merely examples; different configurations, methods, steps, shapes, materials, numerical values, and the like may be used as necessary, and they may be replaced with known ones. The configurations, methods, steps, shapes, materials, numerical values, and the like of the embodiment and the modifications can also be combined with one another as long as no technical contradiction arises.
Note that the content of the present disclosure is not to be interpreted as being limited by the effects exemplified in this specification.
The present disclosure may also adopt the following configurations.
(1)
A signal processing device comprising:
a reference signal generation unit to which a mixed sound signal, recorded by microphones arranged at different positions and containing a mixture of a target sound and sounds other than the target sound, is input, and which generates a reference signal corresponding to the target sound on the basis of the mixed sound signal; and
a sound source extraction unit that extracts, from the mixed sound signal, a signal that is similar to the reference signal and in which the target sound is further emphasized.
(2)
The signal processing device according to (1), further comprising a section detection unit that detects a section of the mixed sound signal in which the target sound is included.
(3)
The signal processing device according to (1) or (2), wherein the sound source extraction unit includes an extraction filter estimation unit that estimates a filter for extracting the signal in which the target sound is further emphasized.
(4)
The signal processing device according to (3), wherein the extraction filter estimation unit estimates the filter as a solution that optimizes an objective function reflecting both the dependence between the reference signal and the extraction result produced by the filter and the independence between the extraction result and the separation results of other virtual sound sources.
(5)
The signal processing device according to (4), wherein, as the source model included in the objective function and expressing the dependence between the reference signal and the extraction result, one of the following is used:
・a bivariate spherical distribution of the extraction result and the reference signal;
・a time-frequency-varying variance model that regards the reference signal as a value corresponding to the variance at each time-frequency point; or
・a model using a divergence between the absolute value of the extraction result and the reference signal.
(6)
The signal processing device according to (5), wherein a bivariate Laplace distribution is used as the bivariate spherical distribution.
(7)
The signal processing device according to (5), wherein one of the following is used as the time-frequency-varying variance model:
・a time-frequency-varying variance Gaussian distribution;
・a time-frequency-varying variance Laplace distribution; or
・a time-frequency-varying variance Student-t distribution.
(8)
The signal processing device according to (5), wherein one of the following is used as the divergence of the model using the divergence:
・the Euclidean distance or squared error between the absolute value of the extraction result and the reference signal;
・the Itakura-Saito distance between the power spectrum of the extraction result and the power spectrum of the reference signal;
・the Itakura-Saito distance between the amplitude spectrum of the extraction result and the amplitude spectrum of the reference signal; or
・the squared error between 1 and the ratio of the absolute value of the extraction result to the reference signal.
(9)
The signal processing device according to any one of (3) to (8), wherein the sound source extraction unit includes a pre-processing unit that performs decorrelation processing on the time-frequency-domain observation signal as pre-processing for the processing by the extraction filter estimation unit, and a post-processing unit that performs at least the processing of applying the filter to the mixed sound signal.
(10)
The signal processing device according to any one of (1) to (9), wherein the reference signal generation unit includes a neural network that receives a signal in which voices are mixed together and clean speech of a predetermined speaker acquired at a timing different from that signal and extracts the speech of that speaker, and the reference signal generation unit inputs the mixed sound signal and the clean speech to the neural network and generates, as the reference signal, an amplitude spectrogram generated from the output of the neural network.
(11)
The signal processing device according to any one of (1) to (9), wherein the reference signal generation unit estimates the direction of arrival of the target sound, generates a time-frequency mask that has the effect of keeping sound arriving from a predetermined direction while reducing sound arriving from the other directions, and generates, as the reference signal, the amplitude spectrogram obtained by applying the time-frequency mask to the amplitude spectrogram of the mixed sound signal.
(12)
The signal processing device according to any one of (1) to (11), wherein the reference signal generation unit generates the reference signal using a sensor different from the microphones.
(13)
The signal processing device according to any one of (1) to (12), wherein the reference signal generation unit generates a reference signal by inputting the extraction result produced by the filter estimated by the extraction filter estimation unit into a neural network.
(14)
The signal processing device according to any one of (1) to (13), wherein the microphones are microphones assigned to individual speakers.
(15)
The signal processing device according to (14), wherein the microphones are microphones worn by the speakers.
(16)
A signal processing method in which a mixed sound signal, recorded by microphones arranged at different positions and containing a mixture of a target sound and sounds other than the target sound, is input, a reference signal generation unit generates a reference signal corresponding to the target sound on the basis of the mixed sound signal, and a sound source extraction unit extracts, from the mixed sound signal, a signal that is similar to the reference signal and in which the target sound is further emphasized.
(17)
A program that causes a computer to execute a signal processing method in which a mixed sound signal, recorded by microphones arranged at different positions and containing a mixture of a target sound and sounds other than the target sound, is input, a reference signal generation unit generates a reference signal corresponding to the target sound on the basis of the mixed sound signal, and a sound source extraction unit extracts, from the mixed sound signal, a signal that is similar to the reference signal and in which the target sound is further emphasized.
15 ... Section estimation unit
16 ... Reference signal estimation unit
17 ... Sound source extraction unit
17A ... Pre-processing unit
17B ... Extraction filter estimation unit
17C ... Post-processing unit
20 ... Control unit
100 ... Sound source extraction device

Claims (17)

  1.  A signal processing device comprising:
      a reference signal generation unit to which a mixed sound signal, recorded by microphones arranged at different positions and containing a mixture of a target sound and sounds other than the target sound, is input, and which generates a reference signal corresponding to the target sound on the basis of the mixed sound signal; and
      a sound source extraction unit that extracts, from the mixed sound signal, a signal that is similar to the reference signal and in which the target sound is further emphasized.
  2.  The signal processing device according to claim 1, further comprising a section detection unit that detects a section of the mixed sound signal in which the target sound is included.
  3.  The signal processing device according to claim 1, wherein the sound source extraction unit includes an extraction filter estimation unit that estimates a filter for extracting the signal in which the target sound is further emphasized.
  4.  The signal processing device according to claim 3, wherein the extraction filter estimation unit estimates the filter as a solution that optimizes an objective function reflecting both the dependence between the reference signal and the extraction result produced by the filter and the independence between the extraction result and the separation results of other virtual sound sources.
  5.  The signal processing device according to claim 4, wherein, as the source model included in the objective function and expressing the dependence between the reference signal and the extraction result, one of the following is used:
      ・a bivariate spherical distribution of the extraction result and the reference signal;
      ・a time-frequency-varying variance model that regards the reference signal as a value corresponding to the variance at each time-frequency point; or
      ・a model using a divergence between the absolute value of the extraction result and the reference signal.
  6.  The signal processing device according to claim 5, wherein a bivariate Laplace distribution is used as the bivariate spherical distribution.
  7.  The signal processing device according to claim 5, wherein one of the following is used as the time-frequency-varying variance model:
      ・a time-frequency-varying variance Gaussian distribution;
      ・a time-frequency-varying variance Laplace distribution; or
      ・a time-frequency-varying variance Student-t distribution.
  8.  The signal processing device according to claim 5, wherein one of the following is used as the divergence of the model using the divergence:
      ・the Euclidean distance or squared error between the absolute value of the extraction result and the reference signal;
      ・the Itakura-Saito distance between the power spectrum of the extraction result and the power spectrum of the reference signal;
      ・the Itakura-Saito distance between the amplitude spectrum of the extraction result and the amplitude spectrum of the reference signal; or
      ・the squared error between 1 and the ratio of the absolute value of the extraction result to the reference signal.
  9.  The signal processing device according to claim 3, wherein the sound source extraction unit includes:
      a pre-processing unit that performs decorrelation processing on the time-frequency-domain observation signal as pre-processing for the processing by the extraction filter estimation unit; and
      a post-processing unit that performs at least the processing of applying the filter to the mixed sound signal.
  10.  The signal processing device according to claim 1, wherein the reference signal generation unit includes a neural network that receives a signal in which voices are mixed together and clean speech of a predetermined speaker acquired at a timing different from that signal and extracts the speech of that speaker, and the reference signal generation unit inputs the mixed sound signal and the clean speech to the neural network and generates, as the reference signal, an amplitude spectrogram generated from the output of the neural network.
  11.  The signal processing device according to claim 1, wherein the reference signal generation unit estimates the direction of arrival of the target sound, generates a time-frequency mask that has the effect of keeping sound arriving from a predetermined direction while reducing sound arriving from the other directions, and generates, as the reference signal, the amplitude spectrogram obtained by applying the time-frequency mask to the amplitude spectrogram of the mixed sound signal.
  12.  The signal processing device according to claim 1, wherein the reference signal generation unit generates the reference signal using a sensor different from the microphones.
  13.  The signal processing device according to claim 1, wherein the reference signal generation unit generates a reference signal by inputting the extraction result produced by the filter estimated by the extraction filter estimation unit into a neural network.
  14.  The signal processing device according to claim 1, wherein the microphones are microphones assigned to individual speakers.
  15.  The signal processing device according to claim 14, wherein the microphones are microphones worn by the speakers.
  16.  A signal processing method in which a mixed sound signal, recorded by microphones arranged at different positions and containing a mixture of a target sound and sounds other than the target sound, is input, a reference signal generation unit generates a reference signal corresponding to the target sound on the basis of the mixed sound signal, and a sound source extraction unit extracts, from the mixed sound signal, a signal that is similar to the reference signal and in which the target sound is further emphasized.
  17.  A program that causes a computer to execute a signal processing method in which a mixed sound signal, recorded by microphones arranged at different positions and containing a mixture of a target sound and sounds other than the target sound, is input, a reference signal generation unit generates a reference signal corresponding to the target sound on the basis of the mixed sound signal, and a sound source extraction unit extracts, from the mixed sound signal, a signal that is similar to the reference signal and in which the target sound is further emphasized.
PCT/JP2021/009764 2020-03-25 2021-03-11 Signal processing device, signal processing method, and program WO2021193093A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2020-053542 2020-03-25
JP2020053542A JP2021152623A (en) 2020-03-25 2020-03-25 Signal processing device, signal processing method and program

Publications (1)

Publication Number Publication Date
WO2021193093A1 true WO2021193093A1 (en) 2021-09-30

Family

ID=77887359

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/009764 WO2021193093A1 (en) 2020-03-25 2021-03-11 Signal processing device, signal processing method, and program

Country Status (2)

Country Link
JP (1) JP2021152623A (en)
WO (1) WO2021193093A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023127058A1 (en) * 2021-12-27 2023-07-06 日本電信電話株式会社 Signal filtering device, signal filtering method, and program
CN115775564B (en) * 2023-01-29 2023-07-21 北京探境科技有限公司 Audio processing method, device, storage medium and intelligent glasses


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011199474A (en) * 2010-03-18 2011-10-06 Hitachi Ltd Sound source separation device, sound source separating method and program for the same, video camera apparatus using the same and cellular phone unit with camera
JP2012234150A (en) * 2011-04-18 2012-11-29 Sony Corp Sound signal processing device, sound signal processing method and program

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DELCROIX, MARC ET AL.: "SINGLE CHANNEL TARGET SPEAKER EXTRACTION AND RECOGNITION WITH SPEAKER BEAM", PROC. ICASSP 2018, April 2018 (2018-04-01), pages 5554 - 5558, XP033401925, DOI: 10.1109/ICASSP.2018.8462661 *
KITAMURA DAICHI, SUMINO HAYATO, TAKAMUNE NORIHIRO, TAKAMICHI SHINNOSUKE, SARUWATARI HIROSHI, ONO NOBUTAKA: "Experimental Evaluation of Multichannel Audio Source Separation Based on IDLMA", IEICE TECHNICAL REPORT, 12 March 2018 (2018-03-12), pages 13 - 20, XP055858824, Retrieved from the Internet <URL:http://d-kitamura.net/pdf/paper/Kitamura2018EA03.pdf> [retrieved on 20211108] *

Also Published As

Publication number Publication date
JP2021152623A (en) 2021-09-30

Similar Documents

Publication Publication Date Title
JP7191793B2 (en) SIGNAL PROCESSING DEVICE, SIGNAL PROCESSING METHOD, AND PROGRAM
US9357298B2 (en) Sound signal processing apparatus, sound signal processing method, and program
US7533015B2 (en) Signal enhancement via noise reduction for speech recognition
US9668066B1 (en) Blind source separation systems
JP5230103B2 (en) Method and system for generating training data for an automatic speech recognizer
EP1993320B1 (en) Reverberation removal device, reverberation removal method, reverberation removal program, and recording medium
US8880396B1 (en) Spectrum reconstruction for automatic speech recognition
JP2011215317A (en) Signal processing device, signal processing method and program
JP2012234150A (en) Sound signal processing device, sound signal processing method and program
WO2021193093A1 (en) Signal processing device, signal processing method, and program
US8666737B2 (en) Noise power estimation system, noise power estimating method, speech recognition system and speech recognizing method
US8401844B2 (en) Gain control system, gain control method, and gain control program
Nesta et al. Blind source extraction for robust speech recognition in multisource noisy environments
Delcroix et al. Speech recognition in living rooms: Integrated speech enhancement and recognition system based on spatial, spectral and temporal modeling of sounds
KR20220022286A (en) Method and apparatus for extracting reverberant environment embedding using dereverberation autoencoder
Nesta et al. Robust Automatic Speech Recognition through On-line Semi Blind Signal Extraction
EP3847645B1 (en) Determining a room response of a desired source in a reverberant environment
Kulkarni et al. A review of speech signal enhancement techniques
EP1216527B1 (en) Apparatus and method for de-esser using adaptive filtering algorithms
Ishii et al. Blind noise suppression for Non-Audible Murmur recognition with stereo signal processing
JP3916834B2 (en) Extraction method of fundamental period or fundamental frequency of periodic waveform with added noise
US20240155290A1 (en) Signal processing apparatus, signal processing method, and program
US20220189498A1 (en) Signal processing device, signal processing method, and program
CN110675890B (en) Audio signal processing device and audio signal processing method
Acero et al. Speech/noise separation using two microphones and a VQ model of speech signals.

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21774220

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21774220

Country of ref document: EP

Kind code of ref document: A1