WO2021193093A1 - Signal processing device, signal processing method, and program - Google Patents

Signal processing device, signal processing method, and program

Info

Publication number
WO2021193093A1
WO2021193093A1 (PCT/JP2021/009764)
Authority
WO
WIPO (PCT)
Prior art keywords
signal
sound
reference signal
extraction
equation
Prior art date
Application number
PCT/JP2021/009764
Other languages
French (fr)
Japanese (ja)
Inventor
Atsuo Hiroe
Original Assignee
Sony Group Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony Group Corporation
Publication of WO2021193093A1 publication Critical patent/WO2021193093A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • G10L21/0308Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals

Definitions

  • the present disclosure relates to signal processing devices, signal processing methods and programs.
  • A technique has been proposed for extracting a target sound from a mixed sound signal in which a sound to be extracted (hereinafter referred to as the target sound) and a sound to be removed (hereinafter referred to as the disturbing sound) are mixed (see, for example, Patent Documents 1 to 3 below).
  • One of the purposes of the present disclosure is to provide a signal processing device, a signal processing method, a program, and a signal processing system with improved accuracy of extracting a target sound.
  • The present disclosure is, for example, a signal processing device to which a mixed sound signal, recorded by microphones placed at different positions and containing a mixture of the target sound and sounds other than the target sound, is input,
  • the device having a reference signal generation unit that generates a reference signal corresponding to the target sound based on the mixed sound signal, and a sound source extraction unit that extracts, from the mixed sound signal, a signal that is similar to the reference signal and in which the target sound is further emphasized.
  • The present disclosure is also, for example, a signal processing method in which a mixed sound signal, recorded by microphones placed at different positions and containing a mixture of the target sound and sounds other than the target sound, is input,
  • a reference signal generation unit generates a reference signal corresponding to the target sound based on the mixed sound signal, and
  • a sound source extraction unit extracts, from the mixed sound signal, a signal that is similar to the reference signal and in which the target sound is further emphasized.
  • The present disclosure is also, for example, a program that causes a computer to execute a signal processing method in which a mixed sound signal, recorded by microphones placed at different positions and containing a mixture of the target sound and sounds other than the target sound, is input,
  • a reference signal generation unit generates a reference signal corresponding to the target sound based on the mixed sound signal, and
  • a sound source extraction unit extracts, from the mixed sound signal, a signal that is similar to the reference signal and in which the target sound is further emphasized.
  • FIG. 1 is a diagram for explaining an example of the sound source separation process of the present disclosure.
  • FIG. 2 is a diagram for explaining an example of a sound source extraction method using a reference signal based on the deflation method.
  • FIG. 3 is a diagram referred to when explaining a process of extracting a sound source after generating a reference signal for each section.
  • FIG. 4 is a block diagram showing a configuration example of the sound source extraction device according to the embodiment.
  • FIG. 5 is a diagram referred to when explaining an example of interval estimation and reference signal generation processing.
  • FIG. 6 is a diagram referred to when explaining other examples of interval estimation and reference signal generation processing.
  • FIG. 7 is a diagram referred to when explaining other examples of interval estimation and reference signal generation processing.
  • FIG. 8 is a diagram referred to when explaining the details of the sound source extraction unit according to the embodiment.
  • FIG. 9 is a flowchart referred to when explaining the flow of the entire processing performed by the sound source extraction device according to the embodiment.
  • FIG. 10 is a diagram referred to when explaining the process performed by the STFT unit according to the embodiment.
  • FIG. 11 is a flowchart referred to when explaining the flow of the sound source extraction process according to the embodiment.
  • "_" represents a subscript. For example, in X_k, "k" is a subscript, and in R_{xx}, "xx" is a subscript.
  • "^" represents a superscript.
  • the present disclosure is sound source extraction using a reference signal (reference).
  • In the present disclosure, in addition to recording, with a plurality of microphones, a signal in which the sound to be extracted (target sound) and the sound to be removed (disturbing sound) are mixed, a "rough" amplitude spectrogram corresponding to the target sound is generated and used as a reference signal.
  • The present disclosure produces an extraction result that is similar to the reference signal but has higher accuracy. That is, one form of the present disclosure is a signal processing device that extracts, from the mixed sound signal, a signal that is similar to the reference signal and in which the target sound is further emphasized.
  • an objective function that reflects both the dependency (similarity) between the reference signal and the extraction result and the independence between the extraction result and other virtual separation results is prepared.
  • the output signal can be limited to one sound source corresponding to the reference signal. Since it can be regarded as a beamformer that considers both dependence and independence, it is appropriately referred to as Similarity-and-Independence-aware Beamformer (SIBF) below.
  • SIBF: Similarity-and-Independence-aware Beamformer
  • In other words, the present disclosure performs sound source extraction using a reference signal: by using a rough amplitude spectrogram of the target sound as the reference signal, it produces extraction results that are similar to, and more accurate than, the reference signal.
  • the usage situation assumed in the present disclosure shall satisfy all of the following conditions (1) to (3), for example.
  • (1) Observation signals are recorded synchronously by a plurality of microphones.
  • each microphone may or may not be fixed, and the position of each microphone and sound source may be unknown in either case.
  • An example of a fixed microphone is a microphone array, and an example of a non-fixed microphone is a case where each speaker wears a pin microphone or the like.
  • (2) The section in which the target sound is present is known; for example, in the case of extracting the voice of a specific speaker, this is the utterance section. (3) It is unknown whether or not the target sound is present outside that section. That is, the assumption that the target sound does not exist outside the section may not hold.
  • The rough target sound spectrogram means a spectrogram in which the spectrogram of the true target sound is degraded because it meets one or more of conditions a) to e) below:
    a) It is real-valued data that does not include phase information.
    b) Although the target sound is predominant, the disturbing sound is also included.
    c) The disturbing sound is almost eliminated, but the sound is distorted as a side effect.
    d) The resolution is lower than that of the true target sound spectrogram in the time direction, the frequency direction, or both.
    e) The amplitude scale of the spectrogram differs from that of the observed signal, so that comparing magnitudes is meaningless.
  • For example, even if the amplitude of the rough target sound spectrogram is half the amplitude of the observed signal spectrogram, this does not mean that the target sound and the disturbing sound are included in the observed signal at the same magnitude.
  • The rough amplitude spectrogram (including one generated from a signal other than sound) is acquired or generated by, for example, the following methods. -Record the sound with a microphone installed near the target sound source (for example, a pin microphone attached to the speaker) and obtain the amplitude spectrogram from it.
  • -A neural network that extracts a specific type of sound in the amplitude spectrogram region is learned in advance, and an observation signal is input to the neural network (NN).
  • -Amplitude spectrogram is obtained from a signal acquired by a sensor other than the normally used air conduction microphone such as a bone conduction microphone.
  • -A spectrogram in the linear frequency domain is generated by applying a predetermined conversion to the spectrogram-equivalent data calculated in the non-linear frequency domain such as the mel frequency.
  • One object of the present disclosure is to use the rough target sound spectrogram acquired or generated in this way as a reference signal and to generate an extraction result whose accuracy exceeds the reference signal (in which the target sound is further emphasized, in other words, which is closer to the true target sound). More specifically, in a sound source extraction process that applies a linear filter to a multi-channel observation signal to generate an extraction result, the object is to estimate a linear filter that generates an extraction result whose accuracy exceeds that of the reference signal (that is closer to the true target sound).
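  • As an informal illustration of this kind of linear filtering (not part of the patent text; array shapes and names are assumptions), the following sketch applies a per-frequency linear filter to a multi-channel observation spectrogram to obtain a single-channel extraction result.
```python
import numpy as np

def apply_extraction_filter(X, W):
    """Apply a per-frequency linear filter to a multi-channel observation.

    X: observation spectrograms, shape (N_mics, F, T), complex
    W: extraction filters, shape (F, N_mics), complex (one filter per frequency bin)
    Returns the extraction result, shape (F, T), complex.
    """
    N, F, T = X.shape
    Y = np.empty((F, T), dtype=complex)
    for f in range(F):
        # y(f, t) = w(f)^H x(f, t) for every frame t
        Y[f, :] = W[f].conj() @ X[:, f, :]
    return Y
```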
  • the reason for estimating the linear filter for the sound source extraction process is to enjoy the following advantages of the linear filter.
  • Advantage 1 The distortion of the extraction result is small compared to the non-linear extraction process. Therefore, when combined with voice recognition or the like, it is possible to avoid a decrease in recognition accuracy due to distortion.
  • Advantage 2 The phase of the extraction result can be appropriately estimated by the rescaling process described later. Therefore, it is possible to avoid a problem caused by an inappropriate phase when combined with a phase-dependent post-stage processing (including a case where the extraction result is reproduced as a sound and a human hears it).
  • Advantage 3 By increasing the number of microphones, it is easy to improve the extraction accuracy.
  • The adaptive beamformer referred to here is a method of adaptively estimating a linear filter for extracting a target sound by using the signals observed by a plurality of microphones together with information indicating which sound source is to be extracted as the target sound.
  • Examples of the adaptive beam former include the methods described in JP-A-2012-234150 and JP-A-2006-072163.
  • SN ratio: Signal-to-Noise Ratio
  • The GEV beamformer is an adaptive beamformer that can be used even when the placement of the microphones and the direction of the target sound are unknown.
  • The SN-ratio-maximizing beamformer is a method for finding a linear filter that maximizes the ratio V_s / V_n of a) to b) below. a) The variance V_s of the result of applying a predetermined linear filter to a section in which only the target sound is present. b) The variance V_n of the result of applying the same linear filter to a section in which only the disturbing sound is present.
  • In this method, a linear filter can be estimated as long as each of these sections can be detected; information on the placement of the microphones or the direction of the target sound is not needed.
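  • As a minimal sketch of this idea (illustrative only, not the method of the present disclosure; it assumes the variance ratio is maximized by solving a generalized eigenvalue problem on the two covariance matrices), one frequency bin could be handled as follows.
```python
import numpy as np
from scipy.linalg import eigh

def max_snr_filter(X_target, X_noise):
    """Estimate a filter maximizing V_s / V_n for one frequency bin.

    X_target: frames where only the target sound is present, shape (N_mics, T_s), complex
    X_noise:  frames where only the disturbing sound is present, shape (N_mics, T_n), complex
    """
    R_s = X_target @ X_target.conj().T / X_target.shape[1]  # target-section covariance
    R_n = X_noise @ X_noise.conj().T / X_noise.shape[1]     # noise-section covariance
    # Generalized eigenvalue problem R_s v = lambda R_n v; the eigenvector with the
    # largest eigenvalue maximizes the variance ratio V_s / V_n.
    eigvals, eigvecs = eigh(R_s, R_n)
    return eigvecs[:, -1]  # filter, applied as w^H x
```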
  • Blind sound source separation is a technology that estimates each sound source from a signal in which a plurality of sound sources are mixed, using only the signals observed by a plurality of microphones (without using information such as the direction of the sound sources or the arrangement of the microphones).
  • An example of such a technique is the technique of Japanese Patent No. 4449871.
  • The technology of Japanese Patent No. 4449871 is an example of a technique called Independent Component Analysis (hereinafter referred to as ICA); ICA decomposes the signals observed by N microphones into N sound sources.
  • the observation signal used at that time may include a section in which the target sound is sounding, and information on a section in which only the target sound or only the disturbing sound is sounding is unnecessary.
  • After converting each separation result into an amplitude spectrogram, the squared error (Euclidean distance) between each amplitude spectrogram and the reference signal is calculated, and the separation result whose amplitude spectrogram minimizes that error may be adopted.
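  • This selection step could be sketched as follows (illustrative only; variable names are assumptions): convert each separation result to an amplitude spectrogram and keep the one closest to the reference.
```python
import numpy as np

def select_by_reference(Y_list, R):
    """Pick the separation result whose amplitude spectrogram is closest to the reference.

    Y_list: list of N complex separation-result spectrograms, each of shape (F, T)
    R: reference amplitude spectrogram, shape (F, T), non-negative real
    """
    errors = [np.sum((np.abs(Y) - R) ** 2) for Y in Y_list]  # squared Euclidean distance
    return Y_list[int(np.argmin(errors))]
```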
  • However, this method of selecting after separation has the following problems. 1) Although only one sound source is desired, N sound sources are generated as an intermediate step, which is disadvantageous in terms of calculation cost and memory usage. 2) The rough target sound spectrogram, which is the reference signal, is used only in the step of selecting one sound source out of the N sound sources and is not used in the step of separating into N sound sources. Therefore, the reference signal does not contribute to improving the extraction accuracy.
  • IDLMA: Independent Deeply Learned Matrix Analysis
  • The feature of IDLMA is that neural networks (NNs) that generate the power spectrogram (the square of the amplitude spectrogram) of each sound source to be separated are learned in advance. For example, when it is desired to separate the part of each musical instrument from music in which a plurality of instruments are played at the same time, NNs that take the music as input and output the sound of each instrument are learned in advance. At separation time, the observation signal is input to each NN, and the output power spectrograms are used as reference signals to perform the separation. Therefore, compared with a completely blind separation process, the separation accuracy can be expected to improve to the extent that the reference signals are used.
  • However, IDLMA requires N different power spectrograms as reference signals in order to generate N separation results. Therefore, even if there is only one sound source of interest and the other sound sources are not needed, reference signals must be prepared for all sound sources, which can be difficult in practice. Furthermore, the above-mentioned Document 1 mentions only the case where the number of microphones and the number of sound sources match, and does not mention how many reference signals should be prepared when the two numbers do not match. In addition, since IDLMA is a sound source separation method, using it for sound source extraction requires first generating N separation results and then keeping only one sound source; the problem of sound source separation being wasteful in terms of calculation cost and memory usage therefore remains.
  • Examples of the sound source extraction using the time envelope as a reference signal include the techniques described in Japanese Patent Application Laid-Open No. 2014-219467 proposed by the present inventor.
  • This method estimates a linear filter using a reference signal and a multi-channel observation signal, as in the present disclosure.
  • However, the reference signal is a time envelope, not a spectrogram. This corresponds to a rough target sound spectrogram that has been flattened by an operation such as averaging in the frequency direction. Therefore, if the target sound has the characteristic that its change over time differs for each frequency, the reference signal cannot express this properly, and the extraction accuracy may decrease as a result.
  • In addition, the reference signal is reflected only as an initial value in the iterative process for obtaining the extraction filter. Since the reference signal imposes no constraint from the second iteration onward, a sound source different from the reference signal may be extracted. For example, if a sound occurs only for a moment within the section, extracting it may be more optimal in terms of the objective function, so an undesired sound may be extracted depending on the number of iterations.
  • As described above, each of the above-mentioned techniques has the problem that it is difficult to use in situations to which the present disclosure applies, or that an extraction result with sufficient accuracy cannot be obtained.
  • Element 1 In the process of separation, prepare an objective function that reflects not only the independence of the separation results but also the dependency between one of the separation results and the reference signal, and optimize it.
  • Element 2 Similarly, in the separation process, a method called the deflation method, which separates sound sources one by one, is introduced. Then, the separation process is terminated when the first sound source is separated.
  • the sound source extraction technology of the present disclosure extracts one desired sound source from multi-channel observation signals observed by a plurality of microphones by applying an extraction filter which is a linear filter. Therefore, it can be regarded as a kind of beam former (BF). In the extraction process, both the similarity between the reference signal and the extraction result and the independence between the extraction result and other separation results are reflected. Therefore, the sound source extraction method of the present disclosure is appropriately referred to as Similarity-and-Independence-aware Beamformer: SIBF.
  • SIBF Similarity-and-Independence-aware Beamformer
  • the separation process of the present disclosure will be described with reference to FIG.
  • The frame marked (1-1) represents the separation process assumed in conventional time-frequency-domain independent component analysis (Japanese Patent No. 4449871 and the like), and (1-5) and (1-6), which exist outside that frame, are elements added in the present disclosure.
  • the conventional time-frequency domain blind sound source separation will be described first using the frame of (1-1), and then the separation process of the present disclosure will be described.
  • X_1 to X_N are observation signal spectrograms (1-2) corresponding to N microphones, respectively. These are complex data and are generated by applying the short-time Fourier transform described later to the waveform of the sound observed by each microphone.
  • the vertical axis represents frequency and the horizontal axis represents time. The time length shall be the same as or longer than the length of the target sound to be extracted.
  • The separation result spectrograms Y_1 to Y_N (1-4) are generated by multiplying this observation signal spectrogram by a predetermined square matrix called the separation matrix (1-3).
  • the number of separation result spectrograms is N, which is the same as the number of microphones.
  • The value of the separation matrix is determined so that Y_1 to Y_N are statistically independent (that is, so that Y_1 to Y_N differ from one another as much as possible). Since such a matrix cannot be obtained in a single step, an objective function that reflects the independence of the separation result spectrograms is prepared, and a separation matrix that makes that function optimal (maximum or minimum, depending on the nature of the objective function) is found iteratively. After the separation matrix and the separation result spectrograms have been obtained, applying the inverse Fourier transform to each separation result spectrogram to generate waveforms yields signals that estimate the individual sound sources before mixing.
  • the reference signal is a rough amplitude spectrogram of the target sound and is generated by the reference signal generation unit marked with (1-5).
  • In the present disclosure, in contrast, the separation matrix is determined by also considering the dependency between Y_1, which is one of the separation result spectrograms, and the reference signal R. That is, a separation matrix that optimizes the objective function is obtained, with the objective function reflecting both of the following: a) the independence among Y_1 to Y_N (solid line L1), and b) the dependency between Y_1 and R (dotted line L2). The specific formula of the objective function will be described later.
  • Advantage 1: In ordinary time-frequency-domain independent component analysis, it is uncertain which original signal appears at which position in the separation result spectrograms; this changes depending on the initial value of the separation matrix, the degree of mixing in the observation signal (corresponding to the mixed sound signal described later), and the algorithm used to obtain the separation matrix. In the present disclosure, by reflecting the dependency on the reference signal, the separation result similar to the reference signal can be made to appear as Y_1.
  • On the other hand, since this is still a separation method, the number of generated signals is N. That is, even if the only desired sound source is Y_1, N-1 unnecessary signals are generated at the same time.
  • the deflation method is a method of estimating the original signals one by one instead of separating all the sound sources at the same time.
  • For the deflation method, refer to, for example, Chapter 8 of Reference 2 below.
  • However, with the deflation method alone, the order of the separation results is indefinite, so the order in which the desired sound source appears is also indefinite.
  • When the deflation method is applied to sound source separation using an objective function that reflects both independence and dependence as described above, the separation result similar to the reference signal can always be made to appear first. That is, the separation process may be terminated once the first sound source has been separated (estimated), and it is not necessary to generate the N-1 unnecessary separation results. In addition, it is not necessary to estimate all the elements of the separation matrix; only the elements needed to generate Y_1 need to be estimated.
  • The deflation method is a separation method (one that estimates all the sound sources before mixing), but if the separation is stopped once one sound source has been estimated, it can be used as an extraction method (one that estimates a single desired sound source). Therefore, in the following description, the operation of estimating only the separation result Y_1 is referred to as "extraction", and Y_1 is appropriately referred to as the "(target sound) extraction result". Furthermore, each separation result is generated from one of the vectors constituting the separation matrix (1-3); this vector is appropriately referred to as an "extraction filter".
  • FIG. 2 shows the details of FIG. 1, and the elements necessary for applying the deflation method are added.
  • The observation signal spectrograms marked (2-1) in FIG. 2 are the same as (1-2) in FIG. 1 and are generated by applying the short-time Fourier transform to the time-domain signals observed by the N microphones.
  • Next, the decorrelated (uncorrelated) observation signal spectrograms (2-3) are generated.
  • Decorrelation, also called whitening, is a transformation that makes the signals observed by the microphones uncorrelated with one another. The specific formulas used in this process will be described later. If decorrelation is performed as preprocessing for separation, efficient algorithms that exploit the properties of decorrelated signals can be applied in the separation.
  • the deflation method is one such algorithm.
  • the number of uncorrelated observation signal spectrograms is the same as the number of microphones, and each is U_1 to U_N.
  • the generation of the uncorrelated observation signal spectrogram need only be performed once as a process before obtaining the extraction filter.
  • In the deflation method, instead of estimating the matrix that generates the separation results Y_1 to Y_N simultaneously, a single filter that generates each separation result is estimated one at a time.
  • In the present disclosure, the only filter to be estimated is w_1, which takes U_1 to U_N as input and generates Y_1; Y_2 to Y_N and w_2 to w_N are not actually generated.
  • the reference signal R with (2-8) is the same as (1-6) in FIG. As described above, in estimating the filter w_1, both the independence of Y_1 to Y_N and the dependency between R and Y_1 are taken into consideration.
  • In the example of FIG. 3, the target sound is human voice, and the number of target sound sources, that is, the number of speakers, is two.
  • In general, however, the target sound may be any kind of sound, and the number of sound sources is not limited to two.
  • Non-voice signals are disturbing sounds; even a voice is treated as a disturbing sound if it is output from a device such as a loudspeaker.
  • Let the two speakers be speaker 1 and speaker 2, respectively.
  • the utterances marked with (3-1) and the utterances marked with (3-2) are the utterances of the speaker 1.
  • the utterances marked with (3-3) and the utterances marked with (3-4) are the utterances of the speaker 2.
  • (3-5) represents a disturbing sound.
  • In FIG. 3, the vertical axis represents the difference in sound source position and the horizontal axis represents time. Part of the utterance section overlaps between utterances (3-1) and (3-3). This corresponds, for example, to the case where speaker 2 starts speaking just before speaker 1 finishes speaking.
  • The relation between utterances (3-2) and (3-4) corresponds, for example, to the case where speaker 2 makes a short utterance, such as a backchannel response, while speaker 1 is making a long utterance. Both are phenomena that frequently occur in conversations between humans.
  • To extract utterance (3-1), the present disclosure uses the reference signal corresponding to utterance (3-1), that is, a rough amplitude spectrogram, together with the observation signal in the time range (3-6) (a mixture of the three sound sources), to generate (estimate) a signal that is as clean as possible (consisting only of the voice of speaker 1 and not containing the other sound sources).
  • Similarly, to extract utterance (3-3), the reference signal corresponding to (3-3) and the observation signal in the time range (3-7) are used to estimate a signal close to the clean voice of speaker 2. In this way, even if utterance sections overlap, different extraction results can be generated in the present disclosure as long as reference signals corresponding to the respective target sounds can be prepared.
  • The time range of speaker 2's utterance (3-4) is completely contained within speaker 1's utterance (3-2), but different extraction results can still be generated by preparing different reference signals for each. That is, to extract utterance (3-2), the reference signal corresponding to utterance (3-2) and the observation signal in the time range (3-8) are used, and to extract utterance (3-4), the reference signal corresponding to utterance (3-4) and the observation signal in the time range (3-9) are used.
  • the observation signal spectrogram X_k corresponding to the k-th microphone is expressed as a matrix having x_k (f, t) as an element as shown in the following equation (1).
  • f is the frequency bin number and t is the frame number, both of which are indexes introduced by the short-time Fourier transform.
  • changing f is referred to as the "frequency direction”
  • changing t is referred to as the "time direction”.
  • the uncorrelated observation signal spectrogram U_k and the separation result spectrogram Y_k are also expressed as matrices with u_k (f, t) and y_k (f, t) as elements, respectively (the notation of mathematical formulas is omitted).
  • The following equation (3) is an equation for obtaining the vector u(f, t) of the decorrelated observation signal. This vector is generated as the product of P(f), called the decorrelation matrix, and the observation signal vector x(f, t).
  • The decorrelation matrix P(f) is calculated by the following equations (4) to (6).
  • The above equation (4) is an equation for obtaining the covariance matrix R_{xx}(f) of the observed signal in the f-th frequency bin.
  • <·>_t on the right side represents the operation of taking the average over a predetermined range of t (frame numbers).
  • the range of t is the time length of the spectrogram, that is, the section in which the target sound is sounding (or the range including the section).
  • the superscript H represents Hermitian transpose (conjugate transpose).
  • In equation (5), V(f) is a matrix consisting of eigenvectors and D(f) is a diagonal matrix consisting of eigenvalues; V(f) is a unitary matrix, so its inverse and its Hermitian transpose are identical.
  • The decorrelation matrix P(f) is calculated by equation (6). Since D(f) is a diagonal matrix, its -1/2 power can be obtained by raising each diagonal element to the -1/2 power.
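  • A minimal sketch of this decorrelation for one frequency bin (illustrative only; it assumes the decomposition R_{xx}(f) = V(f) D(f) V(f)^H and P(f) = D(f)^{-1/2} V(f)^H as read from the surrounding description of equations (3) to (6)):
```python
import numpy as np

def whiten(X_f):
    """Decorrelate (whiten) the multi-channel observation of one frequency bin.

    X_f: observation vectors x(f, t) stacked over frames, shape (N_mics, T), complex
    Returns (U_f, P_f): whitened signal u(f, t) and the decorrelation matrix P(f).
    """
    T = X_f.shape[1]
    R_xx = X_f @ X_f.conj().T / T          # covariance matrix, cf. Eq. (4)
    eigvals, V = np.linalg.eigh(R_xx)      # eigendecomposition R_xx = V D V^H, cf. Eq. (5)
    D_inv_sqrt = np.diag(eigvals ** -0.5)  # D^{-1/2}: -1/2 power of each diagonal element
    P = D_inv_sqrt @ V.conj().T            # decorrelation matrix, cf. Eq. (6)
    U_f = P @ X_f                          # u(f, t) = P(f) x(f, t), cf. Eq. (3)
    return U_f, P
```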
  • The following equation (8) generates the separation results y(f, t) for all channels at f, t, obtained as the product of the separation matrix W(f) and u(f, t). The method for obtaining W(f) will be described later.
  • the reference signal R is represented as a matrix having r (f, t) as an element, as in Eq. (12).
  • the shape itself is the same as the observation signal spectrogram X_k, but the element x_k (f, t) of X_k is a complex number, while the element r (f, t) of R is a non-negative real number.
  • This disclosure estimates only w_1 (f) instead of estimating all the elements of the separation matrix W (f). That is, only the elements used in the generation of the first separation result (target sound extraction result) are estimated.
  • the derivation of the formula for estimating w_1 (f) will be described.
  • the derivation of the equation consists of the following three points, each of which will be explained in order.
  • The objective function used in the present disclosure is a negative log-likelihood and is basically the same as that used in Document 1 and the like. This objective function is minimized when the separation results are mutually independent.
  • the objective function is derived as follows.
  • Equation (13) is a modification of equation (3), the decorrelation equation, and equation (14) is a modification of equation (8), the separation equation.
  • The reference signal r(f, t) is added to the vectors on both sides, and an element 1, which represents "passing the reference signal through", is added to the matrix on the right side.
  • The matrices and vectors to which these elements have been added are denoted by adding a prime to the original symbols.
  • the negative log-likelihood L of the reference signal and the observed signal represented by the following equation (15) is used.
  • p(·) represents the probability density function (hereinafter referred to as pdf) of the signal in parentheses.
  • pdf: probability density function
  • p (R, X_1, ..., X_N) in Eq. (15) is the probability that the reference signal R and the observed signal spectrograms X_1 to X_N occur at the same time.
  • Since the joint probability of independent variables can be decomposed into the product of their individual pdfs, the left side of equation (16) is transformed into the right side by Assumption 1. The inside of the parentheses on the right side is expressed as in equation (17) using x'(f, t) introduced in equation (13).
  • Equation (17) is transformed into equation (18) and equation (19) using the relationship in the lower part of equation (14).
  • det(·) represents the determinant of the matrix in parentheses.
  • Equation (20) is an important transformation in the deflation method. Since the matrix W(f)' is a unitary matrix like the separation matrix W(f), its determinant is 1. Also, since the matrix P'(f) does not change during separation, its determinant is a constant. Therefore, both determinants can be absorbed together into a constant.
  • Equation (21) is a unique variant of this disclosure.
  • The components of y'(f, t) are r(f, t) and y_1(f, t) to y_N(f, t). According to Assumptions 2 and 3, the probability density function that takes these variables as arguments is decomposed into the product of p(r(f, t), y_1(f, t)), the joint pdf of r(f, t) and y_1(f, t), and the individual probability density functions p(y_2(f, t)) to p(y_N(f, t)).
  • Substituting equation (21) into equation (15) gives equation (22).
  • the extraction filter w_1 (f) is a subset of the arguments that minimize equation (22).
  • The sound source model p(r(f, t), y_1(f, t)) is a pdf that takes two variables, the reference signal r(f, t) and the extraction result y_1(f, t), as arguments, and it represents the dependency between these two variables.
  • the sound source model can be formulated based on various concepts. In this disclosure, the following three methods are used.
  • a spherical distribution is a type of multi-variate pdf.
  • a multivariate pdf is constructed by considering multiple arguments of a pdf as a vector and substituting the norm (L2 norm) of the vector into the univariate pdf.
  • Using a spherical distribution in independent component analysis has the effect of making the variables used as its arguments resemble one another.
  • the technique described in Japanese Patent No. 4449871 utilizes this property to solve a problem called a frequency permutation problem, in which "which sound source appears in the kth separation result differs for each frequency bin".
  • In the present disclosure, by using a spherical distribution that takes the reference signal and the extraction result as its arguments, the extraction result can be made similar to the reference signal.
  • the spherical distribution used here can be expressed by the general form of the following equation (24).
  • the function F is any univariate pdf.
  • c_1 and c_2 are positive constants, and the influence of the reference signal on the extraction result can be adjusted by changing these values.
  • Using a Laplace-type distribution as F, the following equation (25) is obtained; hereinafter, this equation is referred to as the bivariate Laplace distribution.
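  • A minimal sketch of such a spherical sound source model (an assumption-laden illustration based on the general form of equation (24) with a Laplace-type univariate pdf; the exact constants and normalization of equation (25) are not reproduced here):
```python
import numpy as np

def bivariate_laplace_logpdf(r, y, c1=1.0, c2=1.0):
    """Unnormalized log-pdf of a spherical, bivariate Laplace-type sound source model.

    r: reference signal value r(f, t), non-negative real
    y: extraction result value y_1(f, t), complex
    c1, c2: positive constants adjusting the influence of the reference signal
    """
    # Spherical distribution: substitute the L2 norm of (sqrt(c1)*r, sqrt(c2)*|y|)
    # into a univariate Laplace-type pdf proportional to exp(-x).
    return -np.sqrt(c1 * r ** 2 + c2 * np.abs(y) ** 2)
```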
  • Divergence-based model: Another type of sound source model is a pdf based on a divergence, which is a broader concept than a distance measure, and is expressed in the form of the following equation (26).
  • The divergence in equation (26) is computed between the reference signal r(f, t) and the amplitude of the extraction result |y_1(f, t)|.
  • Equation (30) below is a pdf based on another divergence. In either case, the pdf takes a larger value when the reference signal and the amplitude of the extraction result are similar.
  • Time-frequency-varying variance model: As another sound source model, a time-frequency-varying variance (TFVV) model is also possible. This is a model in which each point making up the spectrogram has a different variance or standard deviation over time and frequency. The rough amplitude spectrogram serving as the reference signal is then interpreted as representing the standard deviation of each point (or some value depending on the standard deviation).
  • TFVV: time-frequency-varying variance
  • As the distribution, a Laplace distribution whose variance varies with time and frequency (hereinafter referred to as the TFVV Laplace distribution) can be used.
  • This distribution includes a term for adjusting the magnitude of the influence of the reference signal on the extraction result.
  • equation (32) is obtained.
  • When a Student's t distribution with time-frequency-varying variance is used, the sound source model of the following equation (33) can be obtained.
  • ν in equation (33) is a parameter called the degree of freedom, and the shape of the distribution changes with this value.
  • ν = 1 corresponds to the Cauchy distribution, and ν → ∞ corresponds to the Gaussian distribution.
  • Auxiliary function method: A fast and stable algorithm called the auxiliary function method can be applied to equations (25), (31), and (33).
  • To equations (27) to (30), another algorithm called the fixed-point method can be applied.
  • Eig (A) represents a function that takes a matrix A as an argument and performs eigenvalue decomposition on the matrix to obtain all eigenvectors.
  • the eigenvectors of the weighted covariance matrix in equation (34) can be written as in equation (35) below.
  • a_{min}(f), ..., a_{max}(f) on the left side of equation (35) are eigenvectors; a_{min}(f) corresponds to the smallest eigenvalue and a_{max}(f) to the largest eigenvalue.
  • the norm of each eigenvector is 1, and it is assumed that they are orthogonal to each other.
  • w_1(f), which minimizes equation (34), is the Hermitian transpose of the eigenvector corresponding to the smallest eigenvalue, as shown in equation (36) below.
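  • The step of taking the Hermitian transpose of the eigenvector associated with the smallest eigenvalue of a weighted covariance matrix could be sketched as follows (illustrative; the exact per-frame weights are given by the chosen sound source model and are not reproduced here):
```python
import numpy as np

def min_eigvec_filter(U_f, weights):
    """Estimate w_1(f) as the Hermitian transpose of the eigenvector with the smallest eigenvalue.

    U_f: whitened observation u(f, t) stacked over frames, shape (N_mics, T), complex
    weights: non-negative per-frame weights, shape (T,), derived from the sound source model
    """
    T = U_f.shape[1]
    # Weighted covariance matrix of the whitened observation (cf. Eq. (34)/(41))
    C = (U_f * weights) @ U_f.conj().T / T
    eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues returned in ascending order
    a_min = eigvecs[:, 0]                  # eigenvector of the smallest eigenvalue
    return a_min.conj()                    # w_1(f) = a_min(f)^H, used as a row vector
```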
  • the auxiliary function method is one of the methods for efficiently solving an optimization problem, and details thereof are described in Japanese Patent Application Laid-Open No. 2011-175114 and Japanese Patent Application Laid-Open No. 2014-219467.
  • As equation (38), an inequality that bounds the objective function from above is prepared.
  • The right-hand side of equation (38) is called the auxiliary function, and b(f, t) appearing in it is called the auxiliary variable.
  • The minimization problem is solved quickly and stably by repeating the following two steps alternately. 1. As shown in equation (40) below, fix w_1(f) and find b(f, t) that minimizes G. 2. As shown in equation (41) below, fix b(f, t) and find w_1(f) that minimizes G.
  • Equation (40) is minimized when the equality in equation (38) holds. Since the value of y_1(f, t) changes every time w_1(f) changes, it is calculated using equation (9). Since equation (41) is a minimization problem involving a weighted covariance matrix, like equation (34), it can be solved using eigenvalue decomposition.
  • When the eigenvectors of the weighted covariance matrix in equation (41) are calculated by the following equation (42), the solution of equation (41), w_1(f), is the Hermitian transpose of the eigenvector corresponding to the smallest eigenvalue (equation (36)).
  • a) Use normalize(r(f, t)) as the auxiliary variable b(f, t).
  • b) Calculate a tentative value as the separation result y_1 (f, t), and calculate the auxiliary variable from it by equation (40).
  • c) Substitute a temporary value for w_1 (f) to calculate equation (40).
  • Normalize () in a) above is a function defined by the following equation (43), and s (t) in this equation represents an arbitrary time series signal. The function of normalize () is to normalize the root mean square of the absolute value of the signal to 1.
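  • A rough sketch of normalize() and of the alternating updates is shown below (illustrative only; it assumes equation (43) normalizes the root mean square of the absolute value to 1, and the auxiliary-variable and weighting updates are only schematic, the exact forms being given by equations (40) and (41)):
```python
import numpy as np

def normalize(s):
    """Scale a time series so the root mean square of its absolute value is 1 (cf. Eq. (43))."""
    return s / np.sqrt(np.mean(np.abs(s) ** 2))

def estimate_filter(U_f, r_f, n_iter=10, eps=1e-6):
    """Alternating, auxiliary-function-style estimation of w_1(f) for one frequency bin.

    U_f: whitened observation, shape (N_mics, T), complex
    r_f: reference signal for this bin, shape (T,), non-negative real
    """
    T = U_f.shape[1]
    b = normalize(r_f) + eps                 # initialization a): b(f, t) = normalize(r(f, t))
    w = None
    for _ in range(n_iter):
        # Step 2 (cf. Eq. (41)): fix b and minimize a weighted covariance form over w_1(f)
        C = (U_f / b) @ U_f.conj().T / T     # schematic weighting; exact weights per Eq. (41)
        _, eigvecs = np.linalg.eigh(C)
        w = eigvecs[:, 0].conj()             # Hermitian transpose of the min-eigenvalue eigenvector
        # Step 1 (cf. Eq. (40)): fix w_1(f), recompute y_1(f, t), and update the auxiliary variable
        y = w @ U_f                          # y_1(f, t) = w_1(f) u(f, t), cf. Eq. (9)
        b = np.maximum(np.abs(y), eps)       # schematic update; exact form per Eq. (40)
    return w
```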
  • In addition, the value of the extraction filter estimated for the previous target sound section may be saved and used as the initial value of w_1(f) when processing the next target sound section. For example, when sound source extraction is performed for the utterance (3-2) shown in FIG. 3, the extraction filter estimated for the previous utterance (3-1) of the same speaker is used as the tentative value of w_1(f) in this extraction.
  • w_1 (f) may be obtained by using the update formula derived from the TFVV Gaussian distribution only for the first time.
  • the step of obtaining the extraction filter w_1 (f) (corresponding to the equation (41)) can be expressed as the following equation (47).
  • the step for finding the auxiliary variable b (f, t) is as shown in Eq. (49) below.
  • The degree of freedom ν functions as a parameter for adjusting the relative influence of r(f, t), the reference signal, and y_1(f, t), the extraction result obtained during the iterations.
  • When ν = 0, the reference signal is ignored; when ν is 0 or more and less than 2, the influence of the extraction result is larger than that of the reference signal; when ν is greater than 2, the influence of the reference signal is larger; and in the limit ν → ∞, the extraction result is ignored, which is equivalent to the TFVV Gaussian distribution.
  • the step for obtaining the extraction filter w_1 (f) is as shown in the following equation (50). Since the formula (50) is the same as the formula (47) in the case of the bivariate Laplace distribution, the extraction filter can be similarly obtained by the formula (48).
  • The update equation is derived in a fixed-point form with respect to w_1(f).
  • As the condition that holds at convergence, the equation stating that the partial derivative with respect to the parameter is zero is used, and a concrete equation is derived by performing the partial differentiation shown in equation (51).
  • The left side of equation (51) is the partial derivative with respect to conj(w_1(f)). Equation (51) is then transformed to obtain the form of equation (52).
  • Equation (55) is written in two forms: the upper form is intended to be used after calculating y_1(f, t) using equation (9), while the lower form uses w_1(f) and u(f, t) directly without calculating y_1(f, t). The same applies to equations (56) to (60) described later.
  • w_1(f) is calculated by either of the following methods. a) Calculate a tentative value as the separation result y_1(f, t), and then calculate w_1(f) from it using the upper form of equation (55). b) Substitute a tentative value into w_1(f), and calculate w_1(f) from it using the lower form of equation (55). For the tentative value of y_1(f, t) in a), the method of b) in the explanation of equation (40) can be used. Similarly, for the tentative value of w_1(f) in b), the method of c) in the explanation of equation (40) can be used.
  • the update formulas derived from the formula (28), which is a pdf corresponding to Itakura Saito divergence (power spectrogram version), are the following formulas (56) and (57).
  • Equation (57) is as follows.
  • Since equation (52) can be transformed into two forms, there are also two update equations.
  • The second term on the right side of the lower form of equation (56) and the third term on the right side of the lower form of equation (57) are both composed only of u(f, t) and r(f, t) and remain constant during the iterative process. Therefore, these terms need to be calculated only once before the iterations, and the inverse matrix in equation (57) also needs to be calculated only once.
  • Equation (59) is as follows.
  • FIG. 4 is a diagram showing a configuration example of a sound source extraction device (sound source extraction device 100) which is an example of the signal processing device according to the present embodiment.
  • The sound source extraction device 100 includes, for example, a plurality of microphones 11, an AD (Analog to Digital) conversion unit 12, an STFT (Short-Time Fourier Transform) unit 13, an observation signal buffer 14, a section estimation unit 15, a reference signal generation unit 16, a sound source extraction unit 17, and a control unit 18.
  • the sound source extraction device 100 includes a post-stage processing unit 19 and a section / reference signal estimation sensor 20 as needed.
  • The plurality of microphones 11 are installed at different positions. There are several variations of microphone installation, as described later. A mixed sound signal in which the target sound and sounds other than the target sound are mixed is input through the microphones 11.
  • the AD conversion unit 12 converts the multi-channel signal acquired by each microphone 11 into a digital signal for each channel. This signal is appropriately referred to as an observation signal (in the time domain).
  • the STFT unit 13 converts the observed signal into a signal in the time frequency domain by applying a short-time Fourier transform to the observed signal.
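  • A minimal short-time Fourier transform consistent with this description could look as follows (the frame length, shift, and window are hypothetical values, not those of the embodiment):
```python
import numpy as np

def stft(x, frame_len=1024, hop=256):
    """Convert a time-domain observation signal into a time-frequency-domain spectrogram.

    x: single-channel time-domain signal, shape (n_samples,)
    Returns a complex spectrogram of shape (frame_len // 2 + 1, n_frames).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    spec = np.empty((frame_len // 2 + 1, n_frames), dtype=complex)
    for t in range(n_frames):
        frame = x[t * hop : t * hop + frame_len] * window
        spec[:, t] = np.fft.rfft(frame)  # one column per frame, one row per frequency bin
    return spec
```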
  • the observation signal in the time frequency domain is sent to the observation signal buffer 14 and the interval estimation unit 15.
  • the observation signal buffer 14 stores observation signals for a predetermined time (number of frames).
  • The observation signal is stored frame by frame, and when a request specifying a time range of the observation signal is received from another module, the observation signal corresponding to that time range is returned.
  • the signal accumulated here is used in the reference signal generation unit 16 and the sound source extraction unit 17.
  • the section estimation unit 15 detects a section in which the target sound is included in the mixed sound signal. Specifically, the section estimation unit 15 detects the start time (time when the sound starts to sound), the end time (time when the sound ends), and the like of the target sound. The technique used to estimate this section depends on the usage scene of this embodiment and the installation mode of the microphone, and will be described in detail later.
  • the reference signal generation unit 16 generates a reference signal corresponding to the target sound based on the mixed sound signal. For example, the reference signal generation unit 16 estimates a rough amplitude spectrogram of the target sound. Since the processing performed by the reference signal generation unit 16 depends on the usage scene of this embodiment and the installation mode of the microphone, the details will be described later.
  • The sound source extraction unit 17 extracts, from the mixed sound signal, a signal that is similar to the reference signal and in which the target sound is further emphasized. Specifically, the sound source extraction unit 17 estimates the target sound by using the observation signal and the reference signal corresponding to the section in which the target sound is present, or estimates an extraction filter for generating such an estimation result from the observation signal.
  • the output of the sound source extraction unit 17 is sent to the post-stage processing unit 19 as needed.
  • Examples of the post-stage processing performed by the post-stage processing unit 19 include voice recognition and the like.
  • the sound source extraction unit 17 outputs a time domain extraction result, that is, a voice waveform, and the voice recognition unit performs recognition processing on the voice waveform.
  • The voice-section detection function on the voice recognition side can be omitted because the present embodiment includes the section estimation unit 15, which is equivalent to it. Furthermore, voice recognition often includes an STFT for extracting the voice features required for recognition from the waveform, but when combined with the present embodiment, the STFT on the voice recognition side may be omitted.
  • In that case, the sound source extraction unit 17 outputs the extraction result in the time-frequency domain, that is, a spectrogram, and on the voice recognition side the spectrogram is converted into voice features.
  • the control unit 18 comprehensively controls each unit of the sound source extraction device 100.
  • the control unit 18 controls, for example, the operation of each of the above-mentioned units. Although omitted in FIG. 4, the control unit 18 and the above-mentioned functional blocks are connected to each other.
  • The section/reference signal estimation sensor 20 is a sensor, different from the microphones 11, that is intended to be used for section estimation or reference signal generation.
  • The post-stage processing unit 19 and the section/reference signal estimation sensor 20 are shown in parentheses to indicate that they can be omitted from the sound source extraction device 100. That is, if the accuracy of section estimation or reference signal generation can be improved by providing a dedicated sensor different from the microphones 11, such a sensor may be used.
  • an image sensor (camera) can be applied as a sensor.
  • the following sensors used as auxiliary sensors in Japanese Patent Application No. 2019-073542 proposed by the present inventor may be provided, and section estimation or reference signal generation may be performed using the signals acquired thereby.
  • -A type of microphone that is used in close contact with the body, such as a bone conduction microphone and a pharyngeal microphone.
  • -A sensor that can observe the vibration of the skin surface near the speaker's mouth and throat. For example, a combination of a laser pointer and an optical sensor.
  • FIG. 5 is a diagram assuming a situation in which there are N (two or more) speakers in a certain environment and a microphone is assigned to each speaker. Assigning a microphone means that each speaker is wearing a pin microphone, a headset microphone, or the like, or the microphone is installed at a close distance to each speaker. Let N speakers be S1, S2 ... Sn, and microphones assigned to each speaker be M1, M2 ... Mn. Further, there are 0 or more interfering sound sources Ns.
  • a conference is held in a room, and in order to automatically create the minutes of the conference, voice recognition is performed for the voice picked up by each speaker's microphone.
  • the utterances may overlap with each other, and when the utterances overlap, a signal in which the voices are mixed is observed in each microphone.
  • a disturbing sound source there may be a sound of a fan of a projector or an air conditioner, a reproduced sound emitted from a device equipped with a speaker, and the like, and these sounds are also included in the observation signal of each microphone.
  • the section detection method and reference signal generation method that can be used in such a situation will be described below.
  • the voice of the corresponding (target) speaker will be referred to as the main voice or the main utterance, and the voice of another speaker will be appropriately referred to as the wraparound voice or crosstalk.
  • the main utterance detection described in Japanese Patent Application No. 2019-227192 can be used.
  • In this way, a detector that responds to the main voice while ignoring crosstalk is realized. Furthermore, even if utterances overlap, the section and the speaker of each utterance can be estimated as shown in the figure.
  • One reference signal generation method is to generate the reference signal directly from the signal observed by the microphone assigned to the speaker.
  • The signal observed by microphone M1 in FIG. 5 is a mixture of all sound sources, but the sound of speaker S1, the nearest sound source, is picked up loudly, while the other sound sources are picked up more quietly. Therefore, if the observation signal of microphone M1 is cut out according to the utterance section of speaker S1, a short-time Fourier transform is applied to it, and the absolute value is taken to generate an amplitude spectrogram, this is a rough amplitude spectrogram of the target sound and can be used as a reference signal in this embodiment.
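  • A sketch of this reference signal generation (cut out the utterance section of the assigned microphone, apply the STFT, and take absolute values); the function names are illustrative, not part of the patent:
```python
import numpy as np

def reference_from_near_mic(x_mic, start_sample, end_sample, stft_fn):
    """Generate a rough amplitude spectrogram of the target sound from the assigned microphone.

    x_mic: time-domain signal of the microphone assigned to the speaker
    start_sample, end_sample: utterance section obtained from the section estimation step
    stft_fn: a short-time Fourier transform function returning a complex spectrogram
    """
    segment = x_mic[start_sample:end_sample]  # cut out the utterance section
    spec = stft_fn(segment)                   # complex spectrogram of the section
    return np.abs(spec)                       # amplitude spectrogram used as the reference signal
```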
  • Another method is to use the crosstalk reduction technique described in Japanese Patent Application No. 2019-227192 described above.
  • In this technique, a neural network removes (reduces) the crosstalk from the signal in which the main voice and the crosstalk are mixed, leaving the main voice.
  • the output of this neural network is an amplitude spectrogram or a time-frequency mask of the crosstalk reduction result, and the former can be used as a reference signal as it is.
  • Even in the latter case, by applying the time-frequency mask to the amplitude spectrogram of the observation signal, an amplitude spectrogram of the crosstalk-reduction result can be generated, so it can also be used as a reference signal.
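  • A sketch of the mask-based case (illustrative; mask values are assumed to lie in [0, 1]):
```python
import numpy as np

def reference_from_mask(X_obs, mask):
    """Generate a reference amplitude spectrogram by masking the observation.

    X_obs: complex observation spectrogram of the assigned microphone, shape (F, T)
    mask: time-frequency mask output by the crosstalk-reduction network, shape (F, T), values in [0, 1]
    """
    return np.abs(X_obs) * mask  # rough amplitude spectrogram of the main voice
```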
  • FIG. 6 assumes an environment in which there are one or more speakers and one or more interfering sound sources.
  • In the example described above, the focus was on the overlap of utterances rather than on the presence of the disturbing sound source Ns; in the example shown in FIG. 6, the focus is on obtaining a clean voice in a noisy environment in which loud disturbing sounds are present.
  • Of course, the overlap of utterances is also an issue here.
  • There are n speakers, denoted speaker S1 to speaker Sn, where n is 1 or more. In FIG. 6, only one disturbing sound source Ns is shown, but the number of disturbing sound sources is arbitrary.
  • Two types of sensors are used. One is a sensor worn by each speaker or installed in the immediate vicinity of each speaker (corresponding to the section/reference signal estimation sensor 20); these are hereinafter referred to as sensors SE (sensors SE1, SE2, ..., SEn) as appropriate.
  • the other is a microphone array 11A composed of a plurality of microphones 11 having a fixed position.
  • The section/reference signal estimation sensor 20 may be of the same type as the microphones shown in FIG. 5 (so-called air conduction microphones, which pick up sound propagating through the air). As described with reference to FIG. 4, a microphone of the type used in close contact with the body, such as a bone conduction microphone or a pharyngeal microphone, or a sensor that can observe the vibration of the skin surface near the speaker's mouth or throat, may also be used. In any case, since each sensor SE is closer to, or in close contact with, its speaker than the microphone array, the utterance of the speaker corresponding to each sensor can be recorded at a high SN ratio.
• As for the microphone array 11A, in addition to the form in which a plurality of microphones are installed in one device, a form in which microphones are installed at a plurality of places in a space, called distributed microphones, is also possible.
• As distributed microphones, forms in which the microphones are installed on the walls or ceiling of a room, or on the seats, walls, ceiling, dashboard, and the like of an automobile, can be considered.
• The signals acquired by the sensors SE1 to SEn, corresponding to the section / reference signal estimation sensor 20, are used for section estimation and reference signal generation, while the multi-channel observation signals acquired from the microphone array 11A are used for sound source extraction.
• As for the section estimation method and the reference signal generation method when an air conduction microphone is used as the sensor SE, the same methods as those described with reference to FIG. 5 can be used.
• When a close-contact microphone is used, the following methods can also be applied in addition to the same method as described above.
• For section estimation, a method of discrimination by thresholding the power of the input signal can be used, and as the reference signal, the amplitude spectrogram generated from the input signal can be used as it is.
• The sound recorded by a close-contact microphone is not always appropriate as an input for voice recognition and the like, because high frequencies are attenuated and sounds generated inside the body, such as swallowing sounds, may also be recorded; however, it can be used effectively for section estimation and reference signal generation.
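A minimal sketch of the power-threshold section estimation mentioned above is shown below. The frame length, hop size, and threshold value are illustrative assumptions, not values specified in this disclosure.

```python
import numpy as np

def detect_sections(signal, fs, frame=1024, hop=256, thresh_db=-40.0):
    """Return (start_frame, end_frame) pairs where frame power exceeds a threshold."""
    n_frames = 1 + (len(signal) - frame) // hop
    powers = np.array([np.mean(signal[i * hop:i * hop + frame] ** 2)
                       for i in range(n_frames)])
    active = 10 * np.log10(powers + 1e-12) > thresh_db
    # convert the boolean frame activity into (start, end) frame pairs
    sections, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            sections.append((start, i)); start = None
    if start is not None:
        sections.append((start, len(active)))
    return sections
```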
• The relationship from the sound acquired by the air conduction microphone (a mixture of the target sound and the interfering sound) and the signal acquired by the auxiliary sensor (some signal corresponding to the target sound) to a clean target sound is learned in advance by a neural network; at inference time, the signals acquired by the air conduction microphone and the auxiliary sensor are input to the neural network to generate a nearly clean target sound.
• If the output of the neural network is an amplitude spectrogram (or a time-frequency mask), it can be used as a reference signal in this embodiment (or a reference signal can be generated from it).
• Since a method of generating a clean target sound and at the same time estimating the section in which the target sound is sounding has also been reported, such a method can also be used as the section detecting means.
  • Sound source extraction is basically performed using the observation signal acquired by the microphone array 11A.
• For section estimation, a signal derived from the microphone array may be used in addition to the sensors SE. Since the microphone array 11A is far from every speaker, each speaker's utterance is always observed together with crosstalk. By comparing this signal with the signal of the section / reference signal estimation sensor, an improvement in the accuracy of section estimation can be expected, especially when utterances overlap.
• FIG. 7 shows a microphone installation form different from that of FIG. 6. It is the same as FIG. 6 in that it assumes an environment with one or more speakers and one or more interfering sound sources, but the only microphone used is the microphone array 11A; there is no sensor installed in the immediate vicinity of each speaker. As in FIG. 6, the form of the microphone array 11A may be a plurality of microphones installed in one device, a plurality of microphones installed in a space (distributed microphones), or the like.
• The case where the mixing of voices is low is, for example, the case where there is only one speaker in the environment (that is, only the speaker S1) and the interfering sound source Ns can be regarded as non-voice.
• In that case, a voice section detection technique focusing on "voice-likeness", as described in Japanese Patent No. 4182444 and the like, can be applied. That is, in the environment of FIG. 7, when it is considered that the only "voice-like" signal is the utterance of the speaker S1, the non-voice signals are ignored, and the locations (timings) in which a voice-like signal is included are detected as sections of the target sound.
• For reference signal generation, a method called denoising as described in Document 3 can be used, that is, a process of inputting a signal in which voice and non-voice are mixed, removing the non-voice, and leaving the voice.
  • a wide variety of methods can be applied to denoising.
  • the following method uses a neural network, and its output is an amplitude spectrogram, so that the output can be used as a reference signal as it is.
• Reference 3: D. Liu, P. Smaragdis, and M. Kim, "Experiments on deep learning for speech denoising," Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH 2014), pp. 2685-2689.
• In that case, the sound source direction estimation that is the premise of a) can be applied. Further, if an image sensor (camera) is used as the section / reference signal estimation sensor 20 in the example shown in FIG. 4, b) can also be applied. In either method, the direction of the utterance is known when the utterance section is detected (in method b), the utterance direction can be calculated from the position of the lips in the image), so that value can be used for reference signal generation.
  • the sound source direction estimated in the utterance section estimation is appropriately referred to as ⁇ .
  • the reference signal generation method also needs to support mixing of voices, and the following can be applied as such a technique.
• a) Time-frequency masking using the sound source direction: this is the reference signal generation method used in Japanese Patent Application Laid-Open No. 2014-219467. A steering vector corresponding to the sound source direction θ is calculated, and when the cosine similarity between it and the observation signal vector (Eq. (2) described above) is calculated, the result is a mask that leaves the sound arriving from the direction θ and attenuates the sounds arriving from other directions. This mask is applied to the amplitude spectrogram of the observed signal, and the signal thus generated is used as a reference signal (a sketch is given after the references below).
• The selective listening technology is a technology that extracts the voice of a specified speaker from a monaural signal in which multiple voices are mixed.
  • the voice of the designated speaker included in the mixed signal is output.
  • a time-frequency mask is output to generate such a spectrogram. When the mask thus output is applied to the amplitude spectrogram of the observed signal, it can be used as the reference signal of the present embodiment.
  • Speaker Beam and Voice Filter are described in Documents 4 and 5 below, respectively.
• Reference 4: M. Delcroix, K. Zmolikova, K. Kinoshita, A. Ogawa, and T. Nakatani, "Single channel target speaker extraction and recognition with speaker beam," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
• Reference 5: Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. Hershey, R. A. Saurous, R. J. Weiss, Y. Jia, and I. Lopez Moreno, "VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking," INTERSPEECH, 2019.
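The following is a minimal sketch of the direction-based time-frequency masking of method a) above: the cosine similarity between a steering vector for the direction θ and each observation vector is used as the mask. The plane-wave steering-vector model for a linear array (microphone positions mic_pos, sound speed c) is an illustrative assumption; the disclosure only states that a steering vector corresponding to θ is used.

```python
import numpy as np

def direction_mask(obs, theta, mic_pos, fs, n_fft, c=340.0):
    """obs: complex observation, shape (n_mics, n_freq, n_frames).
       Returns a (n_freq, n_frames) mask in [0, 1]."""
    n_mics, n_freq, n_frames = obs.shape
    freqs = np.arange(n_freq) * fs / n_fft                    # Hz per bin
    delays = mic_pos * np.sin(theta) / c                      # shape (n_mics,)
    steer = np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])
    steer /= np.linalg.norm(steer, axis=0, keepdims=True)     # unit-norm columns
    num = np.abs(np.einsum('mf,mft->ft', steer.conj(), obs))  # |steer^H x|
    den = np.linalg.norm(obs, axis=0) + 1e-12                 # |x|
    return num / den        # cosine similarity, applied to the amplitude spectrogram
```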
  • the sound source extraction unit 17 includes, for example, a pre-processing unit 17A, an extraction filter estimation unit 17B, and a post-processing unit 17C.
• The preprocessing unit 17A performs the decorrelation (whitening) processing shown in equations (3) to (7), that is, decorrelation of the time-frequency domain observation signal.
• The extraction filter estimation unit 17B estimates a filter that extracts a signal in which the target sound is emphasized. Specifically, the extraction filter estimation unit 17B estimates the extraction filter for sound source extraction and generates the extraction result. More specifically, the extraction filter estimation unit 17B estimates the extraction filter as a solution that optimizes an objective function reflecting both the dependency between the reference signal and the extraction result by the extraction filter, and the independence between the extraction result and the separation results of other virtual sound sources.
• As the sound source model that represents the dependency between the reference signal and the extraction result included in the objective function, the extraction filter estimation unit 17B uses any of the following: a bivariate spherical distribution of the extraction result and the reference signal; a time-frequency-varying variance model that regards the reference signal as a value corresponding to the variance for each time-frequency bin; or a model that uses the divergence between the absolute value of the extraction result and the reference signal.
  • the bivariate Laplace distribution may be used as the bivariate spherical distribution.
• Any one of the time-frequency-varying variance (TFVV) Gaussian distribution, the TFVV Laplace distribution, and the TFVV Student-t distribution may be used as the time-frequency-varying variance model.
• As the divergence, any of the following may be used: the Euclidean distance (square error) between the absolute value of the extraction result and the reference signal; the Itakura-Saito distance between the power spectrum of the extraction result and the power spectrum of the reference signal; the Itakura-Saito distance between the amplitude spectrum of the extraction result and the reference amplitude spectrum; or the square error between the ratio of the absolute value of the extraction result to the reference signal and 1.
  • the post-processing unit 17C performs at least the processing of applying the extraction filter to the mixed sound signal.
  • the post-processing unit 17C may perform a process of applying an inverse Fourier transform to the extraction result spectrogram to generate an extraction result waveform.
  • step ST11 the AD conversion unit 12 converts the analog observation signal (mixed sound signal) input to the microphone 11 into a digital signal.
  • the observed signal at this point is in the time domain. Then, the process proceeds to step ST12.
  • step ST12 the STFT unit 13 applies a short-time Fourier transform (STFT) to the observation signal in the time domain to obtain the observation signal in the time frequency domain.
• Input may be performed not only from the microphone but also from a file or a network as needed. Details of the specific processing performed by the STFT unit 13 will be described later.
• AD conversion and STFT are performed for each of the channels. Then, the process proceeds to step ST13.
• In step ST13, a process (buffering) is performed in which the observation signals converted into the time-frequency domain by the STFT are accumulated for a predetermined time (a predetermined number of frames). Then, the process proceeds to step ST14.
• The section estimation unit 15 estimates the start time (the time when the sound starts to sound) and the end time (the time when the sound ends) of the target sound. Furthermore, when used in an environment where utterances may overlap with each other, information that can identify which speaker made the utterance is also estimated. For example, in the usage patterns shown in FIGS. 5 and 6, the microphone (sensor) number assigned to each speaker is also estimated, and in the usage pattern shown in FIG. 7, the direction of the utterance is also estimated.
• Sound source extraction and the associated processing are performed for each section of the target sound. Therefore, the process proceeds to step ST16 only when a section is detected; if no section is detected, steps ST16 to ST19 are skipped and the process proceeds to step ST20.
  • step ST16 the reference signal generation unit 16 generates a rough amplitude spectrogram of the target sound sounding in that section.
  • the methods that can be used to generate the reference signal are as described with reference to FIGS. 5 to 7. Then, the process proceeds to step ST17.
  • step ST17 the sound source extraction unit 17 generates the extraction result of the target sound by using the reference signal obtained in step ST16 and the observation signal corresponding to the time range of the target sound section. The details of the process will be described later.
  • step ST18 it is determined whether or not the processing related to step ST16 and step ST17 is repeated a predetermined number of times.
  • the meaning of this iteration is that if the sound source extraction process generates an extraction result with higher accuracy than the observed signal or reference signal, then the reference signal is regenerated from the extraction result, and the sound source extraction process is executed again using it. This means that the extraction result can be obtained with higher accuracy than the previous time.
  • the present embodiment is characterized in that the extraction process is repeated instead of the separation process. It should be noted that this iteration is different from the iteration used when estimating the filter by the auxiliary function method or the fixed point method inside the sound source extraction process according to step ST17. After the process according to step ST18, the process proceeds to step ST19.
  • step ST19 the post-processing is performed by the post-processing unit 17C using the extraction result generated in step ST17.
  • voice recognition and response generation for voice dialogue using the recognition result can be considered. Then, the process proceeds to step ST20.
  • step ST20 it is determined whether or not to continue the process. If it continues, the process returns to step ST11, and if it does not continue, the process ends.
  • the short-time Fourier transform performed by the STFT unit 13 will be described with reference to FIG.
• the microphone observation signal is a multi-channel signal observed by a plurality of microphones
• the STFT is performed for each channel. The following is a description of the STFT for the k-th channel.
• a certain length is cut out from the waveform of the microphone recording signal obtained by the AD conversion process according to step ST11, and a window function such as a Hanning window or a Hamming window is applied to it (see FIG. 10A).
  • This cut out unit is called a frame.
  • x_k (1, t) to x_k (F, t) are obtained as observation signals in the time frequency domain.
  • t represents the frame number
  • F represents the total number of frequency bins (see FIG. 10C).
  • x_k (1, t) to x_k (F, t) is collectively described as one vector x_k (t) (see FIG. 10C difference).
  • x_k (t) is called a spectrum, and a data structure in which multiple spectra are arranged in the time direction is called a spectrogram.
  • the horizontal axis represents the frame number and the vertical axis represents the frequency bin number, and three spectra 51A, 52A, and 53A are generated from each of the cut out observation signals 51, 52, and 53, respectively.
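A minimal sketch of the per-channel STFT described above is given below; the frame length and hop size are illustrative assumptions.

```python
import numpy as np

def stft_one_channel(x, frame_len=1024, hop=256):
    """Cut out frames with a hop, apply a Hanning window, and FFT each frame
    to obtain the spectra x_k(1..F, t); side by side they form a spectrogram."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    spec = np.empty((frame_len // 2 + 1, n_frames), dtype=complex)
    for t in range(n_frames):
        frame = x[t * hop:t * hop + frame_len] * window   # windowed frame
        spec[:, t] = np.fft.rfft(frame)                   # F frequency bins
    return spec
```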
  • preprocessing is performed by the preprocessing unit 17A.
• As the preprocessing, there is the decorrelation (whitening) represented by equations (3) to (6).
• Some update formulas used in filter estimation perform special processing only on the first iteration, and such processing is also performed as preprocessing. Then, the process proceeds to step ST32.
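The decorrelation (whitening) preprocessing could look like the following sketch, in which the observation vectors are transformed so that their spatial covariance becomes the identity matrix for each frequency bin. Since equations (3) to (6) are not reproduced in this text, this standard eigenvalue-decomposition whitening is an assumption about their typical form.

```python
import numpy as np

def decorrelate(obs):
    """obs: complex observation, shape (n_mics, n_freq, n_frames).
       Returns the decorrelated (whitened) observation of the same shape."""
    n_mics, n_freq, n_frames = obs.shape
    u = np.empty_like(obs)
    for f in range(n_freq):
        x = obs[:, f, :]                                   # (n_mics, n_frames)
        cov = x @ x.conj().T / n_frames                    # spatial covariance
        eigval, eigvec = np.linalg.eigh(cov)
        whiten = eigvec @ np.diag(1.0 / np.sqrt(eigval + 1e-12)) @ eigvec.conj().T
        u[:, f, :] = whiten @ x                            # decorrelated signal
    return u
```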
• In step ST32, a process of estimating the extraction filter is performed. Then, the process proceeds to step ST33. Steps ST32 and ST33 represent the iteration for estimating the extraction filter. Except when the TFVV Gaussian distribution of equation (32) is used as the sound source model, the extraction filter cannot be obtained in closed form; therefore, the processing of step ST32 is repeated until the extraction filter and the extraction result converge, or a predetermined number of times.
  • the extraction filter estimation process according to step ST32 is a process for obtaining the extraction filter w_1 (f), and the specific formula differs for each sound source model.
• When the TFVV Gaussian distribution is used as the sound source model, the reference signal r(f, t) and the decorrelated observation signal u(f, t) are used on the right side of equation (35), and the extraction filter is obtained by applying the Hermitian transpose to the eigenvector corresponding to the smallest eigenvalue, as in Eq. (36).
• When the TFVV Laplace distribution of equation (31) is used as the sound source model, first, the auxiliary variable b(f, t) is calculated according to equation (40) using the reference signal r(f, t) and the decorrelated observation signal u(f, t). Next, the weighted covariance matrix on the right side of equation (42) is calculated, and eigenvalue decomposition is applied to it to obtain the eigenvectors. Finally, the extraction filter w_1(f) is obtained by equation (36). Since w_1(f) at this point has not yet converged, the process returns to equation (40) and the auxiliary variable is calculated again. These steps are executed until w_1(f) converges, or a predetermined number of times.
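The iterative loop just described (auxiliary variable, weighted covariance matrix, eigenvector of the smallest eigenvalue, Hermitian transpose) could be sketched as follows. Because equations (31), (36), (40), and (42) are not reproduced here, the concrete weight formula below is an illustrative assumption rather than the exact update rule of this disclosure.

```python
import numpy as np

def estimate_filter_tfvv_laplace(u_f, r_f, n_iter=20):
    """u_f: decorrelated observations for one bin, shape (n_mics, n_frames).
       r_f: reference amplitudes for the same bin, shape (n_frames,)."""
    n_mics, n_frames = u_f.shape
    w = np.zeros(n_mics, dtype=complex)
    w[0] = 1.0                                              # initial filter
    for _ in range(n_iter):
        y = w.conj() @ u_f                                  # current extraction
        # auxiliary variable and weights: illustrative assumption only
        b = np.abs(y) / (r_f + 1e-12) + 1e-12
        weights = 1.0 / b
        cov = (u_f * weights) @ u_f.conj().T / n_frames     # weighted covariance
        eigval, eigvec = np.linalg.eigh(cov)
        w = eigvec[:, 0]                                    # smallest eigenvalue
    return w.conj()   # the extraction filter is the Hermitian transpose (row form)
```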
• The process proceeds to step ST34 after the extraction filter converges or the iteration has been performed a predetermined number of times.
  • step ST34 post-processing is performed by the post-processing unit 17C.
  • the extraction result is rescaled.
  • the inverse Fourier transform is performed as necessary to generate a waveform in the time domain.
  • Rescaling is a process of adjusting the scale of each frequency bin of the extraction result.
• In estimating the extraction filter, the norm of the filter is constrained to 1 in order to apply an efficient algorithm, but the extraction result generated by applying the extraction filter under this constraint has a scale different from that of the ideal target sound. Therefore, the scale of the extraction result is adjusted using the observation signal before decorrelation.
• The rescaling coefficient can be obtained as the value that minimizes the following equation (61), and its specific form is as shown in equation (62). x_i(f, t) in these equations is the observation signal (before decorrelation) that is the target of rescaling; how to select x_i(f, t) will be described later. The coefficient thus obtained is multiplied by the extraction result as shown in the following equation (63).
  • the extraction result y_1 (f, t) after rescaling corresponds to the component derived from the target sound in the observation signal of the i-th microphone. That is, it is almost equal to the signal observed by the i-th microphone when there is no sound source other than the target sound. Further, if necessary, the waveform of the extraction result is obtained by applying the inverse Fourier transform to the rescaled extraction result. As described above, the inverse Fourier transform can be omitted depending on the post-stage processing.
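A minimal sketch of the rescaling step is shown below; the closed-form coefficient is the standard least-squares solution that matches the scaled extraction result to the observation x_i(f, t), used here as an illustrative stand-in for equations (61) to (63).

```python
import numpy as np

def rescale(y_f, x_i_f):
    """y_f: extraction result for one bin, shape (n_frames,).
       x_i_f: observation of microphone i for the same bin, shape (n_frames,)."""
    coef = np.vdot(y_f, x_i_f) / (np.vdot(y_f, y_f) + 1e-12)  # least-squares fit
    return coef * y_f                                          # rescaled result
```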
• Next, the selection of the observation signal x_i(f, t) that is the target of rescaling is described. This depends on how the microphones are installed. Depending on the microphone installation form, there may be a microphone that picks up the target sound strongly. For example, in the installation form shown in FIG. 5, since a microphone is assigned to each speaker, the utterance of speaker i is picked up most strongly by microphone i. Therefore, the observation signal x_i(f, t) of microphone i can be used as the target of rescaling.
  • the rescaling target needs to be found by another method.
• In the following, a case where the microphones constituting the microphone array are fixed to one device and a case where the microphones are installed in a space (distributed microphones) are described.
• When the microphones are fixed to one device, the SN ratio (the power ratio between the target sound and the other signals) is considered to be almost the same for every microphone, so the observation signal of any microphone may be selected as the rescaling target x_i(f, t).
  • rescaling using delay and sum which is used in the technique described in Japanese Patent Application Laid-Open No. 2014-219467, can also be applied.
  • the utterance direction ⁇ is estimated at the same time in addition to the utterance section.
• Using the signal observed by the microphone array and the utterance direction θ, it is possible to generate, by delay-and-sum, a signal in which the sound coming from that direction is emphasized to some extent. If the result of performing delay-and-sum with respect to the direction θ is written as z(f, t, θ), the rescaling coefficient is calculated by the following equation (64).
• When the microphone array consists of distributed microphones, another method is used.
  • the signal-to-noise ratio of the observed signal differs from microphone to microphone, and it is expected that the signal-to-noise ratio will be high for microphones close to the speaker and low for microphones far away. Therefore, it is desirable to select a microphone that is close to the speaker as the observation signal that is the target of rescaling. Therefore, the observation signal of each microphone is rescaled, and the one that maximizes the power of the rescaling result is adopted.
• The magnitude of the power of the rescaling result is determined only by the magnitude of the absolute value of the rescaling coefficient. Therefore, the rescaling coefficient is calculated for each microphone number i by the following equation (65), the coefficient with the maximum absolute value is selected, and the rescaling is performed with it by the following equation (66).
  • ⁇ _ ⁇ max ⁇ it is also known which microphone picks up the speaker's utterance most. If the position of each microphone is known, it is possible to know about where the speaker is located in the space, and that information can be used in the subsequent processing.
• For example, the voice of the response from the dialogue system may be output from the loudspeaker presumed to be closest to the speaker.
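For the distributed-microphone case described above, the selection of the rescaling coefficient with the maximum absolute value could be sketched as follows; the least-squares form of the coefficient is an illustrative assumption standing in for equations (65) and (66).

```python
import numpy as np

def rescale_distributed(y_f, obs_f):
    """y_f: extraction result, shape (n_frames,).
       obs_f: observations of all microphones, shape (n_mics, n_frames)."""
    coefs = np.array([np.vdot(y_f, obs_f[i]) / (np.vdot(y_f, y_f) + 1e-12)
                      for i in range(obs_f.shape[0])])
    i_max = int(np.argmax(np.abs(coefs)))       # microphone closest to the speaker
    return coefs[i_max] * y_f, i_max
```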
  • the following effects can be obtained.
  • the multi-channel observation signal of the section where the target sound is sounding and the rough amplitude spectrogram of the target sound in the section are input, and the rough amplitude spectrogram is used as the reference signal.
  • the extraction result with higher accuracy than the reference signal, that is, closer to the true target sound is estimated.
• An objective function that reflects both the dependency between the reference signal and the extraction result and the independence between the extraction result and other virtual separation results is prepared, and the extraction filter is obtained as a solution that optimizes it.
  • the output signal can be limited to one sound source corresponding to the reference signal.
• The reference signal in the technique described in JP-A-2014-219467 and the like is a time envelope of the target sound, and it was assumed that the change in the time direction was common to all frequency bins. In contrast, the reference signal of this embodiment is an amplitude spectrogram; therefore, improvement in extraction accuracy can be expected when the change of the target sound in the time direction differs greatly for each frequency bin. Also, since the reference signal in the technique described in the above document was used only as the initial value of the iteration, there was a possibility that a sound source different from the reference signal would be extracted as a result of the iteration.
• In this embodiment, since the reference signal is used throughout the iteration as part of the sound source model, it is unlikely that a sound source different from the reference signal will be extracted.
• Compared with IDLMA (independent deeply learned matrix analysis): IDLMA cannot be applied when there is an unknown sound source, because it is necessary to prepare a different reference signal for each sound source. Moreover, it can be applied only when the number of microphones and the number of sound sources match.
• In contrast, the present embodiment is applicable as long as the reference signal of the single sound source to be extracted can be prepared.
• As a modification, the decorrelation and the filter estimation can be combined into one formula by using generalized eigenvalue decomposition. In that case, the process corresponding to decorrelation can be skipped.
• Let q_1(f) be a filter that directly generates the extraction result from the observation signal before decorrelation (without going through the decorrelated observation signal).
• Starting from equation (34), which represents the optimization problem corresponding to the TFVV Gaussian distribution, and using equation (67) together with equations (3) to (6), the optimization problem for q_1(f) is obtained as equation (68).
  • This equation is a constrained minimization problem different from equation (34), but it can be solved by using Lagrange's undetermined multiplier method. If the Lagrange undetermined multiplier is ⁇ and the equations to be optimized in Eq. (68) and the equations representing the constraints are put together to create an objective function, it can be written as Eq. (69) below.
  • Equation (70) represents a generalized eigenvalue problem, where ⁇ is one of the eigenvalues. Further, by multiplying both sides of the equation (70) by q_1 (f) from the left, the following equation (71) is obtained.
• The right side of equation (71) is the very function to be minimized in equation (68). Therefore, the minimum value of equation (71) is the smallest of the eigenvalues satisfying equation (70), and the extraction filter q_1(f) to be obtained is the Hermitian transpose of the eigenvector corresponding to that minimum eigenvalue.
• A function that takes two matrices A and B as arguments, solves the generalized eigenvalue problem for the two matrices, and returns all the eigenvectors is written as gev(A, B). Using this function, the eigenvectors of equation (70) can be written as in equation (72) below.
• v_{min}(f), ..., v_{max}(f) in equation (72) are the eigenvectors, and v_{min}(f) is the eigenvector corresponding to the minimum eigenvalue.
  • the extraction filter q_1 (f) is the Hermitian transpose of v_ ⁇ min ⁇ (f) as in equation (73).
• The extraction filter q_1(f) is the Hermitian transpose of the eigenvector v_{min}(f) corresponding to the minimum eigenvalue (equation (73)). Since q_1(f) does not converge in a single pass, equations (74) to (75) and (73) are executed repeatedly until convergence, or a predetermined number of times.
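A minimal sketch of this modification is shown below: gev(A, B) is realized with a generalized eigenvalue solver, and the filter is taken from the eigenvector of the smallest eigenvalue. The choice of A as a reference-weighted covariance matrix and B as the plain covariance matrix is an illustrative assumption consistent with the TFVV Gaussian case, not a literal transcription of equations (67) to (73).

```python
import numpy as np
from scipy.linalg import eigh

def estimate_filter_gev(x_f, r_f):
    """x_f: raw (not decorrelated) observations for one bin, (n_mics, n_frames).
       r_f: reference amplitudes for the same bin, (n_frames,)."""
    n_frames = x_f.shape[1]
    a = (x_f / (r_f ** 2 + 1e-12)) @ x_f.conj().T / n_frames  # weighted covariance
    b = x_f @ x_f.conj().T / n_frames                          # plain covariance
    eigvals, eigvecs = eigh(a, b)              # generalized eigenvalue problem
    q = eigvecs[:, 0]                          # eigenvector of the smallest eigenvalue
    return q.conj()                            # extraction filter q_1(f)^H
```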
  • the present disclosure may also adopt the following configuration.
• (1) A signal processing device including: a reference signal generation unit to which a mixed sound signal, recorded by microphones placed at different positions and containing a mixture of the target sound and sounds other than the target sound, is input, and which generates a reference signal corresponding to the target sound based on the mixed sound signal; and a sound source extraction unit that extracts, from the mixed sound signal, a signal that is similar to the reference signal and in which the target sound is more emphasized.
  • the signal processing device according to (1) which has a section detection unit that detects a section in which the target sound is included in the mixed sound signal.
• The signal processing device according to (1) or (2), wherein the sound source extraction unit has an extraction filter estimation unit that estimates a filter that extracts a signal in which the target sound is more emphasized.
• The signal processing device according to (3), wherein the extraction filter estimation unit estimates the filter as a solution that optimizes an objective function reflecting the dependency between the reference signal and the extraction result by the filter and the independence between the extraction result and the separation results of other virtual sound sources.
• (5) As the sound source model that represents the dependency between the reference signal and the extraction result included in the objective function, any of the following is used: a bivariate spherical distribution of the extraction result and the reference signal; a time-frequency-varying variance model that regards the reference signal as a value corresponding to the variance for each time-frequency bin; or a model that uses the divergence between the absolute value of the extraction result and the reference signal.
  • the sound source extraction unit A pre-processing unit that performs uncorrelated processing on the time-frequency domain observation signal as pre-processing for processing by the extraction filter estimation unit, and a pre-processing unit.
  • the signal processing apparatus according to any one of (3) to (8), which has at least a post-processing unit for applying the filter to the mixed sound signal.
• The signal processing device according to any one of (1) to (9), wherein the reference signal generation unit is provided with a neural network that extracts a speaker's voice given, as inputs, a signal in which voices are mixed and a clean voice of a predetermined speaker acquired at a timing different from that signal; the mixed sound signal and the clean voice are input to the neural network, and an amplitude spectrogram generated from the output of the neural network is generated as the reference signal.
• The signal processing apparatus according to any one of (1) to (9), wherein the reference signal generation unit estimates the arrival direction of the target sound, generates a time-frequency mask having the effect of leaving the sound arriving from a predetermined direction and reducing the sounds arriving from other directions, and generates, as the reference signal, the amplitude spectrogram obtained by applying the time-frequency mask to the amplitude spectrogram of the mixed sound signal.
  • the reference signal generator The signal processing device according to any one of (1) to (11), which generates the reference signal by using a sensor different from the microphone.
  • the reference signal generator The signal processing apparatus according to any one of (1) to (12), which generates a reference signal by inputting an extraction result by a filter estimated by the extraction filter estimation unit into a neural network.
  • the signal processing device according to any one of (1) to (13), wherein the microphone is a microphone assigned to each speaker.
  • the microphone is a microphone worn by a speaker.
  • a mixed sound signal that is recorded by microphones placed at different positions and is a mixture of the target sound and sounds other than the target sound is input.
• the reference signal generation unit generates a reference signal corresponding to the target sound based on the mixed sound signal, and
  • a signal processing method in which a sound source extraction unit extracts a signal similar to the reference signal from the mixed sound signal and in which the target sound is more emphasized.
  • a mixed sound signal that is recorded by microphones placed at different positions and is a mixture of the target sound and sounds other than the target sound is input.
• the reference signal generation unit generates a reference signal corresponding to the target sound based on the mixed sound signal, and the sound source extraction unit extracts, from the mixed sound signal, a signal that is similar to the reference signal and in which the target sound is more emphasized; a program that causes a computer to execute this signal processing method.
• ... Section estimation unit, 16 ... Reference signal estimation unit, 17 ... Sound source extraction unit, 17A ... Pre-processing unit, 17B ... Extraction filter estimation unit, 17C ... Post-processing unit, 20 ... Control unit, 100 ... Sound source extraction device


Abstract

Provided is a signal processing device comprising: a reference signal generation unit to which a mixed sound signal picked up by microphones disposed at different positions and obtained by mixing a target sound and sounds other than the target sound is inputted, and which generates a reference signal corresponding to the target sound on the basis of the mixed sound signal; and a sound source extraction unit which extracts, from the mixed sound signal, a signal which is similar to the reference signal and in which the target sound is further enhanced.

Description

Signal processing device, signal processing method, and program
The present disclosure relates to a signal processing device, a signal processing method, and a program.
Techniques have been proposed for extracting a target sound from a mixed sound signal in which a sound to be extracted (hereinafter referred to as the target sound as appropriate) and a sound to be removed (hereinafter referred to as the interfering sound as appropriate) are mixed (see, for example, Patent Documents 1 to 3 below).
Japanese Unexamined Patent Application Publication No. 2006-72163
Japanese Patent No. 4449871
Japanese Unexamined Patent Application Publication No. 2014-219467
In such fields, it is desired to improve the accuracy of extracting the target sound.
One of the purposes of the present disclosure is to provide a signal processing device, a signal processing method, a program, and a signal processing system with improved accuracy of extracting the target sound.
The present disclosure is, for example,
A mixed sound signal that is recorded by microphones placed at different positions and is a mixture of the target sound and sounds other than the target sound is input.
A reference signal generator that generates a reference signal corresponding to the target sound based on the mixed sound signal,
It is a signal processing device having a sound source extraction unit that extracts a signal that is similar to a reference signal from a mixed sound signal and has a more emphasized target sound.
The present disclosure is, for example,
A mixed sound signal that is recorded by microphones placed at different positions and is a mixture of the target sound and sounds other than the target sound is input.
The reference signal generator generates a reference signal corresponding to the target sound based on the mixed sound signal.
This is a signal processing method in which the sound source extraction unit extracts a signal that is similar to the reference signal and has a more emphasized target sound from the mixed sound signal.
The present disclosure is, for example,
A mixed sound signal that is recorded by microphones placed at different positions and is a mixture of the target sound and sounds other than the target sound is input.
The reference signal generator generates a reference signal corresponding to the target sound based on the mixed sound signal.
This is a program that causes a computer to execute a signal processing method in which the sound source extraction unit extracts, from the mixed sound signal, a signal that is similar to the reference signal and in which the target sound is more emphasized.
FIG. 1 is a diagram for explaining an example of the sound source separation process of the present disclosure.
FIG. 2 is a diagram for explaining an example of a sound source extraction method using a reference signal based on the deflation method.
FIG. 3 is a diagram referred to when explaining the process of generating a reference signal for each section and then performing sound source extraction.
FIG. 4 is a block diagram showing a configuration example of the sound source extraction device according to an embodiment.
FIG. 5 is a diagram referred to when explaining an example of section estimation and reference signal generation processing.
FIG. 6 is a diagram referred to when explaining another example of section estimation and reference signal generation processing.
FIG. 7 is a diagram referred to when explaining another example of section estimation and reference signal generation processing.
FIG. 8 is a diagram referred to when explaining the details of the sound source extraction unit according to the embodiment.
FIG. 9 is a flowchart referred to when explaining the overall processing flow performed by the sound source extraction device according to the embodiment.
FIG. 10 is a diagram referred to when explaining the processing performed by the STFT unit according to the embodiment.
FIG. 11 is a flowchart referred to when explaining the flow of the sound source extraction process according to the embodiment.
Hereinafter, embodiments and the like of the present disclosure will be described with reference to the drawings. The explanation will be given in the following order.
<Outline of this disclosure, background, and issues to be considered>
<Technology used in this disclosure>
<One Embodiment>
<Modification example>
The embodiments and the like described below are suitable specific examples of the present disclosure, and the contents of the present disclosure are not limited to these embodiments and the like.
[Notation in this specification]
(Formula notation)
In the following, the mathematical formula will be described according to the following notation.
・ "_" Represents a subscript character.
(Example) X_k ・ ・ ・ "k" is a subscript character.
-If there are multiple subscript characters, enclose them in "{...}".
(Example) R_ {xx} ・ ・ ・ "xx" is a subscript character.
・ "^" Represents a superscript.
(Example) W^H ... the Hermitian transpose (complex conjugate transpose) of the matrix W; y_k(f, t)^H ... the Hermitian transpose (conjugate and transpose) of the vector y_k(f, t); A^{-1} ... the inverse of the matrix A.
・conj(X) represents the complex conjugate of the complex number X. In the equations, the complex conjugate of X is represented by an overline on X.
・hat(x) means putting "^" on top of x.
・Assignment of a value is represented by "=" or "←". In particular, operations for which the equal sign does not hold on both sides (for example, "x ← x + 1") are always represented by "←".
・Matrices are shown in uppercase, and vectors and scalars in lowercase. Matrices and vectors are shown in bold, and scalars in italics.
(Definition of terms)
In this specification, "sound (signal)" and "voice (signal)" are used properly. "Sound" is used in a general sense such as sound and audio, and "voice" is used as a term for voice and speech.
In addition, "separation" and "extraction" are used properly as follows. "Separation" is the opposite of mixing, and is used as a term meaning that a signal obtained by mixing a plurality of original signals is divided into each original signal (there are multiple inputs and outputs). "Extraction" is used as a term meaning to extract one original signal from a signal in which a plurality of original signals are mixed. (There are multiple inputs, but one output.)
"Applying a filter" and "performing filtering" have the same meaning, and similarly, "applying a mask" and "performing a masking" have the same meaning.
<Outline of this disclosure, background, and issues to be considered>
First, in order to facilitate the understanding of the present disclosure, the outline, background, and issues to be considered in the present disclosure will be described.
(Summary of this disclosure)
The present disclosure concerns sound source extraction using a reference signal. In addition to recording, with a plurality of microphones, a signal in which the sound to be extracted (target sound) and the sound to be erased (interfering sound) are mixed, a "rough" amplitude spectrogram corresponding to the target sound is generated, and by using that amplitude spectrogram as a reference signal, the signal processing device produces an extraction result that is similar to the reference signal and more accurate than it. That is, one form of the present disclosure is a signal processing device that extracts, from a mixed sound signal, a signal that is similar to the reference signal and in which the target sound is more emphasized.
In the processing performed by the signal processing device, an objective function is prepared that reflects both the dependency (similarity) between the reference signal and the extraction result and the independence between the extraction result and other virtual separation results, and the extraction filter is obtained as a solution that optimizes it. By using the deflation method employed in blind sound source separation, the output signal can be limited to the single sound source corresponding to the reference signal. Since it can be regarded as a beamformer that considers both dependence and independence, it is hereinafter referred to as the Similarity-and-Independence-aware Beamformer (SIBF) as appropriate.
(background)
The present disclosure concerns sound source extraction using a reference signal. In addition to recording, with a plurality of microphones, a signal in which the sound to be extracted (target sound) and the sound to be erased (interfering sound) are mixed, a "rough" amplitude spectrogram corresponding to the target sound is acquired or generated, and by using that amplitude spectrogram as a reference signal, an extraction result that is similar to the reference signal and more accurate than it is produced.
The usage situation assumed in the present disclosure shall satisfy all of the following conditions (1) to (3), for example.
(1) Observation signals are recorded synchronously by a plurality of microphones.
(2) It is assumed that the section in which the target sound is sounding, that is, the time range is known, and the above-mentioned observation signal includes at least that section.
(3) As a reference signal, it is assumed that a rough amplitude spectrogram corresponding to the target sound (rough target sound spectrogram) has been acquired, or can be generated from the above-mentioned observation signal.
Supplement each of the above conditions.
Under the condition (1) above, each microphone may or may not be fixed, and the position of each microphone and sound source may be unknown in either case. An example of a fixed microphone is a microphone array, and an example of a non-fixed microphone is a case where each speaker wears a pin microphone or the like.
Under condition (2) above, the section in which the target sound is sounding is, for example, the utterance section in the case of extracting the voice of a specific speaker. While the section is known, it is unknown whether or not the target sound is sounding outside the section. That is, the assumption that no target sound exists outside the section may not hold.
In (3) above, a rough target sound spectrogram means a spectrogram that is degraded compared with the spectrogram of the true target sound in that it meets one or more of the following conditions a) to f).
a) Real number data that does not include phase information.
b) Although the target sound is predominant, the disturbing sound is also included.
c) The disturbing sound is almost eliminated, but the sound is distorted as a side effect.
d) The resolution is lower than that of the true target sound spectrogram in either or both of the time direction and the frequency direction.
e) The spectrogram amplitude scale is different from the observed signal, and the size comparison is meaningless. For example, even if the amplitude of the rough target sound spectrogram is half the amplitude of the observed signal spectrogram, it does not mean that the target sound and the disturbing sound are included in the observed signal with the same magnitude.
f) Amplitude spectrogram generated from a signal other than sound.
The rough target sound spectrogram as described above is acquired or generated by, for example, the following method.
-Record the sound with a microphone installed near the target sound (for example, a pin microphone attached to the speaker), and obtain the amplitude spectrogram from it. (Corresponds to the example of b above)
-A neural network (NN) that extracts a specific type of sound in the amplitude spectrogram region is learned in advance, and an observation signal is input to the neural network (NN). (Equivalent to a, c, e above)
-Amplitude spectrogram is obtained from a signal acquired by a sensor other than the normally used air conduction microphone such as a bone conduction microphone. (Equivalent to c above)
-A spectrogram in the linear frequency domain is generated by applying a predetermined conversion to the spectrogram-equivalent data calculated in the non-linear frequency domain such as the mel frequency. (Equivalent to a, d, e above)
-Instead of a microphone, use a sensor that can observe the vibration of the skin surface near the speaker's mouth and throat, and obtain the amplitude spectrogram from the signal acquired by that sensor. (Equivalent to d, e, f above)
One object of the present disclosure is to use the rough target sound spectrogram acquired or generated in this way as a reference signal, and to generate an extraction result whose accuracy exceeds that of the reference signal (in which the target sound is further emphasized, in other words, which is closer to the true target sound). More specifically, in the sound source extraction process in which a linear filter is applied to the multi-channel observation signal to generate the extraction result, a linear filter that generates an extraction result whose accuracy exceeds that of the reference signal (closer to the true target sound) is estimated.
In the present disclosure, the reason for estimating the linear filter for the sound source extraction process is to enjoy the following advantages of the linear filter.
Advantage 1: The distortion of the extraction result is small compared to the non-linear extraction process. Therefore, when combined with voice recognition or the like, it is possible to avoid a decrease in recognition accuracy due to distortion.
Advantage 2: The phase of the extraction result can be appropriately estimated by the rescaling process described later. Therefore, it is possible to avoid a problem caused by an inappropriate phase when combined with a phase-dependent post-stage processing (including a case where the extraction result is reproduced as a sound and a human hears it).
Advantage 3: By increasing the number of microphones, it is easy to improve the extraction accuracy.
(Issues to be considered in this disclosure)
One of the purposes of the present disclosure will be described again as follows.
Purpose: Estimate a linear filter to generate extraction results with higher accuracy than the signal of c), assuming that the following conditions a) to c) are met.
a) There is a signal recorded by a multi-channel microphone. The arrangement of microphones and the position of each sound source may be unknown.
b) The section in which the target sound (the sound to be retained) is sounding is known. However, it is unknown whether the target sound exists outside the section.
c) A rough amplitude spectrogram (or similar data) of the target sound can be acquired or generated. The amplitude spectrogram is real and the phase is unknown.
However, a linear filtering method that satisfies all of the above three conditions has not existed in the past. The following three types are mainly known as general linear filtering methods.
- Adaptive beamformer
- Blind sound source separation
- Existing linear filtering processes using a reference signal
The problems of each method are described below.
(Problems of adaptive beam former)
The adaptive beamformer referred to here is a method that adaptively estimates a linear filter for extracting the target sound, using the signals observed by a plurality of microphones and information indicating which sound source is to be extracted as the target sound. Examples of the adaptive beamformer include the methods described in JP-A-2012-234150 and JP-A-2006-072163.
Below, the SN ratio (signal-to-noise ratio) maximizing beamformer (also known as the GEV beamformer) is described as an adaptive beamformer that can be used even when the placement of the microphones and the direction of the target sound are unknown.
The SN ratio maximizing beamformer (maximum SNR beamformer) is a method for finding a linear filter that maximizes the ratio V_s / V_n between a) and b) below.
a) The variance V_s of the result of applying a given linear filter to a section in which only the target sound is sounding
b) The variance V_n of the result of applying the same linear filter to a section in which only the interfering sound is sounding
With this method, a linear filter can be estimated as long as each of these two kinds of sections can be detected, and neither the placement of the microphones nor the direction of the target sound is required.
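A minimal sketch of the maximum-SNR (GEV) beamformer described above is given below for one frequency bin: the filter maximizing V_s / V_n is the eigenvector belonging to the largest generalized eigenvalue of the pair of covariance matrices computed from a target-only section and an interference-only section. Variable names are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def max_snr_beamformer(x_target_only, x_noise_only):
    """Both inputs: complex observations for one frequency bin,
       shape (n_mics, n_frames)."""
    r_s = x_target_only @ x_target_only.conj().T / x_target_only.shape[1]
    r_n = x_noise_only @ x_noise_only.conj().T / x_noise_only.shape[1]
    eigvals, eigvecs = eigh(r_s, r_n)          # generalized eigenvalue problem
    return eigvecs[:, -1]                      # filter for the largest eigenvalue
```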
However, in the situations assumed by the present disclosure, the only known section is the timing at which the target sound is sounding. Since both the target sound and the interfering sound are present in that section, it cannot be used as either section a) or section b) above. Other adaptive beamformer methods are also difficult to use in the situations to which the present disclosure applies, because the section of b) above is required separately, or because the direction of the target sound must be known.
(Problems of blind separation)
Blind sound source separation is a technique for estimating each sound source from a signal in which a plurality of sound sources are mixed, using only the signals observed by a plurality of microphones (without using information such as the directions of the sound sources or the arrangement of the microphones). An example of such a technique is the technique of Japanese Patent No. 4449871, which is an example of a technique called Independent Component Analysis (hereinafter referred to as ICA as appropriate); ICA decomposes the signals observed by N microphones into N sound sources. The observation signal used at that time only needs to include the section in which the target sound is sounding, and information on sections in which only the target sound or only the interfering sound is sounding is unnecessary.
Therefore, by applying ICA to the observation signal of the section in which the target sound is sounding, decomposing it into N components, and then selecting the single component that is most similar to the rough target sound spectrogram serving as the reference signal, ICA can be used in the situations to which the present disclosure applies. As a method of judging similarity, each separation result is converted into an amplitude spectrogram, the squared error (Euclidean distance) between each amplitude spectrogram and the reference signal is calculated, and the separation result corresponding to the amplitude spectrogram with the smallest error is adopted.
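The selection step just described could be sketched as follows: each separation result is converted to an amplitude spectrogram, and the one with the smallest squared error to the reference is chosen.

```python
import numpy as np

def select_by_reference(separated_specs, reference):
    """separated_specs: complex spectrograms, shape (n_sources, n_freq, n_frames).
       reference: rough target amplitude spectrogram, shape (n_freq, n_frames)."""
    errors = [np.sum((np.abs(s) - reference) ** 2) for s in separated_specs]
    return int(np.argmin(errors))              # index of the most similar source
```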
However, this select-after-separation approach has the following problems.
1) Even though only one sound source is desired, N sound sources are generated in the intermediate step, which is disadvantageous in terms of computational cost and memory usage.
2) The rough target-sound spectrogram used as the reference signal is used only in the step of selecting one sound source out of the N sound sources, not in the step of separating the mixture into N sound sources. Consequently, the reference signal does not contribute to improving extraction accuracy.
(Problems of existing linear filtering processes that use a reference signal)
Several methods that estimate a linear filter using a reference signal already exist. Here, the following two are discussed as such techniques:
a) independent deeply learned matrix analysis;
b) sound source extraction using a time envelope as the reference signal.
Independent Deeply Learned Matrix Analysis (hereinafter referred to as IDLMA as appropriate) is an extension of independent component analysis. For details, see Reference 1 below.
(Reference 1)
N. Makishima et al., "Independent Deeply Learned Matrix Analysis for Determined Audio Source Separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 10, pp. 1601-1615, Oct. 2019. doi: 10.1109/TASLP.2019.2925450
A distinguishing feature of IDLMA is that a neural network (NN) that generates the power spectrogram (the square of the amplitude spectrogram) of each sound source to be separated is trained in advance. For example, to separate the individual instrument parts from a piece of music in which multiple instruments are played simultaneously, NNs that take the piece as input and output each instrument's sound are trained in advance. At separation time, the observation signal is input to each NN, and the output power spectrograms are used as reference signals for the separation. Compared with fully blind separation, an improvement in separation accuracy corresponding to the use of the reference signals can therefore be expected. Furthermore, it has been reported that by feeding a once-generated separation result back into each NN, power spectrograms more accurate than the initial ones are obtained, and performing separation with those as reference signals yields separation results more accurate than the first pass.
However, it is difficult to use IDLMA in situations where the present disclosure is applicable, for the following reasons.
IDLMA requires N different power spectrograms as reference signals in order to generate N separation results. Therefore, even if only one sound source is of interest and the others are unnecessary, reference signals must be prepared for all sound sources, which may be difficult in practice. In addition, Reference 1 above addresses only the case in which the number of microphones equals the number of sound sources, and does not mention how many reference signals should be prepared when the two numbers differ. Moreover, since IDLMA is a sound source separation method, using it for sound source extraction requires a step that first generates N separation results and then keeps only one. The sound source separation drawback of wasted computational cost and memory usage therefore remains.
Sound source extraction using a time envelope as the reference signal includes, for example, the technique described in Japanese Patent Application Laid-Open No. 2014-219467, proposed by the present inventor. Like the present disclosure, this method estimates a linear filter using a reference signal and a multi-channel observation signal. However, it differs in the following respects.
- The reference signal is a time envelope, not a spectrogram. It corresponds to a rough target-sound spectrogram flattened by applying an operation such as averaging in the frequency direction. Therefore, when the target sound has the characteristic that its temporal variation differs per frequency, the reference signal cannot represent this adequately, and as a result the extraction accuracy may degrade.
- The reference signal is reflected only as an initial value in the iterative process for obtaining the extraction filter. Since the second and subsequent iterations are not constrained by the reference signal, a sound source different from the reference signal may be extracted. For example, when a sound that occurs only momentarily exists within the section, extracting that sound is more optimal in terms of the objective function, so an unintended sound may be extracted depending on the number of iterations.
As described above, the techniques discussed so far are either difficult to use in situations where the present disclosure is applicable, or cannot provide extraction results of sufficient accuracy.
[Technology used in the present disclosure]
Next, the technology used in the present disclosure is described. By introducing both of the following elements into a blind sound source separation method based on independent component analysis, a sound source extraction technique suited to the purpose of the present disclosure can be realized.
Element 1: In the separation process, prepare an objective function that reflects not only the mutual independence of the separation results but also the dependence between one of the separation results and the reference signal, and optimize it.
Element 2: Also in the separation process, introduce a method called the deflation method, which separates sound sources one at a time, and terminate the separation process as soon as the first sound source has been separated.
The sound source extraction technology of the present disclosure extracts one desired sound source from multi-channel observation signals observed by a plurality of microphones by applying an extraction filter, which is a linear filter. It can therefore be regarded as a kind of beamformer (BF). In the extraction process, both the similarity between the reference signal and the extraction result and the independence between the extraction result and the other separation results are reflected. Accordingly, the sound source extraction method of the present disclosure is referred to as a Similarity-and-Independence-aware Beamformer (SIBF) as appropriate.
The separation process of the present disclosure is described with reference to FIG. 1. The area enclosed by the frame labeled (1-1) is the separation process assumed in conventional time-frequency-domain independent component analysis (Japanese Patent No. 4449871 and the like), while the elements (1-5) and (1-6) outside that frame are added by the present disclosure. In the following, conventional time-frequency-domain blind sound source separation is first explained using the contents of frame (1-1), and then the separation process of the present disclosure is explained.
In FIG. 1, X_1 to X_N are the observation signal spectrograms (1-2) corresponding to the N microphones. They consist of complex-valued data and are generated by applying the short-time Fourier transform, described later, to the sound waveform observed by each microphone. In each spectrogram, the vertical axis represents frequency and the horizontal axis represents time. The time length is equal to or longer than the duration during which the target sound to be extracted is active.
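As an illustration of how such observation signal spectrograms can be produced, the following is a minimal sketch using scipy's STFT; the frame length, hop size, and array layout are assumptions, and any short-time Fourier transform with equivalent parameters could be used.

```python
import numpy as np
from scipy.signal import stft

def observation_spectrograms(waveforms, fs, frame_len=1024, hop=256):
    """Compute X_1..X_N from multichannel waveforms.

    waveforms: array of shape (N, num_samples), one row per microphone (assumed layout)
    Returns a complex array of shape (N, F, T): the observation signal spectrograms.
    """
    _, _, X = stft(waveforms, fs=fs, nperseg=frame_len,
                   noverlap=frame_len - hop, axis=-1)
    return X  # X[k] is the spectrogram of microphone k (frequency x frame)
```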
In independent component analysis, the separation result spectrograms Y_1 to Y_N are generated by multiplying these observation signal spectrograms by a predetermined square matrix called the separation matrix, labeled (1-3) (1-4). The number of separation result spectrograms is N, the same as the number of microphones. In the separation, the values of the separation matrix are determined so that Y_1 to Y_N become statistically independent (that is, so that the differences among Y_1 to Y_N become as large as possible). Since such a matrix cannot be obtained in a single step, an objective function reflecting the mutual independence of the separation result spectrograms is prepared, and a separation matrix that makes this function optimal (maximum or minimum, depending on the nature of the objective function) is obtained iteratively. After the separation matrix and the separation result spectrograms have been obtained, applying the inverse Fourier transform to each separation result spectrogram yields waveforms that are estimates of the individual sound sources before mixing.
The above is the separation process of conventional independent component analysis in the time-frequency domain. The present disclosure adds the two elements described above to it.
One of the added elements is the dependence on the reference signal. The reference signal is a rough amplitude spectrogram of the target sound and is generated by the reference signal generation unit labeled (1-5). In the separation process, in addition to the mutual independence of the separation result spectrograms, the dependence between Y_1, one of the separation result spectrograms, and the reference signal R is also taken into account when determining the separation matrix. That is, both of the following are reflected in the objective function, and a separation matrix that optimizes that function is obtained:
a) the independence among Y_1 to Y_N (solid line L1);
b) the dependence between Y_1 and R (dotted line L2).
The concrete formula of the objective function is described later.
Reflecting both independence and dependence in the objective function provides the following advantages.
Advantage 1: In ordinary time-frequency-domain independent component analysis, it is indeterminate which original signal appears at which position among the separation result spectrograms; this varies with the initial value of the separation matrix, the degree of mixing in the observation signal (the signal corresponding to the mixed sound signal described later), and the algorithm used to obtain the separation matrix. In contrast, because the present disclosure also takes into account the dependence between the separation result Y_1 and the reference signal R in addition to independence, a spectrogram similar to R can always be made to appear in Y_1.
Advantage 2: Merely solving the problem of making Y_1, one of the separation results, similar to the reference signal R can bring Y_1 closer to R, but cannot exceed the reference signal R in terms of extraction accuracy (that is, get even closer to the target sound). In the present disclosure, because the mutual independence of the separation results is also taken into account, the extraction accuracy of the separation result Y_1 can exceed that of the reference signal.
However, even when the dependence on the reference signal is introduced into time-frequency-domain independent component analysis, it is still a separation method, so N signals are generated. That is, even if the only desired sound source is Y_1, N-1 signals are generated at the same time even though they are unnecessary.
Therefore, the deflation method is introduced as the other added element. The deflation method is a scheme that estimates the original signals one at a time instead of separating all sound sources simultaneously. For a general explanation of the deflation method, see, for example, Chapter 8 of Reference 2 below.
(Reference 2)
Aapo Hyvaerinen, Juha Karhunen, and Erkki Oja, "Independent Component Analysis" (Japanese edition: "詳解 独立成分分析―信号解析の新しい世界", translated by Iku Nemoto and Maki Kawakatsu)
In general, even with the deflation method the order of the separation results is indeterminate, so it is indeterminate in which position the desired sound source appears. However, when the deflation method is applied to sound source separation using an objective function that reflects both independence and dependence as described above, a separation result similar to the reference signal can always be made to appear first. That is, the separation process can be terminated as soon as the first sound source has been separated (estimated), and there is no need to generate the unnecessary N-1 separation results. Moreover, it is not necessary to estimate all elements of the separation matrix; only the elements needed to generate Y_1 need be estimated.
In the deflation method that estimates only one sound source, among the separation results labeled (1-4) in FIG. 1, those other than Y_1 (that is, Y_2 to Y_N) are virtual and are not actually generated. Nevertheless, the computation concerning independence is equivalent to what would be done using all the separation results Y_1 to Y_N. Therefore, the advantage of sound source separation that Y_1 can be made more accurate than R by taking independence into account is retained, while the waste of generating the unnecessary separation results Y_2 to Y_N is avoided.
The deflation method is a separation scheme (estimating all sound sources before mixing), but when the separation is stopped after one sound source has been estimated, it can be used as an extraction scheme (estimating one desired sound source). In the following description, the operation of estimating only the separation result Y_1 is therefore called "extraction", and Y_1 is referred to as the "(target sound) extraction result" as appropriate. Furthermore, each separation result is generated from a vector constituting the separation matrix labeled (1-3); this vector is referred to as the "extraction filter" as appropriate.
A sound source extraction scheme using a reference signal based on the deflation method is described with reference to FIG. 2. FIG. 2 shows the details of FIG. 1, with the elements required to apply the deflation method added.
The observation signal spectrograms labeled (2-1) in FIG. 2 are identical to (1-2) in FIG. 1 and are generated by applying the short-time Fourier transform to the time-domain signals observed by the N microphones. By applying to these observation signal spectrograms the process labeled (2-2), called decorrelation, the decorrelated observation signal spectrograms labeled (2-3) are generated. Decorrelation is also called whitening and is a transformation that makes the signals observed by the microphones uncorrelated with one another. The concrete formulas used in this process are described later. When decorrelation is performed as preprocessing for separation, efficient algorithms that exploit the properties of uncorrelated signals become applicable to the separation; the deflation method is one such algorithm.
The number of decorrelated observation signal spectrograms is the same as the number of microphones; they are denoted U_1 to U_N. The decorrelated observation signal spectrograms need to be generated only once, as a process prior to obtaining the extraction filter. As explained with reference to FIG. 1, in the deflation method, instead of estimating a matrix that generates the separation results Y_1 to Y_N simultaneously, the filters that generate the individual separation results are estimated one at a time. Since only Y_1 is generated in the present disclosure, the only filter to be estimated is w_1, which takes U_1 to U_N as input and generates Y_1; Y_2 to Y_N and w_2 to w_N are virtual and are not actually generated.
The reference signal R labeled (2-8) is identical to (1-6) in FIG. 1. As described above, in estimating the filter w_1, both the independence among Y_1 to Y_N and the dependence between R and Y_1 are taken into account.
In the sound source extraction method of the present disclosure, only one sound source is estimated (extracted) per section. Therefore, when there are multiple sound sources to be extracted, that is, multiple target sounds, and the sections in which they are active overlap, each overlapping section is detected, a reference signal is generated for each section, and sound source extraction is then performed per section. This point is described with reference to FIG. 3.
In the example shown in FIG. 3, the target sounds are human speech, and the number of target-sound sources, that is, the number of speakers, is two. Of course, the target sound may be any kind of sound, and the number of sound sources is not limited to two. It is also assumed that zero or more interfering sounds, which are not targets of extraction, are present. Non-speech signals are interfering sounds, and even speech is treated as an interfering sound if it is output from a device such as a loudspeaker.
The two speakers are referred to as speaker 1 and speaker 2. In FIG. 3, the utterances labeled (3-1) and (3-2) are utterances of speaker 1, and the utterances labeled (3-3) and (3-4) are utterances of speaker 2. (3-5) represents an interfering sound. In FIG. 3, the vertical axis represents the difference in sound source position and the horizontal axis represents time. The utterance sections of (3-1) and (3-3) partially overlap; this corresponds, for example, to speaker 2 starting to speak just before speaker 1 finishes speaking. Utterances (3-2) and (3-4) also overlap; this corresponds, for example, to speaker 2 making a short utterance such as a back-channel response while speaker 1 is speaking at length. Both are phenomena that frequently occur in conversations between people.
First, consider the extraction of utterance (3-1). Within the time range (3-6) in which utterance (3-1) is made, a total of three sound sources are present: speaker 1's utterance (3-1), part of speaker 2's utterance (3-3), and part of the interfering sound (3-5). In the present disclosure, extracting utterance (3-1) means generating (estimating), from the reference signal corresponding to utterance (3-1), that is, a rough amplitude spectrogram, and the observation signal of the time range (3-6) (a mixture of the three sound sources), a signal that is as close to clean as possible (consisting only of speaker 1's voice and containing none of the other sound sources).
Similarly, in extracting speaker 2's utterance (3-3), a signal close to speaker 2's clean speech is estimated using the reference signal corresponding to (3-3) and the observation signal of the time range (3-7). Thus, even when utterance sections overlap, the present disclosure can generate distinct extraction results as long as a reference signal corresponding to each target sound can be prepared.
Likewise, although the time range of speaker 2's utterance (3-4) is completely contained within speaker 1's utterance (3-2), distinct extraction results can be generated by preparing a separate reference signal for each. That is, to extract utterance (3-2), the reference signal corresponding to utterance (3-2) and the observation signal of the time range (3-8) are used; to extract utterance (3-4), the reference signal corresponding to utterance (3-4) and the observation signal of the time range (3-9) are used.
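To illustrate how this per-section processing might be organized, the following is a minimal sketch; the helper callables (make_reference, extract) and the section format are assumptions, since the present disclosure does not prescribe this orchestration.

```python
def extract_all_utterances(waveforms, fs, sections, make_reference, extract):
    """Per-section extraction as described for FIG. 3 (sketch; helpers are assumptions).

    waveforms:      multichannel observation, shape (N, num_samples)
    sections:       list of (start_sample, end_sample) for each detected target utterance
    make_reference: callable returning the rough amplitude spectrogram R for a section
    extract:        callable implementing the extraction for (observation slice, fs, R)
    """
    results = []
    for start, end in sections:
        segment = waveforms[:, start:end]        # observation signal of this section
        R = make_reference(segment, fs)          # reference signal for this target sound
        results.append(extract(segment, fs, R))  # one extraction result per section
    return results
```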
Next, the objective function used in estimating the filter and the algorithm for optimizing it are described using mathematical formulas.
The observation signal spectrogram X_k corresponding to the k-th microphone is expressed as a matrix whose elements are x_k(f, t), as shown in Equation (1) below.

  X_k = [ x_k(f, t) ]  (f = 1, ..., F; t = 1, ..., T)   ... (1)

In Equation (1), f is the frequency bin number and t is the frame number, both of which are indices produced by the short-time Fourier transform. In the following, varying f is referred to as the "frequency direction" and varying t as the "time direction".
The decorrelated observation signal spectrogram U_k and the separation result spectrogram Y_k are likewise expressed as matrices whose elements are u_k(f, t) and y_k(f, t), respectively (the formulas are omitted).
In addition, the vector x(f, t) whose elements are the observation signals of all microphones (all channels) at a specific f and t is expressed as in Equation (2) below.

  x(f, t) = [ x_1(f, t), ..., x_N(f, t) ]^T   ... (2)
For the decorrelated observation signal and the separation result, vectors u(f, t) and y(f, t) of the same shape are likewise prepared (the formulas are omitted).
Equation (3) below is the equation for obtaining the vector u(f, t) of the decorrelated observation signal.

  u(f, t) = P(f) x(f, t)   ... (3)

This vector is generated as the product of P(f), called the decorrelation matrix, and the observation signal vector x(f, t). The decorrelation matrix P(f) is computed by Equations (4) to (6) below.

  R_{xx}(f) = < x(f, t) x(f, t)^H >_t   ... (4)

  R_{xx}(f) = V(f) D(f) V(f)^H   ... (5)

  P(f) = D(f)^(-1/2) V(f)^H   ... (6)
Equation (4) above is the equation for obtaining the covariance matrix R_{xx}(f) of the observation signal in the f-th frequency bin. On the right-hand side, <.>_t denotes the operation of computing the average over a predetermined range of t (frame numbers). In the present disclosure, the range of t is the time length of the spectrogram, that is, the section in which the target sound is active (or a range containing that section). The superscript H denotes the Hermitian transpose (conjugate transpose).
Eigenvalue decomposition is applied to the covariance matrix R_{xx}(f) to decompose it into the product of three terms as on the right-hand side of Equation (5). V(f) is a matrix of eigenvectors, and D(f) is a diagonal matrix of eigenvalues. V(f) is a unitary matrix, so the inverse of V(f) and the Hermitian transpose of V(f) are identical.
The decorrelation matrix P(f) is computed by Equation (6). Since D(f) is a diagonal matrix, its (-1/2)-th power is obtained by raising each diagonal element to the power of -1/2.
Since the elements of the decorrelated observation signal u(f, t) obtained in this way are mutually uncorrelated, the covariance matrix computed by Equation (7) below is the identity matrix I.

  < u(f, t) u(f, t)^H >_t = I   ... (7)
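A minimal numpy sketch of the decorrelation of Equations (3) to (7) might look as follows; the array shapes are assumptions, and the whitening matrix follows Equation (6) as reconstructed above.

```python
import numpy as np

def decorrelate(X):
    """Whitening (decorrelation) per frequency bin, following Eqs. (3)-(6).

    X: complex array of shape (N, F, T), observation signal spectrograms
    Returns U (same shape) and the decorrelation matrices P of shape (F, N, N).
    """
    N, F, T = X.shape
    U = np.empty_like(X)
    P = np.empty((F, N, N), dtype=complex)
    for f in range(F):
        x_f = X[:, f, :]                               # (N, T)
        R_xx = (x_f @ x_f.conj().T) / T                # Eq. (4): time-averaged covariance
        eigval, V = np.linalg.eigh(R_xx)               # Eq. (5): eigendecomposition
        D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(eigval, 1e-12)))
        P[f] = D_inv_sqrt @ V.conj().T                 # Eq. (6): P(f) = D(f)^(-1/2) V(f)^H
        U[:, f, :] = P[f] @ x_f                        # Eq. (3): u(f,t) = P(f) x(f,t)
    return U, P
```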
Equation (8) below generates the separation results y(f, t) for all channels at f, t, and is given by the product of the separation matrix W(f) and u(f, t). The method for obtaining W(f) is described later.

  y(f, t) = W(f) u(f, t)   ... (8)
Equation (9) generates only the k-th separation result, where w_k(f) is the k-th row vector of the separation matrix W(f). Since the present disclosure generates only Y_1 as the extraction result, Equation (9) is basically used with k = 1 only.

  y_k(f, t) = w_k(f) u(f, t)   ... (9)
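Applying an estimated extraction filter as in Equation (9) with k = 1 can be sketched as follows; the array shapes are assumptions.

```python
import numpy as np

def apply_extraction_filter(w1, U):
    """Eq. (9) with k = 1: y_1(f,t) = w_1(f) u(f,t).

    w1: complex array of shape (F, N), one row vector per frequency bin (assumed layout)
    U:  complex array of shape (N, F, T), decorrelated observation signal spectrograms
    Returns Y1 of shape (F, T).
    """
    return np.einsum('fn,nft->ft', w1, U)
```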
When decorrelation has been performed as preprocessing for separation, it has been proved that it is sufficient to search for the separation matrix W(f) among unitary matrices. When the separation matrix W(f) is unitary, Equation (10) below holds, and the row vectors w_k(f) constituting W(f) satisfy Equation (11) below. Exploiting this property makes separation by the deflation method possible. (Like Equation (9), Equation (11) is basically used with k = 1 only.)

  W(f) W(f)^H = I   ... (10)

  w_k(f) w_k(f)^H = 1   ... (11)
The reference signal R is expressed as a matrix whose elements are r(f, t), as in Equation (12). Its shape is the same as that of the observation signal spectrogram X_k, but whereas the elements x_k(f, t) of X_k are complex-valued, the elements r(f, t) of R are non-negative real numbers.

  R = [ r(f, t) ]   ... (12)
The present disclosure estimates only w_1(f) instead of estimating all the elements of the separation matrix W(f). That is, only the elements used for generating the first separation result (the target sound extraction result) are estimated. In the following, the derivation of the formula for estimating w_1(f) is described. The derivation consists of the following three parts, each of which is explained in turn.
(1) Objective function
(2) Sound source model
(3) Update rules
(1) Objective function
The objective function used in the present disclosure is the negative log-likelihood, and it is basically the same as that used in Reference 1 and elsewhere. This objective function attains its minimum when the separation results are mutually independent. In the present disclosure, however, the dependence between the extraction result and the reference signal is also to be reflected in the objective function, so the objective function is derived as follows.
To reflect the above dependence in the objective function, the decorrelation and separation (extraction) equations are slightly modified. Equation (13) is a modification of Equation (3), the decorrelation equation, and Equation (14) is a modification of Equation (8), the separation equation. In both, the reference signal r(f, t) is appended to the vectors on both sides, and an element of 1, representing "passing the reference signal through unchanged", is appended to the matrix on the right-hand side. The matrices and vectors with these elements appended are denoted by adding a prime to the original symbols.

  u'(f, t) = P'(f) x'(f, t),  where  u'(f, t) = [ r(f, t); u(f, t) ],  x'(f, t) = [ r(f, t); x(f, t) ],  P'(f) = [ 1, 0; 0, P(f) ]   ... (13)

  y'(f, t) = W'(f) u'(f, t) = W'(f) P'(f) x'(f, t),  where  y'(f, t) = [ r(f, t); y(f, t) ],  W'(f) = [ 1, 0; 0, W(f) ]   ... (14)
As the objective function, the negative log-likelihood L of the reference signal and the observation signals, expressed by Equation (15) below, is used. In this equation, p(.) denotes the probability density function (hereinafter referred to as pdf as appropriate) of the signal in parentheses. When multiple elements are written inside the parentheses of a pdf (multiple variables, or a matrix or vector), it denotes the probability that those elements occur jointly. For example, p(R, X_1, ..., X_N) in Equation (15) is the joint probability of the reference signal R and the observation signal spectrograms X_1 to X_N.

  L = -log p(R, X_1, ..., X_N)   ... (15)
Even where the same letter p is used, different variables in the parentheses denote different probability distributions; for example, p(R) and p(Y_1) are different functions. Most of the probability density functions appearing in the equations below are virtual; the only one to which a concrete expression needs to be assigned is p(r(f,t), y_1(f,t)), which appears at the end of the derivation.
To optimize (in this case, minimize) with respect to the extraction filter w_1(f), the negative log-likelihood L must be transformed so that it contains w_1(f). To that end, the following assumptions are made about the observation signals and the separation results.
Assumption 1: The observation signal spectrograms have dependence in the channel direction (in other words, the spectrograms corresponding to the microphones resemble one another), but are independent in the time direction and the frequency direction. That is, within one spectrogram, the components constituting each point occur independently of one another and are not affected by other times or frequencies.
Assumption 2: The separation result spectrograms are independent in the channel direction as well as in the time and frequency directions. That is, the separation result spectrograms do not resemble one another.
Assumption 3: The separation result spectrogram Y_1 and the reference signal have a dependence. That is, their spectrograms resemble each other.
The process of transforming p(R, X_1, ..., X_N) is shown in Equations (16) to (21).

  p(R, X_1, ..., X_N) = prod_{f,t} p( r(f,t), x_1(f,t), ..., x_N(f,t) )   ... (16)

  = prod_{f,t} p( x'(f,t) )   ... (17)

  = prod_{f,t} |det( W'(f) P'(f) )| p( y'(f,t) )   ... (18)

  = prod_{f,t} |det( W'(f) )| |det( P'(f) )| p( y'(f,t) )   ... (19)

  = prod_{f,t} const * p( y'(f,t) )   ... (20)

  = prod_{f,t} const * p( r(f,t), y_1(f,t) ) * prod_{k=2}^{N} p( y_k(f,t) )   ... (21)
Since the joint probability of mutually independent variables can be decomposed into the product of their individual pdfs, Assumption 1 transforms the left-hand side of Equation (16) into the right-hand side. The expression inside the parentheses on the right-hand side is written as in Equation (17) using x'(f,t) introduced in Equation (13).
Equation (17) is transformed into Equations (18) and (19) using the relationship in the lower part of Equation (14). In these equations, det(.) denotes the determinant of the matrix in parentheses.
Equation (20) is an important transformation for the deflation method. The matrix W'(f) is a unitary matrix, like the separation matrix W(f), so its determinant is 1. Also, since the matrix P'(f) does not change during the separation, its determinant is a constant. Therefore, the two determinants can together be written as const (a constant).
Equation (21) is a transformation unique to the present disclosure. The components of y'(f,t) are r(f,t) and y_1(f,t) to y_N(f,t); by Assumptions 2 and 3, the probability density function taking these variables as arguments is decomposed into the product of p(r(f,t), y_1(f,t)), the joint probability of r(f,t) and y_1(f,t), and the individual probability density functions p(y_2(f,t)) to p(y_N(f,t)) of y_2(f,t) to y_N(f,t).
Substituting Equation (21) into Equation (15) gives Equation (22).

  L = - sum_{f,t} [ log p( r(f,t), y_1(f,t) ) + sum_{k=2}^{N} log p( y_k(f,t) ) ] + const   ... (22)

The extraction filter w_1(f) is a subset of the arguments that minimize Equation (22). Among the terms of Equation (22), w_1(f) appears only in y_1(f,t) for the specific f, so w_1(f) is obtained as the minimizing solution of Equation (23) below. However, to exclude the trivial solution w_1(f) = 0, the constraint that the norm of the vector is 1, expressed by Equation (11), is imposed.

  w_1(f) = argmin_{w_1(f)} ( - sum_t log p( r(f,t), y_1(f,t) ) )   subject to   w_1(f) w_1(f)^H = 1   ... (23)
When an extraction filter constrained to have a norm of 1 is applied to the decorrelated observation signal, the scale of each frequency bin of the resulting extraction result differs from the scale of the true target sound. Therefore, after the filter has been estimated, the extraction filter and the extraction result are corrected for each frequency bin. This post-processing is called rescaling. The concrete formula for rescaling is described later.
To solve the minimization problem of Equation (23), the following two points need to be made concrete.
- What expression to assign to p(r(f,t), y_1(f,t)), the joint probability of r(f,t) and y_1(f,t). This probability density function is called the sound source model.
- What algorithm to use to obtain the minimizing solution w_1(f). In general, w_1(f) cannot be obtained in one step and must be updated iteratively. The formula for updating w_1(f) is called the update rule.
Each of these is described below.
(2) Sound source model
The sound source model p(r(f,t), y_1(f,t)) is a pdf that takes the two variables, the reference signal r(f,t) and the extraction result y_1(f,t), as arguments, and represents the dependence between the two variables. The sound source model can be formulated based on various concepts. The present disclosure uses the following three:
a) bivariate spherical distributions;
b) divergence-based models;
c) time-frequency-varying variance models.
Each is described below.
a) Bivariate spherical distribution
A spherical distribution is a type of multivariate pdf. It is constructed by regarding the multiple arguments of the pdf as a vector and substituting the norm (L2 norm) of that vector into a univariate pdf. Using a spherical distribution in independent component analysis has the effect of making the variables used as its arguments resemble one another. For example, the technique described in Japanese Patent No. 4449871 exploits this property to solve the frequency permutation problem, namely that "which sound source appears in the k-th separation result differs from frequency bin to frequency bin".
When a spherical distribution taking the reference signal and the extraction result as arguments is used as the sound source model of the present disclosure, the two can be made similar. The spherical distribution used here can be expressed in the general form of Equation (24) below. In this equation, the function F is an arbitrary univariate pdf, and c_1 and c_2 are positive constants; by changing these values, the influence of the reference signal on the extraction result can be adjusted. Using the Laplace distribution as the univariate pdf, as in Japanese Patent No. 4449871, yields Equation (25) below, hereinafter called the bivariate Laplace distribution.

  p( r(f,t), y_1(f,t) ) = F( sqrt( c_1 r(f,t)^2 + c_2 |y_1(f,t)|^2 ) )   ... (24)

  p( r(f,t), y_1(f,t) ) ∝ exp( - sqrt( c_1 r(f,t)^2 + c_2 |y_1(f,t)|^2 ) )   ... (25)
b) Divergence-based models
Another type of sound source model is a pdf based on a divergence, a generalization of distance measures, expressed in the form of Equation (26) below. In this equation, divergence(r(f,t), |y_1(f,t)|) denotes an arbitrary divergence between the reference signal r(f,t) and the amplitude |y_1(f,t)| of the extraction result.

  p( r(f,t), y_1(f,t) ) = α exp( - divergence( r(f,t), |y_1(f,t)| ) )   ... (26)
Here, α is a positive constant, a correction term for making the right-hand side of Equation (26) satisfy the conditions of a pdf; since the value of α is irrelevant to the minimization problem of Equation (23), α = 1 may be used. Substituting this pdf into Equation (23) becomes equivalent to the problem of minimizing the divergence between r(f,t) and |y_1(f,t)|, so the two inevitably become similar.
When the Euclidean distance is used as the divergence, Equation (27) below is obtained. When the Itakura-Saito divergence is used, Equation (28) below is obtained; since the Itakura-Saito divergence is a distance measure between power spectra, the squared values of both r(f,t) and |y_1(f,t)| are used. Alternatively, a distance measure analogous to the Itakura-Saito divergence may be computed on the amplitude spectra, in which case Equation (29) below is obtained.

  p( r(f,t), y_1(f,t) ) = α exp( - ( r(f,t) - |y_1(f,t)| )^2 )   ... (27)

  p( r(f,t), y_1(f,t) ) = α exp( - ( r(f,t)^2 / |y_1(f,t)|^2 - log( r(f,t)^2 / |y_1(f,t)|^2 ) - 1 ) )   ... (28)

  p( r(f,t), y_1(f,t) ) = α exp( - ( r(f,t) / |y_1(f,t)| - log( r(f,t) / |y_1(f,t)| ) - 1 ) )   ... (29)
Equation (30) below is a pdf based on yet another divergence. The more similar r(f,t) and |y_1(f,t)| are, the closer their ratio is to 1, so the squared error between that ratio and 1 acts as a divergence.

  p( r(f,t), y_1(f,t) ) = α exp( - ( r(f,t) / |y_1(f,t)| - 1 )^2 )   ... (30)
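The divergence terms of Equations (27) to (30) could be evaluated as in the following sketch; the argument order inside the Itakura-Saito-style terms and the small constant eps are assumptions for illustration.

```python
import numpy as np

def divergence_objective(Y1, R, kind="euclid", eps=1e-12):
    """Evaluate sum of divergence(r(f,t), |y_1(f,t)|) over the spectrogram,
    corresponding to the models of Eqs. (27)-(30)."""
    A = np.abs(Y1)
    if kind == "euclid":                       # Eq. (27)
        d = (R - A) ** 2
    elif kind == "itakura_saito_power":        # Eq. (28)
        ratio = (R ** 2 + eps) / (A ** 2 + eps)
        d = ratio - np.log(ratio) - 1.0
    elif kind == "itakura_saito_amplitude":    # Eq. (29)
        ratio = (R + eps) / (A + eps)
        d = ratio - np.log(ratio) - 1.0
    else:                                      # Eq. (30): squared error between the ratio and 1
        d = ((R + eps) / (A + eps) - 1.0) ** 2
    return np.sum(d)
```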
c) Time-frequency-varying variance models
As another sound source model, a time-frequency-varying variance (TFVV) model is also possible. This is a model in which each point of the spectrogram has a variance or standard deviation that differs per time and per frequency. The rough amplitude spectrogram serving as the reference signal is then interpreted as representing the standard deviation of each point (or some value dependent on the standard deviation).
Assuming as the distribution a Laplace distribution with time-frequency-varying variance (hereinafter, TFVV Laplace distribution) yields Equation (31) below. In this equation, α is, as in Equation (26), a correction term for making the right-hand side satisfy the conditions of a pdf, and α = 1 may be used. β is a term for adjusting the magnitude of the influence of the reference signal on the extraction result. The true TFVV Laplace distribution corresponds to β = 1, but other values such as 1/2 or 2 may also be used.

  p( r(f,t), y_1(f,t) ) = α exp( - |y_1(f,t)| / r(f,t)^β )   ... (31)
Similarly, assuming a TFVV Gaussian distribution yields Equation (32) below, while assuming a TFVV Student-t distribution yields the sound source model of Equation (33) below.

  p( r(f,t), y_1(f,t) ) = α exp( - |y_1(f,t)|^2 / r(f,t)^β )   ... (32)

  p( r(f,t), y_1(f,t) ) = α ( 1 + |y_1(f,t)|^2 / ( ν r(f,t)^β ) )^( -(ν+2)/2 )   ... (33)

In Equation (33), ν (nu) is a parameter called the degrees of freedom, and the shape of the distribution can be changed by changing this value. For example, ν = 1 corresponds to the Cauchy distribution and ν → ∞ to the Gaussian distribution.
The sound source models of Equations (32) and (33) are also used in Reference 1, but in the present disclosure they are used for extraction rather than for separation.
(3) Update rules
In most cases there is no closed-form solution (a solution without iteration) for w_1(f), the solution of the minimization problem of Equation (23), and an iterative algorithm must be used. (However, when the TFVV Gaussian distribution of Equation (32) is used as the sound source model, a closed-form solution exists, as described later.)
For Equations (25), (31), and (33), a fast and stable algorithm called the auxiliary function method can be applied. For Equations (27) to (30), another algorithm called the fixed-point method can be applied.
In the following, the update rule for the case of Equation (32) is described first, and then the update rules based on the auxiliary function method and the fixed-point method are described.
Substituting the TFVV Gaussian distribution of Equation (32) into Equation (23) and ignoring the terms irrelevant to the minimization yields Equation (34) below.

  w_1(f) = argmin_{w_1(f)} w_1(f) { sum_t u(f,t) u(f,t)^H / r(f,t)^β } w_1(f)^H   subject to   w_1(f) w_1(f)^H = 1   ... (34)

This equation can be interpreted as a minimization problem involving a weighted covariance matrix of u(f,t) and can be solved using eigenvalue decomposition.
(Strictly speaking, the expression inside the braces on the right-hand side of Equation (34) is not the weighted covariance matrix itself but T times it; since this difference does not affect the solution of the minimization problem of Equation (34), the expression inside the braces, including the summation, is hereinafter also called the weighted covariance matrix.)
Let eig(A) denote a function that takes a matrix A as its argument, performs eigenvalue decomposition on that matrix, and returns all the eigenvectors. Using this function, the eigenvectors of the weighted covariance matrix of Equation (34) can be written as in Equation (35) below.

  [ a_min(f), ..., a_max(f) ] = eig( sum_t u(f,t) u(f,t)^H / r(f,t)^β )   ... (35)

On the left-hand side of Equation (35), a_min(f), ..., a_max(f) are the eigenvectors, where a_min(f) corresponds to the smallest eigenvalue and a_max(f) to the largest. The norm of each eigenvector is 1, and the eigenvectors are assumed to be mutually orthogonal. The w_1(f) that minimizes Equation (34) is the Hermitian transpose of the eigenvector corresponding to the smallest eigenvalue, as shown in Equation (36) below.

  w_1(f) = a_min(f)^H   ... (36)
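A minimal numpy sketch of this closed-form solution under the TFVV Gaussian model (Equations (34) to (36)) might look as follows; the array shapes and the exponent beta applied to the reference are assumptions carried over from the reconstruction of Equation (32).

```python
import numpy as np

def estimate_w1_tfvv_gauss(U, R, beta=2.0):
    """Closed-form extraction filter under the TFVV Gaussian model (Eqs. (34)-(36)).

    U:    complex array (N, F, T), decorrelated observation signal spectrograms
    R:    non-negative array (F, T), reference signal (rough target amplitude spectrogram)
    beta: exponent applied to the reference (assumption)
    Returns w1 of shape (F, N).
    """
    N, F, T = U.shape
    w1 = np.empty((F, N), dtype=complex)
    eps = 1e-12
    for f in range(F):
        u_f = U[:, f, :]                                   # (N, T)
        weights = 1.0 / np.maximum(R[f, :] ** beta, eps)   # 1 / r(f,t)^beta
        C = (u_f * weights) @ u_f.conj().T                 # weighted covariance, Eq. (34)
        eigval, eigvec = np.linalg.eigh(C)                 # Eq. (35), eigenvalues ascending
        a_min = eigvec[:, 0]                               # eigenvector of the smallest eigenvalue
        w1[f, :] = a_min.conj()                            # Eq. (36): w_1(f) = a_min(f)^H
    return w1
```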
Next, the method of deriving update rules by applying the auxiliary function method to Equations (25), (31), and (33) is described.
The auxiliary function method is one way of solving optimization problems efficiently; details are described in Japanese Patent Application Laid-Open No. 2011-175114 and Japanese Patent Application Laid-Open No. 2014-219467.
Substituting the TFVV Laplace distribution of Equation (31) into Equation (23) and ignoring the terms irrelevant to the minimization yields Equation (37) below.

  w_1(f) = argmin_{w_1(f)} sum_t |y_1(f,t)| / r(f,t)^β   subject to   w_1(f) w_1(f)^H = 1   ... (37)

The solution of this minimization problem cannot be obtained in closed form.
Therefore, an inequality that bounds the expression from above, as in Equation (38), is prepared.

  |y_1(f,t)| <= ( |y_1(f,t)|^2 / b(f,t) + b(f,t) ) / 2   ... (38)
The right-hand side of Equation (38) is called the auxiliary function, and b(f,t) in it is called the auxiliary variable. The inequality holds with equality when b(f,t) = |y_1(f,t)|. Applying this inequality to Equation (37) yields Equation (39) below; hereinafter, the right-hand side of this inequality is denoted G.

  sum_t |y_1(f,t)| / r(f,t)^β  <=  sum_t ( |y_1(f,t)|^2 / b(f,t) + b(f,t) ) / ( 2 r(f,t)^β )  =  G   ... (39)
In the auxiliary function method, the minimization problem is solved quickly and stably by alternately repeating the following two steps.
1. As shown in Equation (40) below, fix w_1(f) and find the b(f,t) that minimizes G.

  b(f,t) = |y_1(f,t)|   ... (40)

2. As shown in Equation (41) below, fix b(f,t) and find the w_1(f) that minimizes G.

  w_1(f) = argmin_{w_1(f)} w_1(f) { sum_t u(f,t) u(f,t)^H / ( b(f,t) r(f,t)^β ) } w_1(f)^H   subject to   w_1(f) w_1(f)^H = 1   ... (41)
Equation (40) is minimized when equality holds in equation (38). Since the value of y_1(f,t) changes every time w_1(f) changes, it is recomputed using equation (9). Since equation (41) is a weighted covariance matrix minimization problem like equation (34), it can be solved by eigenvalue decomposition.
When the eigenvectors of the weighted covariance matrix in equation (41) are computed by equation (42) below, the solution w_1(f) of equation (41) is the Hermitian transpose of the eigenvector corresponding to the smallest eigenvalue (equation (36)).
Figure JPOXMLDOC01-appb-I000042
Note that at the first iteration neither w_1(f) nor y_1(f,t) is known, so equation (40) cannot be applied. The initial value of the auxiliary variable b(f,t) is therefore computed by one of the following methods.
a) Use a normalized version of the reference signal as the auxiliary variable, that is, b(f,t) = normalize(r(f,t)).
b) Compute a provisional value for the extraction result y_1(f,t), and then compute the auxiliary variable from it with equation (40).
c) Substitute a provisional value for w_1(f) and compute equation (40).
The function normalize() in a) is defined by equation (43) below, in which s(t) denotes an arbitrary time-series signal. The role of normalize() is to normalize the mean squared absolute value of the signal to 1.
Figure JPOXMLDOC01-appb-I000043
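By way of illustration, the following Python/NumPy sketch shows one possible reading of the normalize() function of equation (43) and of initialization method a); the function names, array shapes, and the choice of normalizing each frequency bin independently are assumptions made here for the example, not requirements of the present disclosure.

import numpy as np

def normalize(s):
    # Equation (43): scale the time series s(t) so that the mean of |s(t)|^2 becomes 1.
    return s / np.sqrt(np.mean(np.abs(s) ** 2))

def init_auxiliary_variable(r):
    # Method a): b(f, t) = normalize(r(f, t)).
    # r is assumed to be the reference amplitude spectrogram of shape (F, T), and
    # normalization is assumed to be applied independently in each frequency bin.
    return np.stack([normalize(r[f, :]) for f in range(r.shape[0])])

# Hypothetical usage with a random reference spectrogram (F = 4 bins, T = 10 frames).
rng = np.random.default_rng(0)
r = np.abs(rng.standard_normal((4, 10)))
b = init_auxiliary_variable(r)
print(np.mean(np.abs(b) ** 2, axis=1))  # approximately 1.0 in every bin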
As an example of y_1(f,t) in b) above, operations such as selecting one channel of the observation signal or averaging the observation signals of all channels are conceivable. For example, when the microphone arrangement of FIG. 5 described later is used, there is always a microphone assigned to the speaker who is speaking, so it is preferable to use the observation signal of that microphone as the provisional extraction result. If the number of that microphone is k, then y_1(f,t) = normalize(x_k(f,t)).
As for the provisional value in c) above, besides a simple choice such as a vector whose elements are all the same value, the extraction filter estimated in the previous target sound section may be saved and used as the initial value of w_1(f) when processing the next target sound section. For example, when sound source extraction is performed for utterance (3-2) shown in FIG. 3, the extraction filter estimated for the previous utterance (3-1) of the same speaker is used as the provisional value of w_1(f) for the current extraction. Alternatively, as another variant of c), w_1(f) may be obtained at the first iteration only by using the update equation derived from the TFVV Gaussian distribution.
The bivariate Laplace distribution represented by equation (25) can be handled in the same way using an auxiliary function. Substituting equation (25) into equation (23) yields equation (44) below.
Figure JPOXMLDOC01-appb-I000044
Here, an auxiliary function such as equation (45) below is prepared.
Figure JPOXMLDOC01-appb-I000045
Then, the step of finding the auxiliary variable b(f,t) (corresponding to equation (40)) can be expressed as equation (46).
Figure JPOXMLDOC01-appb-I000046
The step of finding the extraction filter w_1(f) (corresponding to equation (41)) can be expressed as equation (47) below.
Figure JPOXMLDOC01-appb-I000047
This minimization problem can be solved by the eigenvalue decomposition of equation (48) below.
Figure JPOXMLDOC01-appb-I000048
Next, the case of the TFVV Student-t distribution represented by equation (33) is described. Since an example of applying the auxiliary function method to the TFVV Student-t distribution is described in Reference 1, only the update equations are given here.
The step of finding the auxiliary variable b(f,t) is given by equation (49) below.
Figure JPOXMLDOC01-appb-I000049
The degree of freedom ν functions as a parameter that adjusts the relative influence of the reference signal r(f,t) and of the extraction result y_1(f,t) obtained during the iterations. When ν = 0 the reference signal is ignored; when ν is at least 0 and less than 2, the extraction result has a larger influence than the reference signal; when ν is greater than 2, the reference signal has the larger influence; and in the limit ν → ∞ the extraction result is ignored, which is equivalent to the TFVV Gaussian distribution.
The step of finding the extraction filter w_1(f) is given by equation (50) below.
Figure JPOXMLDOC01-appb-I000050
Since equation (50) is identical to equation (47) for the bivariate Laplace distribution, the extraction filter can likewise be obtained by equation (48).
Next, a method of deriving update equations from equations (27) to (30), the divergence-based sound source models, is described. Substituting these pdfs into equation (23) yields, in each case, an expression that minimizes the sum of the divergence in the f-th frequency bin, but no suitable auxiliary function has been found for these divergences. Therefore, another optimization algorithm, the fixed-point method, is applied.
The fixed-point algorithm expresses, as an equation, the condition that holds when the parameter to be optimized (in the present disclosure, the extraction filter w_1(f)) has converged, and derives an update equation by rearranging that condition into the fixed-point form w_1(f) = J(w_1(f)). In the present disclosure, the condition used is that the partial derivative with respect to the parameter is zero, and a concrete equation is derived by performing the partial differentiation shown in equation (51) below.
Figure JPOXMLDOC01-appb-I000051
The left-hand side of equation (51) is the partial derivative with respect to conj(w_1(f)). Equation (51) is then rearranged to obtain the form of equation (52).
Figure JPOXMLDOC01-appb-I000052
In the fixed-point algorithm, equation (53) below, obtained by replacing the equal sign of equation (52) with an assignment, is executed repeatedly. However, since w_1(f) must satisfy the constraint of equation (11) in the present disclosure, norm normalization by equation (54) is also performed after equation (53).
Figure JPOXMLDOC01-appb-I000053
Figure JPOXMLDOC01-appb-I000054
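As a rough illustration of how equations (53) and (54) are used together, the following sketch alternates an abstract update map J(.) with norm normalization; the unit-norm reading of the constraint of equation (11) and the toy update map in the usage example are assumptions made here for the example, since the concrete update differs for each source model (equations (55) to (60)).

import numpy as np

def fixed_point_iteration(w, update_map, n_iter=20, tol=1e-6):
    # Repeat: w <- J(w) (equation (53)), then normalize the norm of w (equation (54),
    # assumed here to mean w <- w / ||w||), until w stops changing.
    for _ in range(n_iter):
        w_new = update_map(w)
        w_new = w_new / np.linalg.norm(w_new)
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w

# Hypothetical usage: with this toy map the fixed point is the dominant eigenvector of A.
A = np.array([[2.0, 0.3], [0.3, 1.0]])
w = fixed_point_iteration(np.array([1.0, 0.0]), lambda w: A @ w)
print(w)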
The update equations corresponding to equations (27) to (30) are described below. In each case only the equation corresponding to equation (53) is given, but in the actual extraction processing, the norm normalization of equation (54) is also performed after the assignment.
The update equation derived from equation (27), the pdf corresponding to the Euclidean distance, is equation (55) below.
Figure JPOXMLDOC01-appb-I000055
Equation (55) is written in two rows: the upper row is intended to be used after y_1(f,t) has been computed with equation (9), whereas the lower row is intended to be used with w_1(f) and u(f,t) directly, without computing y_1(f,t). The same applies to equations (56) to (60) described later.
Only at the first iteration, both the extraction filter w_1(f) and the extraction result y_1(f,t) are unknown, so w_1(f) is computed by one of the following methods.
a) Compute a provisional value for the extraction result y_1(f,t), and then compute w_1(f) from it with the upper row of equation (55).
b) Substitute a provisional value for w_1(f), and then compute w_1(f) from it with the lower row of equation (55).
For the provisional value of y_1(f,t) in a), the method of b) in the description of equation (40) can be used. Similarly, for the provisional value of w_1(f) in b), the method of c) in the description of equation (40) can be used.
The update equations derived from equation (28), the pdf corresponding to the Itakura-Saito divergence (power spectrogram version), are equations (56) and (57) below.
Figure JPOXMLDOC01-appb-I000056
Equation (57) is as follows.
Figure JPOXMLDOC01-appb-I000057
Since the transformation into the form of equation (52) is possible in two ways, there are also two update equations.
The second term on the right-hand side of the lower row of equation (56) and the third term on the right-hand side of the lower row of equation (57) are both composed only of u(f,t) and r(f,t) and are constant during the iterations. Therefore, these terms need to be computed only once before the iterations, and the inverse matrix in equation (57) likewise needs to be computed only once.
The update equations derived from equation (29), the pdf corresponding to the Itakura-Saito divergence (amplitude spectrogram version), are equations (58) and (59) below. Two forms are possible here as well.
Figure JPOXMLDOC01-appb-I000058
Equation (59) is as follows.
Figure JPOXMLDOC01-appb-I000059
The update equation derived from equation (30) is equation (60) below. Here too, the last term on the right-hand side needs to be computed only once before the iterations.
Figure JPOXMLDOC01-appb-I000060
The processing described above is applied to the embodiment of the present disclosure described next.
<One Embodiment>
[Configuration example of the sound source extraction device]
FIG. 4 is a diagram showing a configuration example of a sound source extraction device (sound source extraction device 100) which is an example of the signal processing device according to the present embodiment. The sound source extraction device 100 includes, for example, a plurality of microphones 11, an AD (Analog to Digital) conversion unit 12, an STFT (Short-Time Fourier Transform) unit 13, an observation signal buffer 14, a section estimation unit 15, a reference signal generation unit 16, a sound source extraction unit 17, and a control unit 18. The sound source extraction device 100 also includes a post-stage processing unit 19 and a section/reference signal estimation sensor 20 as needed.
The plurality of microphones 11 are installed at mutually different positions. There are several variations in how the microphones are arranged, as described later. A mixed sound signal in which the target sound and sounds other than the target sound are mixed is input through the microphones 11.
The AD conversion unit 12 converts the multi-channel signals acquired by the respective microphones 11 into digital signals channel by channel. These signals are referred to, as appropriate, as (time-domain) observation signals.
The STFT unit 13 converts the observation signals into signals in the time-frequency domain by applying the short-time Fourier transform to them. The time-frequency domain observation signals are sent to the observation signal buffer 14 and the section estimation unit 15.
The observation signal buffer 14 accumulates observation signals for a predetermined time (number of frames). The observation signals are stored frame by frame, and when a request for observation signals of a certain time range is received from another module, the observation signals corresponding to that time range are returned. The signals accumulated here are used in the reference signal generation unit 16 and the sound source extraction unit 17.
The section estimation unit 15 detects sections in which the target sound is contained in the mixed sound signal. Specifically, the section estimation unit 15 detects, for example, the start time (the time when the sound begins) and the end time (the time when the sound ends) of the target sound. Which technique is used for this section estimation depends on the usage scene of the present embodiment and on the microphone arrangement, so details are described later.
The reference signal generation unit 16 generates a reference signal corresponding to the target sound based on the mixed sound signal. For example, the reference signal generation unit 16 estimates a rough amplitude spectrogram of the target sound. Since the processing performed by the reference signal generation unit 16 depends on the usage scene of the present embodiment and on the microphone arrangement, details are described later.
The sound source extraction unit 17 extracts, from the mixed sound signal, a signal that is similar to the reference signal and in which the target sound is further emphasized. Specifically, the sound source extraction unit 17 estimates the target sound using the observation signal corresponding to the section in which the target sound is present and the reference signal, or it estimates an extraction filter for generating such an estimation result from the observation signal.
The output of the sound source extraction unit 17 is sent to the post-stage processing unit 19 as needed. An example of the post-stage processing performed by the post-stage processing unit 19 is speech recognition. When combined with speech recognition, the sound source extraction unit 17 outputs the time-domain extraction result, that is, a speech waveform, and the speech recognition unit performs recognition processing on that waveform.
Some speech recognizers have a speech section detection function, but since the present embodiment includes the equivalent section estimation unit 15, the speech section detection function on the speech recognition side can be omitted. Speech recognizers also often include an STFT for extracting, from the waveform, the speech features required for recognition; when combined with the present embodiment, the STFT on the speech recognition side may be omitted. When the STFT on the speech recognition side is omitted, the sound source extraction unit 17 outputs the time-frequency domain extraction result, that is, a spectrogram, and the speech recognition side converts that spectrogram into speech features.
The control unit 18 comprehensively controls each unit of the sound source extraction device 100, for example, the operation of each of the units described above. Although omitted in FIG. 4, the control unit 18 and the above-mentioned functional blocks are connected to one another.
The section/reference signal estimation sensor 20 is a sensor, separate from the microphones 11, intended to be used for section estimation or reference signal generation. In FIG. 4, the post-stage processing unit 19 and the section/reference signal estimation sensor 20 are shown in parentheses to indicate that they can be omitted from the sound source extraction device 100. That is, if the accuracy of section estimation or reference signal generation can be improved by providing a dedicated sensor different from the microphones 11, such a sensor may be used.
For example, when a method using lip images, as described in JP H10-51889 A and the like, is used to detect utterance sections, an image sensor (camera) can be employed as the sensor. Alternatively, the following sensors, which are used as auxiliary sensors in Japanese Patent Application No. 2019-073542 proposed by the present inventor, may be provided, and section estimation or reference signal generation may be performed using the signals acquired by them.
- A microphone of the type used in close contact with the body, such as a bone-conduction microphone or a throat microphone.
- A sensor capable of observing vibrations of the skin surface near the speaker's mouth or throat, for example a combination of a laser pointer and an optical sensor.
[Section estimation and reference signal generation]
Several variations of the usage scene of the present embodiment and of the arrangement of the microphones 11 are conceivable, and the techniques that can be applied for section estimation and reference signal generation differ for each. To explain each variation, it is necessary to clarify whether sections of the target sound can overlap with one another and, if so, how such overlap is handled. Three typical usage scenes and arrangements are described below with reference to FIGS. 5 to 7, respectively.
FIG. 5 assumes a situation in which N (two or more) speakers are present in an environment and a microphone is assigned to each speaker. "A microphone is assigned" means that each speaker wears a pin microphone, a headset microphone, or the like, or that a microphone is installed at close range to each speaker. The N speakers are denoted S1, S2, ..., Sn, and the microphones assigned to them M1, M2, ..., Mn. In addition, zero or more interfering sound sources Ns are present.
Such a situation corresponds, for example, to a meeting held in a room where speech recognition is applied to the audio picked up by each speaker's microphone in order to produce the minutes of the meeting automatically. In this case, utterances may overlap, and when they do, a signal in which the voices are mixed is observed at each microphone. Interfering sound sources may include the fan noise of a projector or an air conditioner, or playback sound emitted from a device equipped with a loudspeaker, and these sounds are also included in the observation signal of each microphone. All of these cause recognition errors, but with the sound source extraction technique of the present embodiment, only the voice of the speaker corresponding to each microphone is retained while the other sound sources (other speakers and interfering sources) are removed (suppressed), so speech recognition accuracy can be improved.
Section detection methods and reference signal generation methods usable in such a situation are described below. In the following, among the sounds observed at each microphone, the voice of the corresponding (target) speaker is referred to as the main voice or main utterance, and the voice of another speaker as wraparound voice or crosstalk, as appropriate.
As the section detection method, the main-utterance detection described in Japanese Patent Application No. 2019-227192 can be used. In that application, training a neural network realizes a detector that responds to the main voice while ignoring crosstalk. Since it also handles overlapping utterances, the section and speaker of each utterance can be estimated even when utterances overlap, as shown in FIG. 3.
At least two reference signal generation methods are possible. One is to generate the reference signal directly from the signal observed by the microphone assigned to the speaker. For example, the signal observed by microphone M1 in FIG. 5 is a mixture of all sound sources, but the voice of speaker S1, the nearest source, is picked up loudly while the other sources are picked up comparatively quietly. Therefore, if the observation signal of microphone M1 is cut out according to the utterance section of speaker S1, the short-time Fourier transform is applied, and the absolute value is taken to generate an amplitude spectrogram, the result is a rough amplitude spectrogram of the target sound and can be used as the reference signal in the present embodiment.
The other method is to use the crosstalk reduction technique described in the aforementioned Japanese Patent Application No. 2019-227192. In that application, a neural network is trained to remove (reduce) crosstalk from a signal in which the main voice and crosstalk are mixed, leaving the main voice. The output of this neural network is either the amplitude spectrogram of the crosstalk reduction result or a time-frequency mask; the former can be used directly as the reference signal. Even in the latter case, applying the time-frequency mask to the amplitude spectrogram of the observation signal yields the amplitude spectrogram of the crosstalk removal result, which can then be used as the reference signal.
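To make the latter case concrete, the following sketch applies a time-frequency mask (standing in for the output of the crosstalk-reduction network) to the amplitude spectrogram of the observation signal to obtain a reference signal; the array shapes and the random placeholder data are assumptions made here for the example.

import numpy as np

def reference_from_mask(obs_spectrogram, tf_mask):
    # Element-wise masking of the observation amplitude spectrogram; the result is
    # a rough amplitude spectrogram of the main voice, usable as r(f, t).
    return tf_mask * np.abs(obs_spectrogram)

# Hypothetical usage with random data (F = 257 bins, T = 100 frames).
rng = np.random.default_rng(0)
x = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))
mask = rng.uniform(0.0, 1.0, size=(257, 100))  # stands in for the network output
r = reference_from_mask(x, mask)
print(r.shape)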
Next, reference signal generation and related processing in a usage scene different from that of FIG. 5 are described with reference to FIG. 6. FIG. 6 assumes an environment in which there are one or more speakers and one or more interfering sound sources. Whereas FIG. 5 focused on the overlap of utterances rather than on the interfering source Ns, the example shown in FIG. 6 focuses on obtaining clean speech in a noisy environment containing loud interfering sounds. However, when there are two or more speakers, overlapping utterances are also an issue.
There are n speakers, denoted speaker S1 to speaker Sn, with n being 1 or more. Although only one interfering sound source Ns is shown in FIG. 6, the number is arbitrary.
Two types of sensors are used. One is a sensor worn by each speaker or installed in the immediate vicinity of each speaker (a sensor corresponding to the section/reference signal estimation sensor 20), hereafter referred to as sensors SE (SE1, SE2, ..., SEn) as appropriate. The other is a microphone array 11A composed of a plurality of microphones 11 whose positions are fixed.
The section/reference signal estimation sensor 20 may be of the same type as the microphones in FIG. 5 (so-called air-conduction microphones, which pick up sound propagating through the air), but, as described with reference to FIG. 4, a microphone of the type used in close contact with the body, such as a bone-conduction microphone or a throat microphone, or a sensor capable of observing vibrations of the skin surface near the speaker's mouth or throat, may also be used. In any case, since the sensors SE are closer to or in contact with each speaker compared with the microphone array, the utterance of the speaker corresponding to each sensor can be recorded with a high signal-to-noise ratio.
As the microphone array 11A, besides the form in which a plurality of microphones are mounted on a single device, a form in which microphones are installed at multiple locations in a space, called distributed microphones, is also possible. Examples of distributed microphones include installing microphones on the walls or ceiling of a room, or on the seats, walls, ceiling, dashboard, and so on inside an automobile.
In this example, the signals acquired by the sensors SE1 to SEn corresponding to the section/reference signal estimation sensor 20 are used for section estimation and reference signal generation, and the multi-channel observation signal acquired from the microphone array 11A is used for sound source extraction. When air-conduction microphones are used as the sensors SE, the same section estimation and reference signal generation methods as those described with reference to FIG. 5 can be used.
On the other hand, when a close-contact microphone is used, besides the methods described for FIG. 5, methods that exploit the fact that a signal with little contamination from interfering sounds and other speakers' utterances can be obtained are also usable. For example, for section estimation, a method that thresholds the power of the input signal can be used, and as the reference signal, the amplitude spectrogram generated from the input signal can be used as it is. The sound recorded by a close-contact microphone has attenuated high frequencies and may also contain body-generated sounds such as swallowing, so it is not necessarily suitable as an input to speech recognition and the like, but it can be used effectively for section estimation and reference signal generation.
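The power-threshold approach mentioned above can be sketched as follows; the frame length, hop size, and threshold value are assumptions made here for the example.

import numpy as np

def detect_sections(signal, frame_len=512, hop=256, threshold_db=-40.0):
    # Mark frames whose power exceeds a threshold (in dB relative to the loudest frame)
    # as belonging to the target sound section.  Returns one boolean per frame.
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    power = np.array([np.mean(signal[i * hop:i * hop + frame_len] ** 2)
                      for i in range(n_frames)])
    power_db = 10.0 * np.log10(power / (np.max(power) + 1e-12) + 1e-12)
    return power_db > threshold_db

# Hypothetical usage: near-silence followed by a burst of signal.
rng = np.random.default_rng(0)
sig = np.concatenate([0.001 * rng.standard_normal(8000), 0.5 * rng.standard_normal(8000)])
print(detect_sections(sig))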
When a sensor other than a microphone, such as an optical sensor, is used as the sensor SE, the method described in Japanese Patent Application No. 2019-227192 can be used. In that application, a neural network is trained in advance on the correspondence from the sound acquired by an air-conduction microphone (a mixture of the target sound and interfering sound) and the signal acquired by the auxiliary sensor (some signal corresponding to the target sound) to the clean target sound; at inference time, the signals acquired by the air-conduction microphone and the auxiliary sensor are input to the neural network to generate a nearly clean target sound. Since the output of that neural network is an amplitude spectrogram (or a time-frequency mask), it can be used as the reference signal of the present embodiment (or used to generate the reference signal). Moreover, since a modification that estimates the section in which the target sound is present while generating the clean target sound is also mentioned, the method can also be used as a section detection means.
Sound source extraction is basically performed using the observation signal acquired by the microphone array 11A. However, when air-conduction microphones are used as the sensors SE, the observation signals acquired by them can also be added. That is, if the microphone array 11A consists of N microphones, sound source extraction may be performed using the (N+m)-channel observation signal obtained by combining them with the m section/reference signal estimation sensors. In that case, since multiple air-conduction microphones exist even when N = 1, a single microphone may be used instead of the microphone array 11A.
Similarly, in section estimation and reference signal generation, signals derived from the microphone array may be used in addition to the sensors SE. Since the microphone array 11A is distant from every speaker, each speaker's utterance is always observed there as crosstalk. By comparing that signal with the signal of the section/reference signal estimation microphone, the accuracy of section estimation, in particular when utterances overlap, can be expected to improve.
FIG. 7 shows a microphone arrangement different from that of FIG. 6. It is the same as FIG. 6 in assuming an environment with one or more speakers and one or more interfering sound sources, but only the microphone array 11A is used, and there are no sensors installed close to each speaker. As in FIG. 6, the microphone array 11A may take the form of a plurality of microphones mounted on a single device, a plurality of microphones installed in a space (distributed microphones), and so on.
In such a situation, the issue is how to perform the utterance section estimation and reference signal estimation that the sound source extraction of the present disclosure presupposes, and the applicable techniques differ depending on whether mixtures of voices occur infrequently or frequently. Each case is described below.
The case in which mixtures of voices occur infrequently is the case in which only one speaker (that is, only speaker S1) is present in the environment and the interfering sound source Ns can be regarded as non-speech. In that case, as the section estimation method, a voice activity detection technique focusing on "speech-likeness", as described in Japanese Patent No. 4182444 and the like, can be applied. That is, in the environment of FIG. 7, when the only "speech-like" signal is considered to be the utterance of speaker S1, non-speech signals are ignored and the portions (timings) containing speech-like signals are detected as target sound sections.
As the reference signal generation method, a technique called denoising, as described in Reference 3, is applicable: a signal in which speech and non-speech are mixed is input, the non-speech is removed, and the speech is kept. A great variety of denoising methods can be applied; for example, the method below uses a neural network whose output is an amplitude spectrogram, so that output can be used directly as the reference signal.
"Reference 3:
Liu, D., Smaragdis, P. & Kim, M., "Experiments on deep learning for speech denoising," Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2014, pp. 2685-2689."
On the other hand, the case in which mixtures of voices occur frequently is, for example, the case in which multiple speakers are conversing in an environment and their utterances overlap, or the case in which, even with a single speaker, the interfering sound source is speech. An example of the latter is speech output from a loudspeaker such as a television or radio. In such cases, a method that is also applicable to mixtures of voices must be used for utterance section detection. For example, the following techniques are applicable.
a) Voice activity detection using sound source direction estimation (for example, the methods described in JP 2010-121975 A and JP 2012-150237 A).
b) Voice activity detection using face images (lip images) (for example, the methods described in JP H10-51889 A and JP 2011-191423 A).
Since the microphone arrangement shown in FIG. 7 includes a microphone array, the sound source direction estimation that a) presupposes can be applied. In addition, if an image sensor (camera) is used as the section/reference signal estimation sensor 20 of the example shown in FIG. 4, b) can also be applied. With either method, the direction of the utterance is also known at the time the utterance section is detected (in method b), the utterance direction can be computed from the position of the lips in the image), so that value can be used for reference signal generation. Hereafter, the sound source direction estimated in the utterance section estimation is referred to as θ, as appropriate.
The reference signal generation method also needs to handle mixtures of voices, and the following techniques are applicable.
a) Time-frequency masking using the sound source direction. This is the reference signal generation method used in JP 2014-219467 A. A steering vector corresponding to the sound source direction θ is computed, and the cosine similarity between it and the observation signal vector (equation (2) above) is computed; this yields a mask that keeps sound arriving from direction θ and attenuates sound arriving from other directions. The mask is applied to the amplitude spectrogram of the observation signal, and the signal thus generated is used as the reference signal.
b) Neural-network-based selective listening techniques such as Speaker Beam and Voice Filter. Selective listening here refers to a technique that extracts the voice of one designated speaker from a monaural signal in which multiple voices are mixed. Clean speech of the speaker to be extracted, not mixed with other speakers (the utterance content may differ from that of the mixed speech), is recorded in advance; when the mixed signal and the clean speech are both input to the neural network, the voice of the designated speaker contained in the mixed signal is output. More precisely, a time-frequency mask for generating such a spectrogram is output. When the mask thus output is applied to the amplitude spectrogram of the observation signal, the result can be used as the reference signal of the present embodiment.
Details of Speaker Beam and Voice Filter are described in References 4 and 5 below, respectively.
"Reference 4:
M. Delcroix, K. Zmolikova, K. Kinoshita, A. Ogawa and T. Nakatani, "Single channel target speaker extraction and recognition with speaker beam," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018."
"Reference 5:
Quan Wang, Hannah Muckenhirn, Kevin Wilson, Prashant Sridhar, Zelin Wu, John Hershey, Rif A. Saurous, Ron J. Weiss, Ye Jia, Ignacio Lopez Moreno, "VOICEFILTER: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking," arXiv:1810.04826v3 [eess.AS], 27 Oct 2018. https://arxiv.org/abs/1810.04826"
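To illustrate technique a) above, the following sketch computes a plane-wave steering vector for direction θ, forms a mask from the cosine similarity between that vector and the observation vector at each time-frequency point, and applies the mask to an observation amplitude spectrogram; the uniform linear array geometry, the sound speed, and the use of the plain similarity magnitude as the mask are assumptions made here for the example, not the exact formulation of the cited publication.

import numpy as np

def steering_vector(theta, freq_hz, mic_positions, c=340.0):
    # Plane-wave steering vector (unit norm) for a linear array; positions in metres.
    delays = mic_positions * np.sin(theta) / c
    return np.exp(-2j * np.pi * freq_hz * delays) / np.sqrt(len(mic_positions))

def direction_mask(obs, theta, freqs_hz, mic_positions):
    # obs: observation spectrogram of shape (n_mics, F, T).
    # The mask at (f, t) is |a(f)^H x(f, t)| / ||x(f, t)||, i.e. the cosine similarity
    # between the steering vector for theta and the observation vector.
    n_mics, F, T = obs.shape
    mask = np.zeros((F, T))
    for f in range(F):
        a = steering_vector(theta, freqs_hz[f], mic_positions)
        x = obs[:, f, :]
        mask[f, :] = np.abs(np.conj(a) @ x) / (np.linalg.norm(x, axis=0) + 1e-12)
    return mask

# Hypothetical usage: 4 microphones spaced 5 cm apart, 257 bins up to 4 kHz, 50 frames.
rng = np.random.default_rng(0)
obs = rng.standard_normal((4, 257, 50)) + 1j * rng.standard_normal((4, 257, 50))
mask = direction_mask(obs, np.deg2rad(30.0), np.linspace(0.0, 4000.0, 257), np.arange(4) * 0.05)
reference = mask * np.abs(obs[0])  # mask applied to one channel's amplitude spectrogram
print(reference.shape)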
(Details of the sound source extraction unit)
Next, details of the sound source extraction unit 17 are described with reference to FIG. 8. The sound source extraction unit 17 includes, for example, a pre-processing unit 17A, an extraction filter estimation unit 17B, and a post-processing unit 17C.
The pre-processing unit 17A performs the decorrelation processing shown in equations (3) to (7), that is, decorrelation and related processing on the time-frequency domain observation signal.
The extraction filter estimation unit 17B estimates a filter that extracts a signal in which the target sound is further emphasized. Specifically, the extraction filter estimation unit 17B estimates the extraction filter for sound source extraction and generates the extraction result. More specifically, the extraction filter estimation unit 17B estimates the extraction filter as the solution that optimizes an objective function reflecting the dependence between the reference signal and the extraction result produced by the extraction filter, and the independence between the extraction result and the separation results of the other virtual sound sources.
As described above, the extraction filter estimation unit 17B uses, as the sound source model representing the dependence between the reference signal and the extraction result included in the objective function, one of the following:
- a bivariate spherical distribution of the extraction result and the reference signal;
- a time-frequency-varying variance model that regards the reference signal as a value corresponding to the variance at each time-frequency point;
- a model using a divergence between the absolute value of the extraction result and the reference signal.
A bivariate Laplace distribution may be used as the bivariate spherical distribution. As the time-frequency-varying variance model, any of the time-frequency-varying variance Gaussian distribution, the time-frequency-varying variance Laplace distribution, and the time-frequency-varying variance Student-t distribution may be used. As the divergence of the divergence-based model, any of the following may be used: the Euclidean distance or squared error between the absolute value of the extraction result and the reference signal; the Itakura-Saito distance between the power spectrum of the extraction result and that of the reference signal; the Itakura-Saito distance between the amplitude spectrum of the extraction result and that of the reference signal; or the squared error between 1 and the ratio of the absolute value of the extraction result to the reference signal.
The post-processing unit 17C performs at least the application of the extraction filter to the mixed sound signal. In addition to the rescaling processing described later, the post-processing unit 17C may also perform processing that applies the inverse Fourier transform to the extraction result spectrogram to generate the extraction result waveform.
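As a rough sketch of the filter application performed here, the following applies an already-estimated extraction filter to the decorrelated observation for every frequency bin; the form y_1(f,t) = w_1(f) u(f,t) is assumed from the earlier references to equation (9), and rescaling and the inverse transform, whose concrete forms are described later, are omitted.

import numpy as np

def apply_extraction_filter(w, u):
    # w: (F, n_mics) complex array, one row vector w_1(f) per frequency bin.
    # u: (n_mics, F, T) complex decorrelated observation.  Returns y_1 of shape (F, T).
    F, n_mics = w.shape
    y = np.empty((F, u.shape[2]), dtype=complex)
    for f in range(F):
        y[f, :] = w[f, :] @ u[:, f, :]   # assumed form of equation (9)
    return y

# Hypothetical usage (3 microphones, 129 bins, 40 frames); rescaling and the inverse
# STFT would follow to obtain the extraction result waveform.
rng = np.random.default_rng(0)
u = rng.standard_normal((3, 129, 40)) + 1j * rng.standard_normal((3, 129, 40))
w = rng.standard_normal((129, 3)) + 1j * rng.standard_normal((129, 3))
print(apply_extraction_filter(w, u).shape)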
[Flow of processing performed by the sound source extraction device]
(Overall flow)
Next, the flow of processing (overall flow) performed by the sound source extraction device 100 is described with reference to the flowchart shown in FIG. 9. Unless otherwise noted, the processing described below is performed by the control unit 18.
In step ST11, the AD conversion unit 12 converts the analog observation signal (mixed sound signal) input to the microphones 11 into a digital signal. The observation signal at this point is in the time domain. The process then proceeds to step ST12.
In step ST12, the STFT unit 13 applies the short-time Fourier transform (STFT) to the time-domain observation signal to obtain the observation signal in the time-frequency domain. Input may be performed not only from the microphones but also from a file, a network, or the like as needed. Details of the specific processing performed by the STFT unit 13 are described later. In the present embodiment, since there are a plurality of input channels (one per microphone), AD conversion and STFT are also performed for each channel. The process then proceeds to step ST13.
In step ST13, processing (buffering) is performed in which the observation signal converted into the time-frequency domain by the STFT is accumulated for a predetermined time (a predetermined number of frames). The process then proceeds to step ST14.
In step ST14, the section estimation unit 15 estimates the start time (the time when the sound begins) and the end time (the time when the sound ends) of the target sound. Furthermore, when the device is used in an environment where utterances may overlap, information that identifies which speaker produced the utterance is also estimated. For example, in the usage forms shown in FIGS. 5 and 6, the number of the microphone (sensor) assigned to each speaker is also estimated, and in the usage form shown in FIG. 7, the direction of the utterance is also estimated.
Sound source extraction and the accompanying processing are performed for each section of the target sound. Therefore, the process proceeds to step ST16 only when a section is detected; when no section is detected, steps ST16 to ST19 are skipped and the process proceeds to step ST20.
When a section is detected, in step ST16 the reference signal generation unit 16 generates a rough amplitude spectrogram of the target sound present in that section. The methods usable for generating the reference signal are as described with reference to FIGS. 5 to 7. The process then proceeds to step ST17.
In step ST17, the sound source extraction unit 17 generates the extraction result of the target sound using the reference signal obtained in step ST16 and the observation signal corresponding to the time range of the target sound section. Details of this processing are described later.
In step ST18, it is determined whether the processing of steps ST16 and ST17 is to be repeated a predetermined number of times. The meaning of this iteration is as follows: once the sound source extraction processing has produced an extraction result more accurate than the observation signal or the reference signal, a reference signal is generated again from that extraction result, and performing the sound source extraction processing again with it yields an extraction result that is even more accurate than the previous one.
For example, when the observation signal is input to a neural network to generate the reference signal, if the first extraction result is input to the neural network instead of the observation signal, its output is likely to be more accurate than the first output of the neural network. Therefore, when that output is used as the reference signal to generate a second extraction result, the result is likely to be more accurate than the first, and by further iterating, an even more accurate extraction result can be obtained. Unlike Reference 1, the present embodiment is characterized in that this iteration is performed in the extraction processing rather than in separation processing. Note that this iteration is distinct from the iteration used when estimating the filter by the auxiliary function method or the fixed-point method inside the sound source extraction processing of step ST17. After the processing of step ST18, the process proceeds to step ST19.
In step ST19, post-stage processing is performed by the post-stage processing unit 19 using the extraction result generated in step ST17. Examples of the post-stage processing include speech recognition and, further, response generation for spoken dialogue using the recognition result. The process then proceeds to step ST20.
In step ST20, it is determined whether to continue processing. If so, the process returns to step ST11; if not, the process ends.
(About the STFT)
Next, the short-time Fourier transform performed by the STFT unit 13 is described with reference to FIG. 10. In the present embodiment, the microphone observation signal is a multi-channel signal observed with a plurality of microphones, so the STFT is performed for each channel. The following describes the STFT for the k-th channel.
A fixed length is cut out from the waveform of the microphone recording signal obtained by the AD conversion processing of step ST11, and a window function such as a Hanning window or a Hamming window is applied to it (see FIG. 10A). This cut-out unit is called a frame. By applying the short-time Fourier transform to the data of one frame (see FIG. 10B), x_k(1,t) to x_k(F,t) are obtained as observation signals in the time-frequency domain, where t denotes the frame number and F the total number of frequency bins (see FIG. 10C).
The cut-out frames may overlap one another, which makes the time-frequency-domain signal change smoothly between consecutive frames. In FIG. 10, the data of one frame, x_k(1,t) to x_k(F,t), is written collectively as a single vector x_k(t) (see FIG. 10C). x_k(t) is called a spectrum, and a data structure in which multiple spectra are arranged along the time direction is called a spectrogram.
In FIG. 10C, the horizontal axis represents the frame number and the vertical axis represents the frequency bin number; three spectra 51A, 52A, and 53A are generated from the cut-out observation signals 51, 52, and 53, respectively.
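The framing, windowing, and transform described above can be sketched in Python as follows; the frame length and frame shift are arbitrary example values, and only a single channel k is processed.

```python
import numpy as np

def stft_single_channel(waveform, frame_len=1024, frame_shift=256):
    """Frame the waveform, apply a Hanning window, and take the FFT of each
    frame, producing the spectrogram x_k(f, t) described above.
    The frame length and shift are illustrative values."""
    window = np.hanning(frame_len)
    num_frames = 1 + (len(waveform) - frame_len) // frame_shift
    # F frequency bins per frame (one-sided spectrum), T frames.
    spectrogram = np.empty((frame_len // 2 + 1, num_frames), dtype=complex)
    for t in range(num_frames):
        frame = waveform[t * frame_shift : t * frame_shift + frame_len]
        spectrogram[:, t] = np.fft.rfft(window * frame)   # x_k(1..F, t)
    return spectrogram
```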
(Sound source extraction process)
Next, the sound source extraction process according to this embodiment will be described with reference to the flowchart shown in FIG. 11.
In step ST31, pre-processing is performed by the pre-processing unit 17A. An example of pre-processing is the decorrelation expressed by equations (3) to (6). In addition, some of the update rules used in filter estimation require special handling only in the first iteration; such handling is also performed as pre-processing. The process then proceeds to step ST32.
In step ST32, the extraction filter is estimated, after which the process proceeds to step ST33. Steps ST32 and ST33 form the iteration for estimating the extraction filter. Except when the TFVV Gaussian distribution of equation (32) is used as the source model, the extraction filter cannot be obtained in closed form, so the processing of step ST32 is repeated until the extraction filter and the extraction result converge, or for a predetermined number of iterations.
The extraction filter estimation of step ST32 is the process of obtaining the extraction filter w_1(f); the specific equations differ for each source model.
For example, when the TFVV Gaussian distribution of equation (32) is used as the source model, the weighted covariance matrix on the right-hand side of equation (35) is computed from the reference signal r(f,t) and the decorrelated observation signal u(f,t), and its eigenvectors are then obtained by eigenvalue decomposition. As in equation (36), applying the Hermitian transpose to the eigenvector corresponding to the smallest eigenvalue yields the desired extraction filter w_1(f). This processing is performed for all frequency bins, i.e., f = 1 to F.
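The following is a minimal Python sketch of this update for one frequency bin. Since equation (35) itself is not reproduced in this section, the weighting of the covariance by 1/r(f,t)^2 is an assumption; the eigenvector selection and the Hermitian transpose follow equation (36).

```python
import numpy as np

def estimate_filter_tfvv_gauss(u, r, eps=1e-9):
    """Extraction filter for the TFVV Gaussian model (around eqs. (35)-(36)).

    u : decorrelated observation for one frequency bin f, shape (N, T)
    r : reference signal r(f, t), shape (T,)

    The exact weighting of eq. (35) is not reproduced here; weighting the
    covariance by 1 / r(f,t)^2 is assumed.  The filter is the Hermitian
    transpose of the eigenvector of the smallest eigenvalue, as in eq. (36)."""
    weights = 1.0 / np.maximum(r, eps) ** 2                # assumed weighting
    cov = (u * weights) @ u.conj().T / u.shape[1]          # weighted covariance
    eigvals, eigvecs = np.linalg.eigh(cov)                 # eigenvalues ascending
    v_min = eigvecs[:, 0]                                  # smallest eigenvalue
    return v_min.conj()                                    # w_1(f) = v_min^H
```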
Similarly, when the TFVV Laplace distribution of equation (31) is used as the source model, the auxiliary variable b(f,t) is first computed from the reference signal r(f,t) and the decorrelated observation signal u(f,t) according to equation (40). Next, the weighted covariance matrix on the right-hand side of equation (42) is computed, and eigenvalue decomposition is applied to it to obtain the eigenvectors. Finally, the extraction filter w_1(f) is obtained by equation (36). Since w_1(f) has not yet converged at this point, the process returns to equation (40) and the auxiliary variable is computed again. These steps are repeated until w_1(f) converges, or for a predetermined number of iterations.
Likewise, when the bivariate Laplace distribution of equation (25) is used as the source model, the computation of the auxiliary variable b(f,t) (equation (46)) and the computation of the extraction filter (equations (48) and (36)) are performed alternately.
On the other hand, when a divergence-based model expressed by equation (26) is used as the source model, the update equations corresponding to each model (equations (55) to (60)) and the equation that normalizes the norm to 1 (equation (54)) are performed alternately.
When the extraction filter has converged, or after a predetermined number of iterations, the process proceeds to step ST34.
In step ST34, post-processing is performed by the post-processing unit 17C. In the post-processing, the extraction result is rescaled and, if necessary, an inverse Fourier transform is applied to generate a time-domain waveform. Rescaling is the process of adjusting the scale of the extraction result for each frequency bin. In the extraction filter estimation, the norm of the filter is constrained to 1 so that efficient algorithms can be applied, but the extraction result generated by a filter with this constraint differs in scale from the ideal target sound. The scale of the extraction result is therefore adjusted using the observation signal before decorrelation.
The rescaling procedure is as follows.
First, setting k = 1 in equation (9), the extraction result before rescaling, y_1(f,t), is computed from the converged extraction filter w_1(f). The rescaling coefficient γ(f) can be obtained as the value that minimizes equation (61) below; the specific solution is given by equation (62).
[Equation (61)]
[Equation (62)]
In these equations, x_i(f,t) is the observation signal (before decorrelation) that serves as the rescaling target. How to choose x_i(f,t) is described later. The coefficient γ(f) obtained in this way is multiplied into the extraction result as in equation (63) below. The rescaled extraction result y_1(f,t) corresponds to the component derived from the target sound in the observation signal of the i-th microphone; that is, it is almost equal to the signal that would be observed by the i-th microphone if no sound source other than the target sound existed.
[Equation (63)]
Further, if necessary, the waveform of the extraction result is obtained by applying the inverse Fourier transform to the rescaled result. As described above, the inverse Fourier transform can be omitted depending on the downstream processing.
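A minimal Python sketch of this rescaling step follows. Equations (61) and (62) are not reproduced in this section, so the sketch assumes the usual least-squares form, i.e., γ(f) minimizes the squared error between x_i(f,t) and γ(f)·y_1(f,t); equation (63) is the final multiplication.

```python
import numpy as np

def rescale(y, x_target, eps=1e-12):
    """Rescaling of the extraction result (eqs. (61)-(63)).

    y        : extraction result y_1(f, t) before rescaling, shape (F, T)
    x_target : observation signal x_i(f, t) used as the rescaling target, shape (F, T)

    The least-squares form of eqs. (61)-(62) is an assumption:
    gamma(f) is taken to minimize sum_t |x_i(f,t) - gamma(f) * y_1(f,t)|^2."""
    # Closed-form minimizer of the assumed squared error, per frequency bin.
    gamma = np.sum(np.conj(y) * x_target, axis=1) / (
        np.sum(np.abs(y) ** 2, axis=1) + eps)
    # Eq. (63): multiply the coefficient into the extraction result.
    return gamma[:, None] * y
```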
Here, how to choose the observation signal x_i(f,t) that serves as the rescaling target is explained. The choice depends on how the microphones are installed. In some installations there is a microphone that picks up the target sound strongly. For example, in the installation of FIG. 5, a microphone is assigned to each speaker, so the utterance of speaker i is picked up most strongly by microphone i. The observation signal x_i(f,t) of microphone i can therefore be used as the rescaling target.
In the installation of FIG. 6, the same method is applicable when an air-conduction microphone such as a lavalier (pin) microphone is used as the sensor SE. On the other hand, when a contact microphone such as a bone-conduction microphone is used as the sensor SE, or when a sensor other than a microphone, such as an optical sensor, is used, the signals picked up by those sensors are unsuitable as rescaling targets, so the same method as for FIG. 7, described next, is used.
In the installation of FIG. 7, there is no microphone assigned to each speaker, so the rescaling target has to be found in another way. Below, the case in which the microphones constituting the microphone array are fixed to a single device and the case in which they are installed throughout the space (distributed microphones) are described in turn.
When the microphones are fixed to a single device, the SNR (the power ratio between the target sound and the other signals) of each microphone can be regarded as almost identical. The observation signal of any microphone may therefore be chosen as the rescaling target x_i(f,t).
Alternatively, rescaling using a delay-and-sum beamformer, as used in the technique described in Japanese Patent Application Laid-Open No. 2014-219467, is also applicable. As explained with reference to FIG. 7, when the section detection handles overlapping utterances, the utterance direction θ is estimated at the same time as the utterance section. Using the signals observed by the microphone array and the utterance direction θ, a delay-and-sum operation can generate a signal in which the sound arriving from that direction is enhanced to some extent. Writing the result of the delay-and-sum toward direction θ as z(f, t, θ), the rescaling coefficient is computed by equation (64) below.
[Equation (64)]
When the microphone array consists of distributed microphones, a different method is used. With distributed microphones, the SNR of the observation signal differs from microphone to microphone: it is expected to be high for microphones close to the speaker and low for distant ones. It is therefore desirable to choose, as the rescaling target, the observation signal of a microphone close to the speaker. To this end, rescaling is performed with respect to each microphone's observation signal, and the result with the largest power is adopted.
The power of a rescaled result is determined solely by the absolute value of its rescaling coefficient. The rescaling coefficient is therefore computed for each microphone number i by equation (65) below, the coefficient with the largest absolute value is taken as γ_{max}, and rescaling is performed by equation (66) below.
[Equation (65)]
[Equation (66)]
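The selection among distributed microphones can be sketched as follows, reusing the least-squares form assumed above for the rescaling coefficients; selecting the largest |γ_i(f)| per frequency bin is one possible reading of equations (65) and (66), which are not reproduced in this section.

```python
import numpy as np

def rescale_distributed(y, x_all, eps=1e-12):
    """Rescaling for distributed microphones (around eqs. (65)-(66)).

    y     : extraction result, shape (F, T)
    x_all : observation signals of all microphones, shape (N, F, T)

    The candidate coefficients gamma_i(f) use the same assumed least-squares
    form as in the earlier sketch; the coefficient with the largest absolute
    value is selected per frequency bin (one possible interpretation)."""
    # gamma_i(f) for every microphone i and frequency bin f, shape (N, F).
    gammas = np.sum(np.conj(y)[None] * x_all, axis=2) / (
        np.sum(np.abs(y) ** 2, axis=1)[None] + eps)
    best_mic = np.argmax(np.abs(gammas), axis=0)             # shape (F,)
    gamma_max = gammas[best_mic, np.arange(gammas.shape[1])]
    # Eq. (66): apply the selected coefficient; also report which microphone won.
    return gamma_max[:, None] * y, best_mic
```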
When γ_{max} is determined, it also becomes known which microphone picks up the speaker's utterance most strongly. If the position of each microphone is known, this reveals approximately where in the space the speaker is located, and that information can also be exploited in the downstream processing.
For example, when the downstream processing is spoken dialogue, that is, when the technology of the present disclosure is used in a spoken dialogue system, the system's spoken response can be output from the loudspeaker presumed to be closest to the speaker, or the system's response can be varied according to the speaker's position.
[Effects obtained in this embodiment]
According to this embodiment, for example, the following effects can be obtained.
In the reference-signal-based sound source extraction of this embodiment, the multichannel observation signal of the section in which the target sound is present and a rough amplitude spectrogram of the target sound in that section are input, and the rough amplitude spectrogram is used as the reference signal, so that an extraction result that is more accurate than the reference signal, i.e., closer to the true target sound, is estimated.
In the processing, an objective function is prepared that reflects both the dependence between the reference signal and the extraction result and the independence between the extraction result and the other, virtual separation results, and the extraction filter is obtained as the solution that optimizes it. By using the deflation method employed in blind source separation, the output can be limited to the single source corresponding to the reference signal.
Owing to these features, the method has the following advantages over the prior art.
(1) Compared with blind source separation
Compared with the approach of applying blind source separation to the observation signal to generate multiple separation results and then selecting the one most similar to the reference signal, the method has the following advantages.
・There is no need to generate multiple separation results.
・In principle, in blind source separation the reference signal is used only for selection and does not contribute to the separation accuracy, whereas in the sound source extraction of the present disclosure the reference signal also contributes to improving the extraction accuracy.
(2) Compared with conventional adaptive beamformers
Extraction can be performed even when no observation signal outside the target section is available; that is, it is not necessary to separately prepare an observation signal acquired at a time when only the interfering sounds are present.
(3) Compared with reference-signal-based sound source extraction (e.g., the technique described in JP 2014-219467 A)
・The reference signal in the technique described in JP 2014-219467 A and the like is a time envelope, and the temporal change of the target sound was assumed to be common to all frequency bins. In contrast, the reference signal of this embodiment is an amplitude spectrogram, so an improvement in extraction accuracy can be expected when the temporal change of the target sound differs greatly from one frequency bin to another.
・In the technique described in that document, the reference signal was used only as the initial value of the iteration, so a sound source different from the reference signal could be extracted as a result of the iteration. In this embodiment, by contrast, the reference signal is used throughout the iteration as part of the source model, so the possibility of extracting a source different from the reference signal is small.
(4) Compared with independent deeply learned matrix analysis (IDLMA)
・IDLMA requires a different reference signal for each source, so it could not be applied when an unknown source was present, and it was applicable only when the number of microphones matched the number of sources. In contrast, this embodiment is applicable as long as a reference signal for the single source to be extracted can be prepared.
<Modifications>
Although one embodiment of the present disclosure has been described specifically above, the content of the present disclosure is not limited to the above-described embodiment, and various modifications based on the technical idea of the present disclosure are possible. In the description of the modifications, configurations identical or equivalent to those described above are given the same reference signs, and duplicated descriptions are omitted as appropriate.
(Integration of decorrelation and filter estimation)
For those update rules of the extraction filter that use eigenvalue decomposition, decorrelation and filter estimation can be combined into a single formula by using generalized eigenvalue decomposition. In that case, the processing corresponding to decorrelation can be skipped.
In the following, the derivation of the combined formula is explained using the TFVV Gaussian distribution of equation (32) as an example.
Setting k = 1 in equation (9) and rewriting it gives equation (67) below.
[Equation (67)]
q_1(f) is a filter that generates the extraction result directly from the observation signal before decorrelation (without going through the decorrelated observation signal). Transforming equation (34), which expresses the optimization problem corresponding to the TFVV Gaussian distribution, using equation (67) and equations (3) to (6) gives equation (68), an optimization problem in terms of q_1(f).
[Equation (68)]
This is a constrained minimization problem different from equation (34), but it can be solved with the method of Lagrange multipliers. Writing the Lagrange multiplier as λ and combining the expression to be optimized in equation (68) and the expression representing the constraint into a single objective function gives equation (69) below.
[Equation (69)]
Taking the partial derivative of equation (69) with respect to conj(q_1(f)), setting the result equal to 0, and rearranging gives equation (70).
[Equation (70)]
Equation (70) expresses a generalized eigenvalue problem, and λ is one of its eigenvalues. Further, multiplying both sides of equation (70) by q_1(f) from the left gives equation (71) below.
[Equation (71)]
The right-hand side of equation (71) is exactly the function to be minimized in equation (68). Therefore, the minimum of equation (71) is the smallest of the eigenvalues satisfying equation (70), and the desired extraction filter q_1(f) is the Hermitian transpose of the eigenvector corresponding to that smallest eigenvalue.
Let gev(A, B) denote a function that takes two matrices A and B as arguments, solves the generalized eigenvalue problem for them, and returns all the eigenvectors. Using this function, the eigenvectors of equation (70) can be written as in equation (72) below.
[Equation (72)]
As in equation (36), v_{min}(f), ..., v_{max}(f) in equation (72) are the eigenvectors, and v_{min}(f) is the eigenvector corresponding to the smallest eigenvalue. The extraction filter q_1(f) is the Hermitian transpose of v_{min}(f), as in equation (73).
[Equation (73)]
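A minimal Python sketch of this combined step for one frequency bin follows. The two matrices passed to gev(., .) in equation (72) are not reproduced in this section, so the pair (reference-weighted covariance, plain covariance) of the raw observation signal is an assumption consistent with folding decorrelation into the TFVV Gaussian update; the final step follows equation (73).

```python
import numpy as np
from scipy.linalg import eigh

def estimate_filter_gev_tfvv_gauss(x, r, eps=1e-9):
    """Direct filter q_1(f) via a generalized eigenvalue problem
    (around eqs. (70), (72), (73)) for one frequency bin.

    x : observation signal before decorrelation, shape (N, T)
    r : reference signal r(f, t), shape (T,)

    The matrix pair is assumed: a covariance weighted by 1/r(f,t)^2 and the
    plain observation covariance."""
    T = x.shape[1]
    weights = 1.0 / np.maximum(r, eps) ** 2
    A = (x * weights) @ x.conj().T / T      # assumed weighted covariance
    B = x @ x.conj().T / T                  # assumed observation covariance
    # Generalized Hermitian eigenproblem A v = lambda B v, eigenvalues ascending.
    eigvals, eigvecs = eigh(A, B)
    v_min = eigvecs[:, 0]                   # eigenvector of the smallest eigenvalue
    return v_min.conj()                     # q_1(f) = v_min^H, eq. (73)
```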
Similarly, when the TFVV Laplace distribution of equation (31) is used as the source model, equations (74) and (75) are obtained.
[Equation (74)]
[Equation (75)]
That is, the auxiliary variable b(f,t) is computed by equation (74), and the eigenvectors corresponding to the two matrices are then obtained by equation (75); the extraction filter q_1(f) is the Hermitian transpose of the eigenvector v_{min}(f) corresponding to the smallest eigenvalue (equation (73)). Since q_1(f) does not converge in a single pass, equations (74), (75), and (73) are executed until convergence or for a predetermined number of iterations.
The case in which the TFVV Student-t distribution of equation (33) is used as the source model and the case in which the bivariate Laplace distribution of equation (25) is used share some of the derived equations, so they are explained together. The equation for computing the auxiliary variable b(f,t) differs between the two: equation (76) below is used for the TFVV Student-t distribution and equation (77) below for the bivariate Laplace distribution.
[Equation (76)]
[Equation (77)]
On the other hand, both cases use equations (78) and (73) below to obtain the extraction filter q_1(f). As with the other models, q_1(f) does not converge in a single pass, so the iteration is repeated a predetermined number of times.
[Equation (78)]
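The common loop structure shared by these models can be sketched as follows. The auxiliary-variable formulas (74), (76), and (77) are not reproduced in this section, so they appear as a placeholder callable aux_variable, and the weighting of the covariance by 1/b(f,t) is likewise an assumption; the generalized eigenvalue decomposition and the Hermitian transpose of v_{min} follow equations (75)/(78) and (73).

```python
import numpy as np
from scipy.linalg import eigh

def estimate_filter_gev_iterative(x, r, aux_variable, num_iters=10, eps=1e-9):
    """Iterative generalized-eigenvalue filter update used for the TFVV Laplace,
    TFVV Student-t and bivariate Laplace models (one frequency bin).

    `aux_variable(x, r, q)` is a placeholder for eq. (74), (76) or (77) and must
    return b(f, t) with shape (T,); on the first pass q is None, so any needed
    initialization is left to this placeholder."""
    T = x.shape[1]
    B = x @ x.conj().T / T                       # assumed observation covariance
    q = None
    for _ in range(num_iters):
        b = aux_variable(x, r, q)                # eq. (74)/(76)/(77), placeholder
        weights = 1.0 / np.maximum(b, eps)       # assumed weighting by b(f, t)
        A = (x * weights) @ x.conj().T / T       # weighted covariance (eq. (75)/(78))
        _, eigvecs = eigh(A, B)                  # generalized EVD, ascending order
        q = eigvecs[:, 0].conj()                 # q_1(f) = v_min^H, eq. (73)
    return q
```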
[Other modifications]
The configurations, methods, steps, shapes, materials, numerical values, and the like given in the embodiment and modifications described above are merely examples; different configurations, methods, steps, shapes, materials, numerical values, and the like may be used as necessary, and they may be replaced with known ones. The configurations, methods, steps, shapes, materials, numerical values, and the like of the embodiment and the modifications can also be combined with one another as long as no technical contradiction arises.
Note that the content of the present disclosure is not to be interpreted as being limited by the effects exemplified in this specification.
The present disclosure may also adopt the following configurations.
(1)
A signal processing device comprising:
a reference signal generation unit to which a mixed sound signal, recorded by microphones arranged at different positions and containing a mixture of a target sound and sounds other than the target sound, is input, and which generates a reference signal corresponding to the target sound on the basis of the mixed sound signal; and
a sound source extraction unit that extracts, from the mixed sound signal, a signal that is similar to the reference signal and in which the target sound is further emphasized.
(2)
The signal processing device according to (1), further comprising a section detection unit that detects a section of the mixed sound signal in which the target sound is included.
(3)
The signal processing device according to (1) or (2), wherein the sound source extraction unit includes an extraction filter estimation unit that estimates a filter for extracting the signal in which the target sound is further emphasized.
(4)
The signal processing device according to (3), wherein the extraction filter estimation unit estimates the filter as a solution that optimizes an objective function reflecting both the dependence between the reference signal and the extraction result produced by the filter and the independence between the extraction result and the separation results of other virtual sound sources.
(5)
The signal processing device according to (4), wherein, as the source model included in the objective function and expressing the dependence between the reference signal and the extraction result, one of the following is used:
・a bivariate spherical distribution of the extraction result and the reference signal;
・a time-frequency-varying variance model that regards the reference signal as a value corresponding to the variance at each time-frequency point; or
・a model using a divergence between the absolute value of the extraction result and the reference signal.
(6)
The signal processing device according to (5), wherein a bivariate Laplace distribution is used as the bivariate spherical distribution.
(7)
The signal processing device according to (5), wherein one of the following is used as the time-frequency-varying variance model:
・a time-frequency-varying variance Gaussian distribution;
・a time-frequency-varying variance Laplace distribution; or
・a time-frequency-varying variance Student-t distribution.
(8)
The signal processing device according to (5), wherein one of the following is used as the divergence of the model using the divergence:
・the Euclidean distance or squared error between the absolute value of the extraction result and the reference signal;
・the Itakura-Saito distance between the power spectrum of the extraction result and the power spectrum of the reference signal;
・the Itakura-Saito distance between the amplitude spectrum of the extraction result and the amplitude spectrum of the reference signal; or
・the squared error between 1 and the ratio of the absolute value of the extraction result to the reference signal.
(9)
The signal processing device according to any one of (3) to (8), wherein the sound source extraction unit includes a pre-processing unit that performs decorrelation processing on the time-frequency-domain observation signal as pre-processing for the processing by the extraction filter estimation unit, and a post-processing unit that performs at least the processing of applying the filter to the mixed sound signal.
(10)
The signal processing device according to any one of (1) to (9), wherein the reference signal generation unit includes a neural network that receives a signal in which voices are mixed together and clean speech of a predetermined speaker acquired at a timing different from that signal and extracts the speech of that speaker, and the reference signal generation unit inputs the mixed sound signal and the clean speech to the neural network and generates, as the reference signal, an amplitude spectrogram generated from the output of the neural network.
(11)
The signal processing device according to any one of (1) to (9), wherein the reference signal generation unit estimates the direction of arrival of the target sound, generates a time-frequency mask that has the effect of keeping sound arriving from a predetermined direction while reducing sound arriving from the other directions, and generates, as the reference signal, the amplitude spectrogram obtained by applying the time-frequency mask to the amplitude spectrogram of the mixed sound signal.
(12)
The signal processing device according to any one of (1) to (11), wherein the reference signal generation unit generates the reference signal using a sensor different from the microphones.
(13)
The signal processing device according to any one of (1) to (12), wherein the reference signal generation unit generates a reference signal by inputting the extraction result produced by the filter estimated by the extraction filter estimation unit into a neural network.
(14)
The signal processing device according to any one of (1) to (13), wherein the microphones are microphones assigned to individual speakers.
(15)
The signal processing device according to (14), wherein the microphones are microphones worn by the speakers.
(16)
A signal processing method in which a mixed sound signal, recorded by microphones arranged at different positions and containing a mixture of a target sound and sounds other than the target sound, is input, a reference signal generation unit generates a reference signal corresponding to the target sound on the basis of the mixed sound signal, and a sound source extraction unit extracts, from the mixed sound signal, a signal that is similar to the reference signal and in which the target sound is further emphasized.
(17)
A program that causes a computer to execute a signal processing method in which a mixed sound signal, recorded by microphones arranged at different positions and containing a mixture of a target sound and sounds other than the target sound, is input, a reference signal generation unit generates a reference signal corresponding to the target sound on the basis of the mixed sound signal, and a sound source extraction unit extracts, from the mixed sound signal, a signal that is similar to the reference signal and in which the target sound is further emphasized.
15 ... Section estimation unit
16 ... Reference signal estimation unit
17 ... Sound source extraction unit
17A ... Pre-processing unit
17B ... Extraction filter estimation unit
17C ... Post-processing unit
20 ... Control unit
100 ... Sound source extraction device

Claims (17)

  1.  A signal processing device comprising:
      a reference signal generation unit to which a mixed sound signal, recorded by microphones arranged at different positions and containing a mixture of a target sound and sounds other than the target sound, is input, and which generates a reference signal corresponding to the target sound on the basis of the mixed sound signal; and
      a sound source extraction unit that extracts, from the mixed sound signal, a signal that is similar to the reference signal and in which the target sound is further emphasized.
  2.  The signal processing device according to claim 1, further comprising a section detection unit that detects a section of the mixed sound signal in which the target sound is included.
  3.  The signal processing device according to claim 1, wherein the sound source extraction unit includes an extraction filter estimation unit that estimates a filter for extracting the signal in which the target sound is further emphasized.
  4.  The signal processing device according to claim 3, wherein the extraction filter estimation unit estimates the filter as a solution that optimizes an objective function reflecting both the dependence between the reference signal and the extraction result produced by the filter and the independence between the extraction result and the separation results of other virtual sound sources.
  5.  The signal processing device according to claim 4, wherein, as the source model included in the objective function and expressing the dependence between the reference signal and the extraction result, one of the following is used:
      ・a bivariate spherical distribution of the extraction result and the reference signal;
      ・a time-frequency-varying variance model that regards the reference signal as a value corresponding to the variance at each time-frequency point; or
      ・a model using a divergence between the absolute value of the extraction result and the reference signal.
  6.  The signal processing device according to claim 5, wherein a bivariate Laplace distribution is used as the bivariate spherical distribution.
  7.  The signal processing device according to claim 5, wherein one of the following is used as the time-frequency-varying variance model:
      ・a time-frequency-varying variance Gaussian distribution;
      ・a time-frequency-varying variance Laplace distribution; or
      ・a time-frequency-varying variance Student-t distribution.
  8.  The signal processing device according to claim 5, wherein one of the following is used as the divergence of the model using the divergence:
      ・the Euclidean distance or squared error between the absolute value of the extraction result and the reference signal;
      ・the Itakura-Saito distance between the power spectrum of the extraction result and the power spectrum of the reference signal;
      ・the Itakura-Saito distance between the amplitude spectrum of the extraction result and the amplitude spectrum of the reference signal; or
      ・the squared error between 1 and the ratio of the absolute value of the extraction result to the reference signal.
  9.  The signal processing device according to claim 3, wherein the sound source extraction unit includes:
      a pre-processing unit that performs decorrelation processing on the time-frequency-domain observation signal as pre-processing for the processing by the extraction filter estimation unit; and
      a post-processing unit that performs at least the processing of applying the filter to the mixed sound signal.
  10.  The signal processing device according to claim 1, wherein the reference signal generation unit includes a neural network that receives a signal in which voices are mixed together and clean speech of a predetermined speaker acquired at a timing different from that signal and extracts the speech of that speaker, and the reference signal generation unit inputs the mixed sound signal and the clean speech to the neural network and generates, as the reference signal, an amplitude spectrogram generated from the output of the neural network.
  11.  The signal processing device according to claim 1, wherein the reference signal generation unit estimates the direction of arrival of the target sound, generates a time-frequency mask that has the effect of keeping sound arriving from a predetermined direction while reducing sound arriving from the other directions, and generates, as the reference signal, the amplitude spectrogram obtained by applying the time-frequency mask to the amplitude spectrogram of the mixed sound signal.
  12.  The signal processing device according to claim 1, wherein the reference signal generation unit generates the reference signal using a sensor different from the microphones.
  13.  The signal processing device according to claim 1, wherein the reference signal generation unit generates a reference signal by inputting the extraction result produced by the filter estimated by the extraction filter estimation unit into a neural network.
  14.  The signal processing device according to claim 1, wherein the microphones are microphones assigned to individual speakers.
  15.  The signal processing device according to claim 14, wherein the microphones are microphones worn by the speakers.
  16.  A signal processing method in which a mixed sound signal, recorded by microphones arranged at different positions and containing a mixture of a target sound and sounds other than the target sound, is input, a reference signal generation unit generates a reference signal corresponding to the target sound on the basis of the mixed sound signal, and a sound source extraction unit extracts, from the mixed sound signal, a signal that is similar to the reference signal and in which the target sound is further emphasized.
  17.  A program that causes a computer to execute a signal processing method in which a mixed sound signal, recorded by microphones arranged at different positions and containing a mixture of a target sound and sounds other than the target sound, is input, a reference signal generation unit generates a reference signal corresponding to the target sound on the basis of the mixed sound signal, and a sound source extraction unit extracts, from the mixed sound signal, a signal that is similar to the reference signal and in which the target sound is further emphasized.
PCT/JP2021/009764 2020-03-25 2021-03-11 Signal processing device, signal processing method, and program WO2021193093A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2020-053542 2020-03-25
JP2020053542A JP2021152623A (en) 2020-03-25 2020-03-25 Signal processing device, signal processing method and program

Publications (1)

Publication Number Publication Date
WO2021193093A1 true WO2021193093A1 (en) 2021-09-30

Family

ID=77887359

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/009764 WO2021193093A1 (en) 2020-03-25 2021-03-11 Signal processing device, signal processing method, and program

Country Status (2)

Country Link
JP (1) JP2021152623A (en)
WO (1) WO2021193093A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023127058A1 (en) * 2021-12-27 2023-07-06 日本電信電話株式会社 Signal filtering device, signal filtering method, and program
CN115775564B (en) * 2023-01-29 2023-07-21 北京探境科技有限公司 Audio processing method, device, storage medium and intelligent glasses


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011199474A (en) * 2010-03-18 2011-10-06 Hitachi Ltd Sound source separation device, sound source separating method and program for the same, video camera apparatus using the same and cellular phone unit with camera
JP2012234150A (en) * 2011-04-18 2012-11-29 Sony Corp Sound signal processing device, sound signal processing method and program

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DELCROIX, MARC ET AL.: "SINGLE CHANNEL TARGET SPEAKER EXTRACTION AND RECOGNITION WITH SPEAKER BEAM", PROC. ICASSP 2018, April 2018 (2018-04-01), pages 5554 - 5558, XP033401925, DOI: 10.1109/ICASSP.2018.8462661 *
KITAMURA DAICHI, SUMINO HAYATO, TAKAMUNE NORIHIRO, TAKAMICHI SHINNOSUKE, SARUWATARI HIROSHI, ONO NOBUTAKA: "Experimental Evaluation of Multichannel Audio Source Separation Based on IDLMA", IEICE TECHNICAL REPORT, 12 March 2018 (2018-03-12), pages 13 - 20, XP055858824, Retrieved from the Internet <URL:http://d-kitamura.net/pdf/paper/Kitamura2018EA03.pdf> [retrieved on 20211108] *

Also Published As

Publication number Publication date
JP2021152623A (en) 2021-09-30

Similar Documents

Publication Publication Date Title
JP7191793B2 (en) SIGNAL PROCESSING DEVICE, SIGNAL PROCESSING METHOD, AND PROGRAM
US9357298B2 (en) Sound signal processing apparatus, sound signal processing method, and program
US7533015B2 (en) Signal enhancement via noise reduction for speech recognition
US9668066B1 (en) Blind source separation systems
JP5230103B2 (en) Method and system for generating training data for an automatic speech recognizer
EP1993320B1 (en) Reverberation removal device, reverberation removal method, reverberation removal program, and recording medium
US8880396B1 (en) Spectrum reconstruction for automatic speech recognition
JP2011215317A (en) Signal processing device, signal processing method and program
JP2012234150A (en) Sound signal processing device, sound signal processing method and program
WO2021193093A1 (en) Signal processing device, signal processing method, and program
US8666737B2 (en) Noise power estimation system, noise power estimating method, speech recognition system and speech recognizing method
US8401844B2 (en) Gain control system, gain control method, and gain control program
Nesta et al. Blind source extraction for robust speech recognition in multisource noisy environments
Delcroix et al. Speech recognition in living rooms: Integrated speech enhancement and recognition system based on spatial, spectral and temporal modeling of sounds
KR20220022286A (en) Method and apparatus for extracting reverberant environment embedding using dereverberation autoencoder
Nesta et al. Robust Automatic Speech Recognition through On-line Semi Blind Signal Extraction
EP3847645B1 (en) Determining a room response of a desired source in a reverberant environment
Kulkarni et al. A review of speech signal enhancement techniques
EP1216527B1 (en) Apparatus and method for de-esser using adaptive filtering algorithms
Ishii et al. Blind noise suppression for Non-Audible Murmur recognition with stereo signal processing
JP3916834B2 (en) Extraction method of fundamental period or fundamental frequency of periodic waveform with added noise
US20240155290A1 (en) Signal processing apparatus, signal processing method, and program
US20220189498A1 (en) Signal processing device, signal processing method, and program
CN110675890B (en) Audio signal processing device and audio signal processing method
Acero et al. Speech/noise separation using two microphones and a VQ model of speech signals.

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21774220

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21774220

Country of ref document: EP

Kind code of ref document: A1