WO2020110228A1 - Information processing device, program and information processing method - Google Patents

Information processing device, program and information processing method

Info

Publication number
WO2020110228A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound
observation
microphone
time
spectral component
Prior art date
Application number
PCT/JP2018/043747
Other languages
French (fr)
Japanese (ja)
Inventor
訓 古田
松岡 文啓
Original Assignee
Mitsubishi Electric Corporation (三菱電機株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mitsubishi Electric Corporation
Priority to JP2020557460A (granted as JP6840302B2)
Priority to PCT/JP2018/043747
Publication of WO2020110228A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10K - SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
    • G10K11/00 - Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K11/18 - Methods or devices for transmitting, conducting or directing sound
    • G10K11/26 - Sound-focusing or directing, e.g. scanning
    • G10K11/34 - Sound-focusing or directing, e.g. scanning using electrical steering of transducer arrays, e.g. beam steering
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 - Voice signal separating
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00 - Details of transducers, loudspeakers or microphones
    • H04R1/20 - Arrangements for obtaining desired frequency or directional characteristics
    • H04R1/32 - Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only
    • H04R1/40 - Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 - Circuits for transducers, loudspeakers or microphones

Definitions

  • The present invention relates to an information processing device, a program, and an information processing method.
  • With the progress of digital signal processing technology in recent years, hands-free voice operation by voice recognition in automobiles and home living rooms, and hands-free calling, have become widespread; abnormal sound monitoring systems that detect sounds such as abnormal noises from machines or human screams have also been developed.
  • These hands-free voice operation systems, hands-free call systems, and abnormal sound monitoring systems use microphones installed to collect a target sound, such as speech or an abnormal sound, under various noise environments such as in a moving automobile, a factory, an office, or a living room at home.
  • However, such a microphone collects not only the target sound but also ambient noise and voices other than the target sound (hereinafter referred to as interfering sound).
  • Methods for individually extracting the target sound using a plurality of microphones include beamforming, in which signal processing steers directivity toward the target sound or a blind spot (null) toward the interfering sound, and methods that estimate a mixing matrix by independent component analysis.
  • However, while beamforming is effective at suppressing noise, it is not very effective at separating voices, and independent component analysis suffers degraded performance under reverberation or noise.
  • Furthermore, in real environments the interfering sound does not necessarily come from a single noise source, and there is the constraint that it is difficult to separate more sound sources than there are microphones.
  • To address this, a method called binary masking has been proposed: under the sparseness assumption that the target sound signal and the interfering sound signal do not overlap in the time-frequency domain, frequency components other than the target sound are masked to separate the sound sources. Binary masking is easy to implement and is an effective method for suppressing directional interfering sound.
  • As a method based on binary masking, Patent Document 1 discloses a technique for increasing the accuracy of binary masking for mixed speech whose sparseness is not guaranteed, by intentionally creating an amplitude difference between the power spectra.
  • However, because this conventional method intentionally creates a power difference between the power spectra of the main microphone input signal and the sub microphone input signal, an error arises in the mask coefficients.
  • One or more aspects of the present invention have been made to solve this problem, and an object thereof is to make it possible to obtain a high-quality target signal easily.
  • An information processing apparatus according to one aspect of the present invention includes: an analog/digital conversion unit that receives a first observed analog signal generated by a first microphone from an observed sound including a target sound arriving from a first direction and a second observed analog signal generated by a second microphone from the observed sound, and converts each of them into a digital signal to generate a first observed digital signal and a second observed digital signal; a time/frequency conversion unit that converts each of the first observed digital signal and the second observed digital signal into a frequency-domain signal to generate a first spectral component and a second spectral component; a mask generation unit that uses a cross-correlation function of the first spectral component and the second spectral component to calculate, from the time difference between the time at which the observed sound arrives at the first microphone and the time at which it arrives at the second microphone, a filtering coefficient for masking spectral components of sound arriving from directions different from the first direction; a masking filter unit that separates a spectral component by masking the first spectral component using the filtering coefficient; and a time/frequency inverse conversion unit that generates an output digital signal by converting the separated spectral component into a time-domain signal.
  • A program according to one aspect of the present invention causes a computer to function as the analog/digital conversion unit, the time/frequency conversion unit, the mask generation unit, the masking filter unit, and the time/frequency inverse conversion unit described above.
  • An information processing method according to one aspect of the present invention receives the first observed analog signal and the second observed analog signal, converts each of them into a digital signal to generate a first observed digital signal and a second observed digital signal, converts each of these into a frequency-domain signal to generate a first spectral component and a second spectral component, uses a cross-correlation function of the two spectral components to calculate, from the time difference between the time at which the observed sound arrives at the first microphone and the time at which it arrives at the second microphone, a filtering coefficient for masking spectral components of sound arriving from directions different from the first direction, separates a spectral component by masking the first spectral component using the filtering coefficient, and generates an output digital signal by converting the separated spectral component into a time-domain signal.
  • FIG. 1 is a block diagram schematically showing the configuration of a sound source separation device according to the first embodiment.
  • FIG. 2 is a block diagram schematically showing the internal configuration of a mask generation unit in the first to third embodiments.
  • FIG. 3 is a schematic diagram for explaining the arrangement of the first microphone and the second microphone.
  • FIGS. 4(A) to 4(C) are graphs for explaining the utterance amount ratio when the target speaker and the interfering speaker speak.
  • FIGS. 5(A) and 5(B) are graphs for explaining the effect of the first embodiment.
  • FIG. 6 is a block diagram showing a first hardware configuration example of the sound source separation device.
  • FIG. 7 is a block diagram showing a second hardware configuration example of the sound source separation device.
  • FIG. 8 is a flowchart showing the operation of the sound source separation device.
  • FIG. 9 is a block diagram schematically showing the configuration of an information processing system including a sound source separation device according to a second embodiment.
  • FIG. 10 is a schematic diagram showing an example of a method of excluding the influence of noise other than the target sound and the interfering sound.
  • FIG. 1 is a block diagram schematically showing the configuration of a sound source separation device 100 as an information processing device according to the first embodiment.
  • As shown in FIG. 1, the sound source separation device 100 includes an analog/digital conversion unit (hereinafter, A/D conversion unit) 103, a time/frequency conversion unit (hereinafter, T/F conversion unit) 104, a mask generation unit 105, a masking filter unit 110, a time/frequency inverse conversion unit (hereinafter, T/F inverse conversion unit) 111, and a digital/analog conversion unit (hereinafter, D/A conversion unit) 112.
  • The sound source separation device 100 is connected to a first microphone 101 and a second microphone 102.
  • FIG. 2 is a block diagram schematically showing the internal configuration of the mask generation unit 105.
  • As shown in FIG. 2, the mask generation unit 105 includes a mask coefficient calculation unit 106, an utterance amount ratio calculation unit 107, a gain calculation unit 108, and a mask correction unit 109.
  • The sound source separation device 100 forms a masking filter from frequency-domain signals generated from the time-domain signals acquired by the first microphone 101 and the second microphone 102, and multiplies the frequency-domain signal corresponding to the signal acquired by the first microphone 101 by this masking filter to obtain an output signal of the target sound from which the interfering sound has been removed.
  • In the following, the first observed analog signal acquired by the first microphone 101 is also referred to as the first channel Ch1, and the second observed analog signal acquired by the second microphone 102 is also referred to as the second channel Ch2.
  • In the following, it is assumed that the first microphone 101 and the second microphone 102 are located in the same horizontal plane, that their positions are known and do not change over time, and that the ranges of directions from which the target sound and the interfering sound can arrive also do not change over time. The direction from which the target sound arrives is also called the first direction, and the direction from which the interfering sound arrives is also called the second direction. Here, the target sound and the interfering sound are assumed to be voices from different single speakers.
  • The first microphone 101 generates the first observed analog signal by converting the observed sound into an electrical signal, and gives it to the A/D conversion unit 103.
  • The second microphone 102 generates the second observed analog signal by converting the observed sound into an electrical signal, and gives it to the A/D conversion unit 103.
  • The A/D conversion unit 103 performs analog/digital conversion (hereinafter, A/D conversion) on each of the first observed analog signal from the first microphone 101 and the second observed analog signal from the second microphone 102, thereby generating a first observed digital signal and a second observed digital signal.
  • Specifically, the A/D conversion unit 103 samples the first observed analog signal from the first microphone 101 at a predetermined sampling frequency and converts it into a digital signal divided into frame units, thereby generating the first observed digital signal.
  • Similarly, the A/D conversion unit 103 samples the second observed analog signal from the second microphone 102 at the predetermined sampling frequency and converts it into a digital signal divided into frame units, thereby generating the second observed digital signal.
  • The sampling frequency is, for example, 16 kHz, and the frame unit is, for example, 16 ms.
  • In the following, the first observed digital signal generated from the first observed analog signal in the frame interval corresponding to sample number t is denoted x1(t), and the second observed digital signal generated from the second observed analog signal in that frame interval is denoted x2(t).
  • The first observed digital signal x1(t) and the second observed digital signal x2(t) are given to the T/F conversion unit 104.
  • The T/F conversion unit 104 receives the first observed digital signal x1(t) and the second observed digital signal x2(t), and converts these time-domain signals into a first short-time spectral component X1(ω, τ) and a second short-time spectral component X2(ω, τ) in the frequency domain.
  • Here, ω represents a spectrum number, i.e., a discrete frequency, and τ represents a frame number.
  • For example, the T/F conversion unit 104 performs a 512-point fast Fourier transform on the first observed digital signal x1(t) to generate the first short-time spectral component X1(ω, τ), and similarly generates the second short-time spectral component X2(ω, τ) from the second observed digital signal x2(t).
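As a concrete illustration of this step, here is a minimal sketch in Python assuming 16 kHz sampling, 16 ms frames (256 samples), non-overlapping frames, and a Hann analysis window; the window choice and the absence of frame overlap are assumptions, not details taken from the patent text.

```python
import numpy as np

FS = 16000      # sampling frequency [Hz], as in the text
FRAME = 256     # 16 ms frame at 16 kHz
NFFT = 512      # FFT size given in the text

def to_spectra(x, frame=FRAME, nfft=NFFT):
    """Convert a time-domain signal x(t) into short-time spectra X(ω, τ)."""
    n_frames = len(x) // frame
    spectra = []
    for tau in range(n_frames):
        seg = x[tau * frame:(tau + 1) * frame] * np.hanning(frame)
        spectra.append(np.fft.rfft(seg, nfft))   # zero-padded 512-point FFT
    return np.array(spectra)                     # shape: (frames, nfft//2 + 1)

# Usage: X1 = to_spectra(x1); X2 = to_spectra(x2)
```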
  • In the following, a short-time spectral component is simply referred to as a spectral component.
  • The mask generation unit 105 receives the first spectral component X1(ω, τ) and the second spectral component X2(ω, τ), and calculates the time-frequency filter coefficient b_mod(ω, τ), which is the filtering coefficient for performing the masking that separates the target sound.
  • Specifically, the mask generation unit 105 uses a cross-correlation function of the first spectral component X1(ω, τ) and the second spectral component X2(ω, τ) to calculate, from the time difference between the time at which the observed sound arrives at the first microphone 101 and the time at which it arrives at the second microphone 102, a filtering coefficient for masking spectral components of sound arriving from directions different from the first direction from which the target sound arrives.
  • In obtaining the time-frequency filter coefficient b_mod(ω, τ), it is assumed, as shown in FIG. 3, that in the horizontal plane containing the first microphone 101 and the second microphone 102, the target sound arrives from a direction within a predetermined angle θ of the perpendicular direction V1 of the first microphone 101 and the perpendicular direction V2 of the second microphone 102, and that the interfering sound arrives from the side of V1 and V2 opposite to the target sound.
  • The perpendicular directions V1 and V2 are perpendicular to the straight line connecting the first microphone 101 and the second microphone 102; they are a predetermined reference direction and need not be vertical.
  • The distance between the first microphone 101 and the second microphone 102 is denoted d.
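For reference, the conversion between the angular range θ and an arrival time difference, which the description relies on but whose equation is not reproduced in this text, follows from standard far-field two-microphone geometry, where c is the speed of sound (about 340 m/s):

$$\delta_{\max} = \frac{d \sin \theta}{c}$$

A sound whose inter-microphone time difference satisfies |δ| ≤ δ_max on the target side can then be attributed to directions within the angle θ.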
  • To determine whether the sound collected by the first microphone 101 and the second microphone 102 is the target sound or the interfering sound, it is necessary to estimate, using the signals from the two microphones, whether the direction of arrival of the sound is within the desired range.
  • Since the time difference that occurs between the signals from the first microphone 101 and the second microphone 102 is determined by the angle θ, the direction of arrival can be estimated using this time difference. This is described below with reference to FIGS. 2 and 3.
  • The mask coefficient calculation unit 106 first calculates the cross spectrum D(ω, τ), a cross-correlation function of the first spectral component X1(ω, τ) and the second spectral component X2(ω, τ), as shown in the following equation (1).
  • The mask coefficient calculation unit 106 gives the calculated cross spectrum D(ω, τ) to the utterance amount ratio calculation unit 107.
  • The mask coefficient calculation unit 106 also obtains the phase φD(ω, τ) of the cross spectrum D(ω, τ) using the following equation (2).
  • Here, Q(ω, τ) and K(ω, τ) represent the imaginary part and the real part of the cross spectrum D(ω, τ), respectively.
  • The phase φD(ω, τ) obtained by equation (2) is the phase angle between the spectral components of the first channel Ch1 and the second channel Ch2 at each discrete frequency ω, and dividing it by ω gives the time delay between the two signals. That is, the time difference δ(ω, τ) between the first channel Ch1 and the second channel Ch2 can be expressed by the following equation (3).
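The bodies of equations (1) to (3) do not appear in this text. A standard formulation consistent with the surrounding definitions, writing the phase here as φ_D and complex conjugation as an overline (the exact notation of the original may differ), is:

$$D(\omega,\tau) = X_1(\omega,\tau)\,\overline{X_2(\omega,\tau)} \quad (1)$$

$$\varphi_D(\omega,\tau) = \tan^{-1}\!\left(\frac{Q(\omega,\tau)}{K(\omega,\tau)}\right) \quad (2)$$

$$\delta(\omega,\tau) = \frac{\varphi_D(\omega,\tau)}{\omega} \quad (3)$$

where ω is interpreted as an angular frequency in equation (3).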
  • The mask coefficient b(ω, τ) for performing the masking that separates the target sound can then be expressed by the following equation (5).
  • That is, the mask coefficient calculation unit 106 uses the cross-correlation function of the first spectral component X1(ω, τ) and the second spectral component X2(ω, τ) to obtain a first time difference between the time at which the target sound arrives at the first microphone 101 and the time at which it arrives at the second microphone 102, and a second time difference between the time at which the interfering sound arrives at the first microphone 101 and the time at which it arrives at the second microphone 102, and from these calculates a mask coefficient for separating the spectral components of sound arriving from directions in a first range including the first direction from the spectral components of sound arriving from directions in a second range including the second direction.
  • The mask coefficient b(ω, τ) shown in equation (5) is 1 when the sound is estimated to be the target sound, and M when it is estimated to be the interfering sound.
  • When M = 0, the mask coefficient takes the binary values 1 or 0, and a filter having such mask coefficients is called a binary mask.
  • A fractional value other than these binary values may also be used as the filter coefficient; a filter of this kind is also called a soft mask.
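Since the body of equation (5) is not reproduced in this text, the following Python sketch shows one plausible form of the mask coefficient consistent with the description: 1 for bins whose time difference indicates the target side, a constant M otherwise. The sign convention (target side corresponds to δ ≥ 0) and the threshold name delta_max are assumptions.

```python
import numpy as np

def mask_coefficient(delta, delta_max, M=0.0):
    """b(ω, τ): 1 where the per-bin time difference δ(ω, τ) indicates
    arrival from the target-sound range, M otherwise. With M = 0 this is
    a binary mask; 0 < M < 1 gives a soft mask."""
    return np.where((delta >= 0.0) & (delta <= delta_max), 1.0, M)
```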
  • The mask coefficient calculation unit 106 gives the mask coefficient b(ω, τ) to the mask correction unit 109.
  • The utterance amount ratio calculation unit 107 calculates, from the first spectral component X1(ω, τ) of the first channel Ch1, the second spectral component X2(ω, τ) of the second channel Ch2, and the cross spectrum D(ω, τ), the utterance amount ratio, which is the ratio between the utterance amount of the target-sound speaker and the utterance amount of the interfering-sound speaker.
  • In other words, the utterance amount ratio is the ratio of the amount of spectral components, within the first spectral component X1(ω, τ), of sound arriving from the first range including the first direction from which the target sound arrives, to the amount of spectral components of sound arriving from the second range including the second direction from which the interfering sound arrives.
  • Specifically, the utterance amount ratio calculation unit 107 calculates the first power spectrum P1(ω, τ) of the first channel Ch1 from the first spectral component X1(ω, τ) using the following equation (6), where XRe and XIm are the real part and the imaginary part of X1(ω, τ), respectively.
  • Next, the utterance amount ratio calculation unit 107 uses the sign of the imaginary part Q(ω, τ) of the cross spectrum D(ω, τ) of equation (1) to determine whether the observed analog signal arrives from the target-sound side or from the interfering-sound side. According to this sign determination, it then accumulates the first power spectrum P1(ω, τ) of the first channel Ch1 as shown in the following equation (7), obtaining the utterance amount sTgt(τ) of the target speaker and the utterance amount sInt(τ) of the interfering speaker.
  • The utterance amount ratio calculation unit 107 then obtains the utterance amount ratio SR(τ) from the two utterance amounts sTgt(τ) and sInt(τ) by the following equation (8).
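Equations (6) to (8) are likewise not reproduced here; the following sketch shows the computation they describe. Treating bins with a non-negative imaginary part of D as the target side is an assumed sign convention.

```python
import numpy as np

def utterance_amount_ratio(X1, D):
    """Equations (6)-(8) as described: P1 is the power spectrum of Ch1
    (eq. 6); each bin is attributed to the target or interfering speaker
    by the sign of Im{D}, and the attributed powers are accumulated per
    frame (eq. 7); their ratio is SR(τ) (eq. 8)."""
    P1 = X1.real**2 + X1.imag**2
    target_side = D.imag >= 0.0
    s_tgt = np.where(target_side, P1, 0.0).sum(axis=-1)
    s_int = np.where(~target_side, P1, 0.0).sum(axis=-1)
    return s_tgt / (s_int + 1e-12)   # small epsilon avoids division by zero
```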
  • FIGS. 4(A) to 4(C) are graphs for explaining the utterance amount ratio SR(τ) when the target speaker and the interfering speaker speak.
  • FIG. 4(A) is a graph showing an example of the time waveform of the observed analog signal acquired by the first microphone 101.
  • FIG. 4(B) is a graph showing an example of the time variation of the utterance amounts of the target-sound speaker and the interfering-sound speaker.
  • FIG. 4(C) is a graph showing an example of the time variation of the utterance amount ratio SR(τ) obtained from these utterance amounts.
  • By correcting the mask coefficient according to the utterance amount ratio SR(τ), the target sound can be separated with high accuracy and little distortion. More specifically, in frames where the utterance amount ratio SR(τ) is small, the masking is strengthened to suppress the interfering sound more strongly and enhance the separation performance, while in frames where SR(τ) is large, the masking is weakened to reduce distortion of the target sound.
  • The gain calculation unit 108 uses the utterance amount ratio SR(τ) obtained by equation (8) to calculate, by the following equation (9), a correction gain g(ω, τ) for correcting the constant M in the mask coefficient b(ω, τ) of equation (5).
  • Here, GTgt, GInt, and GDT are predetermined correction gain constants: GTgt applies when the observed analog signal is highly likely to contain only the target sound, GInt applies when it is highly likely to contain only the interfering sound, and GDT applies when it is highly likely to contain both the target sound and the interfering sound.
  • When only the target sound is likely to be present, M in equation (5) is controlled to be larger, in other words, the suppression amount of the mask is controlled to be smaller; the corrected M is limited to a value of 1 or less. Conversely, when only the interfering sound is likely to be present, M in equation (5) is controlled to be smaller, in other words, the suppression amount of the interfering sound is controlled to be larger. That is, the gain calculation unit 108 calculates a correction gain that corrects the mask coefficient so that the masking strength decreases as the utterance amount ratio increases.
  • Because this requires only the utterance amount ratio, obtained from a simple power calculation on the observed analog signals, and conditional expressions that compare the utterance amount ratio against thresholds, the calculation cost is low and the mask coefficient can be corrected efficiently.
  • K(ω) is a frequency correction coefficient, a positive number of 1 or less, set so that its value increases with frequency, as shown in the following equation (10). The frequency correction coefficient is not limited to this example, however.
  • The correction gain constants and the thresholds on the utterance amount ratio SR(τ) are likewise not limited to those of equation (9) and can be adjusted appropriately according to the characteristics of the target sound or the interfering sound. Furthermore, the condition for determining the correction gain is not limited to the three stages of equation (9) and may be set in more stages.
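A sketch of the gain selection of equations (9) and (10); all threshold values and gain constants below are illustrative assumptions, since the actual constants are not given in this text.

```python
def correction_gain(sr, k_omega, thr_hi=10.0, thr_lo=0.1,
                    g_tgt=1.5, g_dt=1.0, g_int=0.5):
    """Choose a correction gain constant from the utterance amount ratio
    SR(τ) (three stages, as in eq. (9)), then weight it by the frequency
    correction coefficient K(ω) of eq. (10), which grows with frequency."""
    if sr > thr_hi:        # likely target sound only: weaken the masking
        g = g_tgt
    elif sr < thr_lo:      # likely interfering sound only: strengthen it
        g = g_int
    else:                  # likely double talk
        g = g_dt
    return g * k_omega     # g(ω, τ)
```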
  • The mask correction unit 109 corrects the mask coefficient b(ω, τ) obtained by equation (5) using the correction gain g(ω, τ) obtained by equation (9), thereby obtaining the time-frequency filter coefficient b_mod(ω, τ) as shown in the following equation (11).
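One plausible form of equation (11), consistent with the statements that the gain corrects the constant M and that the corrected value is limited to 1 or less; the exact formula is not reproduced in this text.

```python
import numpy as np

def corrected_mask(b, g):
    """b_mod(ω, τ): leave bins classified as target (b == 1) unchanged,
    scale the constant M of the remaining bins by the correction gain,
    and clip the result to a value of 1 or less."""
    return np.where(b >= 1.0, b, np.minimum(b * g, 1.0))
```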
  • As shown in the following equation (12), the masking filter unit 110 multiplies the first spectral component X1(ω, τ) on the first microphone 101 side by the time-frequency filter coefficient b_mod(ω, τ) obtained by equation (11) to calculate the spectral component Y(ω, τ).
  • The masking filter unit 110 gives the calculated spectral component Y(ω, τ) to the T/F inverse transform unit 111.
  • The spectral component Y(ω, τ) separated here is also referred to as the target spectral component, i.e., a spectral component containing the target sound.
  • The T/F inverse transform unit 111 performs, for example, an inverse fast Fourier transform on the spectral component Y(ω, τ) to calculate the output digital signal y(t), and gives it to the D/A conversion unit 112.
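A minimal sketch of this inverse transform step; simple frame concatenation without overlap-add matches the non-overlapping framing assumed in the earlier sketch and is a simplification, not a detail from the patent.

```python
import numpy as np

FRAME = 256   # 16 ms at 16 kHz, as assumed earlier
NFFT = 512

def to_time_domain(Y, frame=FRAME, nfft=NFFT):
    """Convert separated spectral components Y(ω, τ) back into the
    time-domain output digital signal y(t), frame by frame."""
    frames = [np.fft.irfft(Y_tau, nfft)[:frame] for Y_tau in Y]
    return np.concatenate(frames)
```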
  • The D/A conversion unit 112 generates an output signal by converting the output digital signal y(t) into an analog signal.
  • The generated output signal is output to an external device such as a voice recognition device, a hands-free call device, or an abnormal sound monitoring device.
  • FIGS. 5(A) and 5(B) are graphs for explaining the effect of the first embodiment. Like FIG. 4(A), FIG. 5(A) shows an example of the time waveform of the observed analog signal acquired by the first microphone 101, and FIG. 5(B) shows an example of the time variation of the output signal output from the D/A conversion unit 112. As is clear from FIGS. 5(A) and 5(B), the interfering sound is almost entirely removed from the output signal and only the target sound remains.
  • The sound source separation device 100 described above can be realized by a computer with a built-in CPU (Central Processing Unit), such as a tablet computer or a microcomputer embedded in a device such as a car navigation system.
  • Alternatively, the sound source separation device 100 may be realized by an LSI (Large Scale Integration) circuit such as a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), or an FPGA (Field-Programmable Gate Array).
  • FIG. 6 is a block diagram showing a hardware configuration example of the sound source separation device 100 configured using an LSI such as a DSP, an ASIC, or an FPGA.
  • In this case, the sound source separation device 100 includes a signal input/output unit 131, a signal processing circuit 132, a recording medium 133, and a signal path 134 such as a bus.
  • The signal input/output unit 131 is an interface circuit that provides the connection to the microphone circuit 140 and the external device 141.
  • The microphone circuit 140 corresponds to the first microphone 101 and the second microphone 102; for example, a device that captures acoustic vibration and converts it into an electrical signal can be used.
  • The functions of the T/F conversion unit 104, the mask generation unit 105, the masking filter unit 110, and the T/F inverse conversion unit 111 shown in FIG. 1 can be realized by the signal processing circuit 132 and the recording medium 133, and the A/D conversion unit 103 and the D/A conversion unit 112 in FIG. 1 can be realized by the signal input/output unit 131.
  • The recording medium 133 is used to store various data such as setting data of the signal processing circuit 132 and signal data.
  • As the recording medium 133, a volatile memory such as an SDRAM (Synchronous Dynamic Random Access Memory), or a non-volatile memory such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive), can be used.
  • The recording medium 133 can store the initial state of the sound source separation processing, various setting data, constant data for control, and the like.
  • The output digital signal produced by the sound source separation processing in the signal processing circuit 132 is sent from the signal input/output unit 131 to the external device 141.
  • The external device 141 corresponds to, for example, a voice recognition device, a hands-free call device, or an abnormal sound monitoring device.
  • FIG. 7 is a block diagram showing a hardware configuration example of the sound source separation device 100 configured by using a computing device such as a computer.
  • In this case, the sound source separation device 100 includes a signal input/output unit 131, a processor 136 including a CPU 135, a memory 137, a recording medium 138, and a signal path 134 such as a bus.
  • The signal input/output unit 131 is an interface circuit that provides the connection to the microphone circuit 140 and the external device 141.
  • The memory 137 includes a ROM (Read Only Memory) and a RAM (Random Access Memory), and serves as a program memory that stores various programs for implementing the sound source separation processing, as a work memory used when the processor 136 performs data processing, and as a memory into which signal data is expanded.
  • The functions of the T/F conversion unit 104, the mask generation unit 105, the masking filter unit 110, and the T/F inverse conversion unit 111 can be realized by the processor 136, the memory 137, and the recording medium 138, and the A/D conversion unit 103 and the D/A conversion unit 112 can be realized by the signal input/output unit 131.
  • The recording medium 138 is used to accumulate various data such as setting data of the processor 136 and signal data.
  • As the recording medium 138, a volatile memory such as an SDRAM, or a non-volatile memory such as an HDD or an SSD, can be used. It can store programs including an OS (Operating System), various setting data, and various data such as audio signal data.
  • The data in the memory 137 can also be stored in the recording medium 138.
  • The processor 136 operates according to a computer program read from the memory 137, using the memory 137 as a working memory, and can thereby function as the T/F conversion unit 104, the mask generation unit 105, the masking filter unit 110, and the T/F inverse conversion unit 111.
  • The output signal generated by the sound source separation processing performed by the processor 136 is sent from the signal input/output unit 131 to the external device 141.
  • The external device 141 corresponds to, for example, a voice recognition device, a hands-free call device, or an abnormal sound monitoring device.
  • The program executed by the processor 136 may be stored in a storage device inside the computer that executes it, or may be distributed in a storage medium such as a CD-ROM. It is also possible to acquire the program from another computer through a wireless or wired network such as a LAN (Local Area Network). Such a program may be provided, for example, as a program product.
  • Various data may also be transmitted and received as digital signals over a wireless or wired network, without conversion between analog and digital signals.
  • The program executed by the processor 136 may also be combined with a program executed by the external device 141, for example a program that causes a computer to function as a voice recognition device, a hands-free call device, or an abnormal sound monitoring device; the combined programs may run on the same computer, or may be distributed across and run on multiple computers.
  • The external device 141 may also include the sound source separation device 100; that is, the voice recognition device, the hands-free call device, or the abnormal sound monitoring device may be configured to include the sound source separation device 100.
  • FIG. 8 is a flowchart showing the operation of the sound source separation device 100.
  • First, the A/D conversion unit 103 takes in the first observed analog signal and the second observed analog signal input from the first microphone 101 and the second microphone 102 at predetermined frame intervals, and A/D converts each of them to generate the first observed digital signal x1(t) and the second observed digital signal x2(t), which are given to the T/F conversion unit 104 (S10).
  • The output from the A/D conversion unit 103 is repeated while the sample number t is smaller than a predetermined value T (No in S11).
  • In step S12, the T/F conversion unit 104 performs, for example, a 512-point fast Fourier transform on each of the first observed digital signal x1(t) and the second observed digital signal x2(t) to calculate the first spectral component X1(ω, τ) and the second spectral component X2(ω, τ). It then gives the first spectral component X1(ω, τ) and the second spectral component X2(ω, τ) to the mask generation unit 105, and gives the first spectral component X1(ω, τ) to the masking filter unit 110.
  • Next, the mask generation unit 105 calculates the time-frequency filter coefficient b_mod(ω, τ) for masking, using the first spectral component X1(ω, τ) and the second spectral component X2(ω, τ) (S13). Step S13 consists of the following sub-steps S13A to S13D.
  • In step S13A, the mask coefficient calculation unit 106 obtains the cross spectrum D(ω, τ) from the cross-correlation function of the first spectral component X1(ω, τ) and the second spectral component X2(ω, τ), and calculates the mask coefficient b(ω, τ) based on the obtained cross spectrum.
  • The mask coefficient calculation unit 106 gives the cross spectrum D(ω, τ) to the utterance amount ratio calculation unit 107 and the mask coefficient b(ω, τ) to the mask correction unit 109. The process then proceeds to step S13B.
  • In step S13B, the utterance amount ratio calculation unit 107 calculates, from the first spectral component X1(ω, τ), the second spectral component X2(ω, τ), and the cross spectrum D(ω, τ), the utterance amount ratio SR(τ), which is the ratio between the utterance amount of the target-sound speaker and that of the interfering-sound speaker.
  • The utterance amount ratio calculation unit 107 gives the utterance amount ratio SR(τ) to the gain calculation unit 108. The process then proceeds to step S13C.
  • In step S13C, the gain calculation unit 108 calculates the correction gain g(ω, τ) for correcting the mask coefficient b(ω, τ) using the utterance amount ratio SR(τ).
  • The gain calculation unit 108 gives the correction gain g(ω, τ) to the mask correction unit 109. The process then proceeds to step S13D.
  • In step S13D, the mask correction unit 109 corrects the mask coefficient b(ω, τ) using the correction gain g(ω, τ) to obtain the time-frequency filter coefficient b_mod(ω, τ), which it gives to the masking filter unit 110.
  • Next, the masking filter unit 110 multiplies the first spectral component X1(ω, τ) by the time-frequency filter coefficient b_mod(ω, τ) to calculate the spectral component Y(ω, τ) of the output digital signal y(t) (S14), and gives Y(ω, τ) to the T/F inverse transform unit 111.
  • The T/F inverse transform unit 111 transforms the spectral component Y(ω, τ) into the time-domain output digital signal y(t) by performing an inverse fast Fourier transform (S15).
  • The D/A conversion unit 112 converts the output digital signal y(t) into an analog output signal by D/A conversion and outputs it to the outside (S16). The output from the D/A conversion unit 112 is repeated while the sample number t is smaller than the predetermined value T (Yes in S17).
  • As described above, the sound source separation device 100 can create a masking filter with high separation performance at low calculation cost. The target sound can therefore be acquired accurately, making it possible to provide a high-accuracy voice recognition device, a high-quality hands-free call device, and an abnormal sound monitoring device with high detection accuracy.
  • Embodiment 2. The first embodiment assumed that the observed sound consists of voices; the second embodiment describes a configuration that can also be applied when there is noise other than the voices constituting the target sound and the interfering sound.
  • FIG. 9 is a block diagram schematically showing a configuration of an information processing system 250 including the sound source separation device 200 according to the second embodiment.
  • The information processing system 250 shown here is an example of a car navigation system, for the case where a speaker seated in the driver's seat and a speaker seated in the passenger seat talk in a moving automobile.
  • In the following, the speaker seated in the driver's seat is referred to as the target-sound speaker, and the speaker seated in the passenger seat is referred to as the interfering-sound speaker.
  • The information processing system 250 includes a first microphone 101, a second microphone 102, a sound source separation device 200, and an external device 141.
  • The first microphone 101 and the second microphone 102 in the second embodiment are the same as those in the first embodiment.
  • The external device 141 is the same as the external device 141 described with reference to FIG. 6 or FIG. 7.
  • In this environment, besides the two voices there is noise such as vehicle running noise, the received voice of the far-end speaker reproduced from a loudspeaker during a hands-free call, guidance voice from the car navigation system, and acoustic echo from music played by the car audio system. Sounds other than the voices of the target speaker and the interfering speaker are treated as noise signals.
  • The sound source separation device 200 therefore excludes, when calculating the utterance amount ratio, the spectral components of sound arriving from directions contained in neither the first range, which includes the first direction from which the target sound arrives, nor the second range, which includes the second direction from which the interfering sound arrives.
  • The external device 141 is, for example, a voice recognition device, a hands-free call device, or an abnormal sound monitoring device, as described above.
  • The external device 141 performs, for example, voice recognition processing, hands-free call processing, or abnormal sound detection processing, and obtains an output result for each type of processing.
  • The sound source separation device 200 includes an A/D conversion unit 103, a T/F conversion unit 104, a mask generation unit 205, a masking filter unit 110, and a T/F inverse conversion unit 111.
  • The A/D conversion unit 103, the T/F conversion unit 104, the masking filter unit 110, and the T/F inverse conversion unit 111 of the sound source separation device 200 according to the second embodiment are the same as those of the sound source separation device 100 according to the first embodiment; however, in the sound source separation device 200 the output digital signal y(t) generated by the T/F inverse conversion unit 111 is given to the external device 141.
  • The mask generation unit 205 includes a mask coefficient calculation unit 106, an utterance amount ratio calculation unit 207, a gain calculation unit 108, and a mask correction unit 109.
  • The mask coefficient calculation unit 106, the gain calculation unit 108, and the mask correction unit 109 of the mask generation unit 205 according to the second embodiment are the same as those of the mask generation unit 105 according to the first embodiment.
  • The utterance amount ratio calculation unit 207 uses equation (13), a modification of equation (7) described in the first embodiment, to exclude noise signals from the calculation of the utterance amount ratio SR(τ).
  • As in the first embodiment, the direction of arrival of the target sound is determined by the sign of the imaginary part Q(ω, τ) of the cross spectrum D(ω, τ) of equation (1); in addition, by excluding from the utterance amount calculation those signals whose arrival time differences fall outside predetermined ranges, the influence of noise other than the target speaker and the interfering speaker can be removed.
  • Here, ±δDT and ±δDN are thresholds on the time difference of the observed analog signals for exclusion from the utterance amount calculation, and are predetermined constants obtained by converting direction-of-arrival angles into time differences.
  • ±δDT is a threshold for excluding from the utterance amount calculation cases where the arrival time difference of the observed analog signals is extremely small, so that it is difficult to determine whether the direction of arrival is the target-sound direction or the interfering-sound direction, or where noise is assumed to arrive from the front.
  • ±δDN is a threshold for excluding from the utterance amount calculation cases where the direction of arrival is highly likely to deviate from the expected directions of the target sound and the interfering sound, in other words, where the observed analog signal is highly likely to be directional noise such as wind noise entering through a window or music emitted from a loudspeaker.
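A sketch of the exclusion rule of equation (13); the parameter names and the sign convention are assumptions, and the thresholds are the predetermined constants described above.

```python
import numpy as np

def utterance_amounts_with_exclusion(X1, D, delta, delta_dt, delta_dn):
    """Equation (13) as described: bins whose arrival time difference
    |δ| is below δ_DT (direction ambiguous or frontal noise) or above
    δ_DN (outside the expected target/interferer directions) are excluded
    from the per-frame utterance amount sums."""
    P1 = X1.real**2 + X1.imag**2
    valid = (np.abs(delta) >= delta_dt) & (np.abs(delta) <= delta_dn)
    s_tgt = np.where(valid & (D.imag >= 0.0), P1, 0.0).sum(axis=-1)
    s_int = np.where(valid & (D.imag < 0.0), P1, 0.0).sum(axis=-1)
    return s_tgt, s_int
```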
  • FIG. 10 is a schematic diagram illustrating an example of how equation (13) excludes the influence of noise other than the target sound and the interfering sound.
  • In FIG. 10, the exclusion ranges are described with reference to the first channel Ch1.
  • With this configuration, the influence of noise other than the target sound and the interfering sound can be excluded, so the calculation accuracy of the utterance amount ratio improves and a sound source separation device of even higher quality can be configured.
  • Because the sound source separation device 200 is configured as described above, it can create a masking filter with high separation performance at low calculation cost even under various noise conditions. Since the target sound can be acquired accurately even under in-vehicle noise, it is possible to provide a high-accuracy voice recognition device, a high-quality hands-free call device, or an abnormal sound monitoring device that detects abnormal sounds in the vehicle.
  • Embodiment 3. In the first and second embodiments, only information of the current frame is used to calculate the utterance amount ratio, but the embodiments are not limited to this example; information of past frames can also be used in the calculation.
  • The sound source separation device 300 according to the third embodiment includes an A/D conversion unit 103, a T/F conversion unit 104, a mask generation unit 305, a masking filter unit 110, a T/F inverse conversion unit 111, and a D/A conversion unit 112.
  • The A/D conversion unit 103, the T/F conversion unit 104, the masking filter unit 110, the T/F inverse conversion unit 111, and the D/A conversion unit 112 of the sound source separation device 300 according to the third embodiment are the same as those of the sound source separation device 100 according to the first embodiment.
  • The mask generation unit 305 includes a mask coefficient calculation unit 106, an utterance amount ratio calculation unit 307, a gain calculation unit 108, and a mask correction unit 109.
  • The mask coefficient calculation unit 106, the gain calculation unit 108, and the mask correction unit 109 of the mask generation unit 305 according to the third embodiment are the same as those of the mask generation unit 105 according to the first embodiment.
  • The utterance amount ratio calculation unit 307 calculates the utterance amount ratio SR(τ) using equation (8), and then smooths it with the utterance amount ratio SR(τ−1) of the previous frame using the following equation (14), where α is a smoothing coefficient.
  • Because the newly calculated utterance amount ratio is smoothed with utterance amount ratios calculated in the past, the utterance amount ratio can be calculated stably even when noise is mixed into the observed analog signal, enabling more accurate sound source separation.
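Equation (14) is, as described, a first-order recursive smoothing; a minimal sketch follows, with the value of the smoothing coefficient α assumed.

```python
def smoothed_ratio(sr_now, sr_prev, alpha=0.9):
    """SR'(τ) = α·SR'(τ-1) + (1 - α)·SR(τ): smooth the current utterance
    amount ratio with the previous frame's smoothed value."""
    return alpha * sr_prev + (1.0 - alpha) * sr_now
```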
  • In the second embodiment, the utterance amount ratio calculation unit 207 calculates the utterance amount of each signal using equation (13). As a modification, extending this calculation over a predetermined frame section, in other words, calculating the integral value of the power spectrum over a predetermined frame section, makes it possible to analyze the occupancy of the target sound and the interfering sound in that section, specifically which speaker speaks longer or louder. It is thus possible to determine which voice is dominant during double talk of the target sound and the interfering sound, and sound source separation can be performed with higher accuracy.
  • The information processing system 250 can also be applied to a remote voice recognition system such as a smart speaker or a television installed in a home or an office, a voice call system of a video conference system, a voice recognition dialogue system of a robot, or an abnormal sound monitoring system in a factory. In such cases as well, the effects described in the second embodiment are obtained for the noise and acoustic echo arising in these acoustic environments.
  • In the examples described above, the frequency band of the input signal corresponds to 16 kHz sampling, but the first to third embodiments are not limited to this example and can also be applied to wider-band acoustic signals, such as 24 kHz.
  • Because the sound source separation devices 100 to 300 can perform high-quality sound source separation at low calculation cost, they can be introduced into any voice recognition system, voice call system, or abnormal sound monitoring system. As a result, it is possible to improve the recognition rate of remote voice recognition systems such as car navigation systems and televisions, and to improve the quality of hands-free call systems such as mobile phones and intercoms, video conference systems, and abnormal sound monitoring systems.
  • 100, 200, 300: sound source separation device; 101: first microphone; 102: second microphone; 103: A/D conversion unit; 104: T/F conversion unit; 105, 205, 305: mask generation unit; 106: mask coefficient calculation unit; 107, 207, 307: utterance amount ratio calculation unit; 108: gain calculation unit; 109: mask correction unit; 110: masking filter unit; 111: T/F inverse conversion unit; 112: D/A conversion unit; 250: information processing system.

Abstract

The present invention is provided with: a time/frequency conversion unit (104) which generates a first spectrum component and a second spectrum component by converting each of a first observation digital signal and a second observation digital signal, generated from observed sound, into a frequency domain signal; a mask generation unit (105) which, using a cross-correlation function of the first spectrum component and the second spectrum component and on the basis of a time difference between the time of arrival of the observed sound at a first microphone and the time of arrival of the observed sound at a second microphone, calculates a filtering coefficient for masking a spectrum component of sound arriving from a direction different from a first direction from which target sound arrives; and a masking filter unit (110) which separates the spectrum component by performing masking with respect to the first spectrum component using the filtering coefficient.

Description

Patent Document 1: JP 2010-239424 A
 本発明の1又は複数の態様によれば、高品質な目的信号を容易に得ることができる。 According to one or more aspects of the present invention, it is possible to easily obtain a high quality target signal.
FIG. 1 is a block diagram schematically showing the configuration of the sound source separation device according to Embodiments 1 and 3.
FIG. 2 is a block diagram schematically showing the internal configuration of the mask generation unit in Embodiments 1 to 3.
FIG. 3 is a schematic diagram for explaining the arrangement of the first microphone and the second microphone and the arrival direction of the target sound.
FIGS. 4(A) to 4(C) are graphs for explaining the utterance amount ratio when the target speaker and the interfering speaker speak.
FIGS. 5(A) and 5(B) are graphs for explaining the effects of Embodiment 1.
FIG. 6 is a block diagram showing a first example hardware configuration of the sound source separation device.
FIG. 7 is a block diagram showing a second example hardware configuration of the sound source separation device.
FIG. 8 is a flowchart showing the operation of the sound source separation device.
FIG. 9 is a block diagram schematically showing the configuration of an information processing system including the sound source separation device according to Embodiment 2.
FIG. 10 is a schematic diagram showing an example of a method of excluding the influence of noise other than the target sound and the interfering sound.
Embodiment 1.
FIG. 1 is a block diagram schematically showing the configuration of a sound source separation device 100 as an information processing device according to Embodiment 1.
The sound source separation device 100 includes an analog/digital conversion unit (hereinafter, A/D conversion unit) 103, a time/frequency conversion unit (hereinafter, T/F conversion unit) 104, a mask generation unit 105, a masking filter unit 110, a time/frequency inverse conversion unit (hereinafter, T/F inverse conversion unit) 111, and a digital/analog conversion unit (hereinafter, D/A conversion unit) 112.
The sound source separation device 100 is connected to a first microphone 101 and a second microphone 102.
FIG. 2 is a block diagram schematically showing the internal configuration of the mask generation unit 105.
The mask generation unit 105 includes a mask coefficient calculation unit 106, an utterance amount ratio calculation unit 107, a gain calculation unit 108, and a mask correction unit 109.
The configuration and operating principle of the sound source separation device 100 according to Embodiment 1 will now be described with reference to FIGS. 1 and 2. The sound source separation device 100 forms a masking filter based on frequency-domain signals generated from the time-domain signals acquired by the first microphone 101 and the second microphone 102, and multiplies the frequency-domain signal corresponding to the signal acquired by the first microphone 101 by this filter, thereby obtaining an output signal of the target sound from which the interfering sound has been removed.
Here, the first observed analog signal acquired by the first microphone 101 is also referred to as the first channel Ch1, and the second observed analog signal acquired by the second microphone 102 is also referred to as the second channel Ch2.
To simplify the following description, as shown in FIG. 3, the first microphone 101 and the second microphone 102 are assumed to lie in the same horizontal plane, and their positions are assumed to be known and unchanging over time. The ranges of directions from which the target sound and the interfering sound can arrive are likewise assumed not to change over time. The direction from which the target sound arrives is also referred to as the first direction, and the direction from which the interfering sound arrives is also referred to as the second direction.
Here, the target sound and the interfering sound are each assumed to be speech uttered by a different single speaker.
The first microphone 101 converts the observed sound into an electrical signal, thereby generating the first observed analog signal, which is supplied to the A/D conversion unit 103.
The second microphone 102 converts the observed sound into an electrical signal, thereby generating the second observed analog signal, which is supplied to the A/D conversion unit 103.
The A/D conversion unit 103 performs analog/digital conversion (hereinafter, A/D conversion) on each of the first observed analog signal supplied from the first microphone 101 and the second observed analog signal supplied from the second microphone 102, converting each into a digital signal and thereby generating a first observed digital signal and a second observed digital signal.
For example, the A/D conversion unit 103 samples the first observed analog signal supplied from the first microphone 101 at a predetermined sampling frequency and converts it into a digital signal divided into frames, thereby generating the first observed digital signal. Similarly, the A/D conversion unit 103 samples the second observed analog signal supplied from the second microphone 102 at the predetermined sampling frequency and converts it into a digital signal divided into frames, thereby generating the second observed digital signal. Here, the sampling frequency is, for example, 16 kHz, and the frame length is, for example, 16 ms.
The first observed digital signal generated from the first observed analog signal in the frame interval corresponding to sample number t is denoted x1(t), and the second observed digital signal generated from the second observed analog signal in the same frame interval is denoted x2(t).
The first observed digital signal x1(t) and the second observed digital signal x2(t) are supplied to the T/F conversion unit 104.
The T/F conversion unit 104 receives the first observed digital signal x1(t) and the second observed digital signal x2(t) and converts these time-domain signals into a first short-time spectral component X1(ω, τ) and a second short-time spectral component X2(ω, τ) in the frequency domain, where ω is the spectrum number, i.e., the discrete frequency, and τ is the frame number.
Specifically, the T/F conversion unit 104 generates the first short-time spectral component X1(ω, τ) by applying, for example, a 512-point fast Fourier transform to the first observed digital signal x1(t). Similarly, the T/F conversion unit 104 generates the second short-time spectral component X2(ω, τ) from the second observed digital signal x2(t).
In the following, unless otherwise noted, the short-time spectral components of the current frame are referred to simply as spectral components.
The mask generation unit 105 receives the first spectral component X1(ω, τ) and the second spectral component X2(ω, τ) and calculates the time-frequency filter coefficient bmod(ω, τ), a filtering coefficient for the masking that separates the target sound. For example, the mask generation unit 105 uses the cross-correlation function of X1(ω, τ) and X2(ω, τ) to calculate, from the time difference between the time at which the observed sound arrives at the first microphone 101 and the time at which it arrives at the second microphone 102, a filtering coefficient for masking spectral components of sound arriving from directions different from the first direction, from which the target sound arrives.
In deriving the time-frequency filter coefficient bmod(ω, τ), it is assumed, as shown in FIG. 3, that in the horizontal plane in which the first microphone 101 and the second microphone 102 are placed, the target sound arrives from a direction within a predetermined angle θ with respect to the perpendicular direction V1 of the first microphone 101 and the perpendicular direction V2 of the second microphone 102. The interfering sound is assumed to arrive from the side opposite to the target sound with respect to V1 and V2.
Here, V1 and V2 are perpendicular to the straight line connecting the first microphone 101 and the second microphone 102. Note that V1 and V2 are predetermined reference directions and need not necessarily be perpendicular.
The spacing between the first microphone 101 and the second microphone 102 is denoted d.
To determine whether the sound collected by the first microphone 101 and the second microphone 102 is the target sound or the interfering sound, it is necessary to estimate, using the signals from the two microphones, whether the sound arrives from the desired range of directions. Since the time difference between the signals from the first microphone 101 and the second microphone 102 is determined by the angle θ, the arrival direction can be estimated by using this time difference. This is explained below with reference to FIGS. 2 and 3.
The mask coefficient calculation unit 106 first calculates the cross spectrum D(ω, τ) from the cross-correlation function of the first spectral component X1(ω, τ) and the second spectral component X2(ω, τ), as in Equation (1), and supplies the calculated cross spectrum D(ω, τ) to the utterance amount ratio calculation unit 107:

D(ω, τ) = X1(ω, τ) X2*(ω, τ)    (1)

where * denotes the complex conjugate.
Next, the mask coefficient calculation unit 106 obtains the phase ΘD(ω, τ) of the cross spectrum D(ω, τ) using Equation (2):

ΘD(ω, τ) = tan⁻¹( Q(ω, τ) / K(ω, τ) )    (2)

where Q(ω, τ) and K(ω, τ) denote the imaginary part and the real part of the cross spectrum D(ω, τ), respectively.
The phase ΘD(ω, τ) obtained from Equation (2) is the phase angle of each spectral component between the first channel Ch1 and the second channel Ch2, and dividing it by the discrete frequency ω gives the time delay between the two signals. That is, the time difference δ(ω, τ) between the first channel Ch1 and the second channel Ch2 can be expressed as in Equation (3):

δ(ω, τ) = ΘD(ω, τ) / ω    (3)
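The chain of Equations (1) to (3) can be sketched as follows. This is a minimal illustration: the orientation of the cross spectrum (Ch1 times the conjugate of Ch2) and the conversion of the spectrum number into an angular frequency are assumptions not fixed by the text.

```python
import numpy as np

def time_difference(X1: np.ndarray, X2: np.ndarray,
                    fs: int = 16000, nfft: int = 512) -> np.ndarray:
    """Per-bin time difference delta(omega, tau) of Ch1 vs. Ch2, Eqs. (1)-(3)."""
    D = X1 * np.conj(X2)                    # Eq. (1): cross spectrum
    theta = np.arctan2(D.imag, D.real)      # Eq. (2): phase Theta_D from Q and K
    k = np.arange(1, len(X1) + 1)           # start at 1 to avoid division by zero at DC
    omega = 2.0 * np.pi * k * fs / nfft     # bin index -> angular frequency (assumed)
    return theta / omega                    # Eq. (3): time delay per bin
```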
Next, the theoretical time difference δθ observed when sound arrives from the direction at angle θ can be expressed, using the spacing d, as in Equation (4), where c is the speed of sound:

δθ = (d / c) sin θ    (4)

Here, if the set of angles θ satisfying θ > θth is taken as the desired direction range, then by comparing the theoretical time difference δθth with the time difference δ(ω, τ) of the observed signals, it can be estimated whether the sound arrives from the desired direction range.
The mask coefficient b(ω, τ) for the masking that separates the target sound can therefore be expressed as in Equation (5):

b(ω, τ) = 1 if δ(ω, τ) ≥ δθth, and b(ω, τ) = M otherwise    (5)
In other words, the mask coefficient calculation unit 106 uses the cross-correlation function of the first spectral component X1(ω, τ) and the second spectral component X2(ω, τ) to calculate, from the first time difference between the times at which the target sound arrives at the first microphone 101 and at the second microphone 102 and the second time difference between the times at which the interfering sound arrives at the first microphone 101 and at the second microphone 102, a mask coefficient that distinguishes, within the observed sound, sound arriving from a first range including the first direction from which the target sound arrives from sound arriving from a second range including the second direction from which the interfering sound arrives, and thereby separates the spectral components of sound arriving from directions in the first range from those of sound arriving from directions in the second range.
The mask coefficient b(ω, τ) given by Equation (5) is 1 when the component is estimated to be the target sound and M when it is estimated to be the interfering sound. When M = 0, the mask coefficient takes the binary values 1 or 0, and a filter with such coefficients is called a binary mask. Fractional values other than these two may also be used as filter coefficients, in which case the filter is also called a soft mask; in that case the filter coefficients for both the target sound and the interfering sound take values less than 1. In this embodiment, M = 0.5 is used as an example.
The mask coefficient calculation unit 106 supplies the mask coefficient b(ω, τ) to the mask correction unit 109.
The utterance amount ratio calculation unit 107 receives the first spectral component X1(ω, τ) of the first channel Ch1, the second spectral component X2(ω, τ) of the second channel Ch2, and the cross spectrum D(ω, τ), and calculates the utterance amount ratio, i.e., the ratio between the utterance amount of the target speaker and that of the interfering speaker. In other words, the utterance amount ratio is the ratio of the amount of spectral components of X1(ω, τ) arriving from the first range, which includes the first direction from which the target sound arrives, to the amount of spectral components arriving from the second range, which includes the second direction from which the interfering sound arrives.
First, the utterance amount ratio calculation unit 107 obtains the first power spectrum P1(ω, τ) of the first channel Ch1 from the first spectral component X1(ω, τ) using Equation (6):

P1(ω, τ) = XRe(ω, τ)² + XIm(ω, τ)²    (6)

where XRe and XIm are the real part and the imaginary part of the first spectral component X1(ω, τ), respectively.
Next, the utterance amount ratio calculation unit 107 determines, from the sign of the imaginary part Q(ω, τ) of the cross spectrum D(ω, τ) in Equation (1), whether the observed sound component arrives from the target-sound side or from the interfering-sound side. Then, as in Equation (7), it sums the first power spectrum P1(ω, τ) of the first channel Ch1 according to this sign decision to obtain the utterance amount sTgt(τ) of the target speaker and the utterance amount sInt(τ) of the interfering speaker:

sTgt(τ) = Σω P1(ω, τ) over the bins judged to be on the target-sound side,
sInt(τ) = Σω P1(ω, τ) over the bins judged to be on the interfering-sound side    (7)

where the sums run over the N discrete frequency spectra, e.g., N = 256.
The utterance amount ratio calculation unit 107 then obtains the utterance amount ratio SR(τ) from the two utterance amounts sTgt(τ) and sInt(τ) using Equation (8):

SR(τ) = sTgt(τ) / (sTgt(τ) + sInt(τ))    (8)
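Equations (6) to (8) for one frame can be sketched as follows. Which sign of Q(ω, τ) corresponds to the target-sound side depends on the microphone arrangement, so the convention used here is an assumption.

```python
import numpy as np

def utterance_ratio(X1: np.ndarray, D: np.ndarray) -> float:
    """Eqs. (6)-(8): utterance amount ratio SR(tau) for one frame."""
    P1 = X1.real**2 + X1.imag**2            # Eq. (6): power spectrum of Ch1
    target_side = D.imag >= 0.0             # side decided by the sign of Q (assumed convention)
    s_tgt = P1[target_side].sum()           # Eq. (7): target-speaker utterance amount
    s_int = P1[~target_side].sum()          #          interfering-speaker utterance amount
    return float(s_tgt / (s_tgt + s_int + 1e-12))  # Eq. (8), guarded against division by zero
```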
FIGS. 4(A) to 4(C) are graphs for explaining the utterance amount ratio SR(τ) when the target speaker and the interfering speaker speak.
FIG. 4(A) is a graph showing an example of the time waveform of the observed analog signal acquired by the first microphone 101.
FIG. 4(B) is a graph showing an example of the temporal variation of the utterance amounts of the target speaker and the interfering speaker.
FIG. 4(C) is a graph showing an example of the temporal variation of the utterance amount ratio SR(τ) obtained from the utterance amounts of the target speaker and the interfering speaker.
As shown in FIG. 4(C), a frame satisfying SR(τ) < 0.3 is highly likely to contain only the interfering sound, whereas a frame satisfying SR(τ) > 0.5 is highly likely to contain only the target sound.
When 0.3 ≤ SR(τ) ≤ 0.5, both the target sound and the interfering sound can be regarded as present.
Therefore, by using the utterance amount ratio SR(τ) obtained from Equation (8) to control the masking strength according to the state of the observed signal, the target sound can be separated with high accuracy and little distortion. More specifically, in a frame with a small utterance amount ratio SR(τ), the masking filter can be controlled to suppress the interfering sound strongly and thereby enhance the separation performance, whereas in a frame with a large SR(τ), the masking can be weakened to reduce the distortion of the target sound.
Returning to FIG. 2, the gain calculation unit 108 uses the utterance amount ratio SR(τ) obtained from Equation (8) to calculate a correction gain g(ω, τ) that corrects the constant M in the mask coefficient b(ω, τ) of Equation (5), as in Equation (9):

g(ω, τ) = GTgt K(ω) if SR(τ) > 0.5; GDT K(ω) if 0.3 ≤ SR(τ) ≤ 0.5; GInt K(ω) if SR(τ) < 0.3    (9)

Here, GTgt, GInt, and GDT are predetermined correction gain constants: GTgt applies when the observed signal is highly likely to contain only the target sound, GInt when it is highly likely to contain only the interfering sound, and GDT when it is highly likely to contain both the target sound and the interfering sound. In this embodiment, GTgt = 1.5, GDT = 0.99, and GInt = 0.01 are given as a suitable example.
When the target sound is likely, M in Equation (5) is controlled to become larger, in other words, so that the suppression amount of the mask becomes smaller; the corrected M is, however, limited to a value of 1 or less.
Conversely, when the interfering sound is likely, M in Equation (5) is controlled to become even smaller, in other words, so that the suppression amount for the interfering sound becomes even larger.
That is, the gain calculation unit 108 calculates the correction gain so that the masking strength decreases as the utterance amount ratio increases.
Calculating this correction gain requires only the utterance amounts obtained from a simple power calculation of the observed signals and conditional expressions comparing the utterance amount ratio with thresholds, so the computational cost is low and the mask coefficients can be corrected efficiently.
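The gain selection of Equation (9) reduces to a few comparisons per frame, as the following sketch shows. Since the exact form of K(ω) is not given, a linear ramp is used as an assumed stand-in; the thresholds 0.3 and 0.5 follow the discussion of FIG. 4(C).

```python
import numpy as np

G_TGT, G_DT, G_INT = 1.5, 0.99, 0.01   # correction gain constants from the text

def correction_gain(sr: float, n_bins: int = 256) -> np.ndarray:
    """Eq. (9): per-bin correction gain g(omega, tau) for one frame."""
    k = np.linspace(0.5, 1.0, n_bins)   # assumed stand-in for K(omega), growing with frequency
    if sr > 0.5:
        g = G_TGT                       # target sound only is likely
    elif sr < 0.3:
        g = G_INT                       # interfering sound only is likely
    else:
        g = G_DT                        # both are likely present
    return g * k
```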
K(ω) is a frequency correction coefficient expressed as a positive number not exceeding 1 and, as indicated by Equation (10), it is set so that its value increases as the frequency increases.

(10) [equation defining the frequency correction coefficient K(ω)]

By performing frequency correction with K(ω), the masking strength at high frequencies is relaxed, so distortion of the target sound caused by masking can be suppressed.
Although the frequency correction coefficient of Equation (10) is set so that its value increases with frequency, it is not limited to this example and can be changed as appropriate according to the characteristics of the observed signal. For example, when the acoustic signal targeted for source separation is speech, the correction may weaken the suppression of formants, which are important frequency band components of speech, while strengthening the suppression of the other band components. This improves the accuracy of mask control for speech, so the target sound can be separated efficiently.
If the target of source separation is an abnormal machine sound, the abnormal sound can be separated efficiently by changing the frequency correction coefficient of Equation (10) according to the frequency characteristics of that acoustic signal.
A further effect of this frequency-dependent correction is that, when environmental noise is mixed into the observation, the influence of masking on acoustic signals other than the intended speech or abnormal sound (for example, noise or music) is reduced. Unpleasant artificial noise (musical tones) caused by unnecessary masking of environmental noise therefore decreases, malfunctions of a speech recognition device or abnormal-sound monitoring device due to artificial noise are reduced, and, as a secondary effect, unpleasant noise during hands-free calls is also reduced.
The constant values of the correction gain and the constant thresholds for the utterance amount ratio SR(τ) described above are not limited to those of Equation (9) and can be adjusted as appropriate to the characteristics of the target sound or the interfering sound. The conditions that determine the correction gain are also not limited to three stages as in Equation (9) and may be set in more stages.
As in Equation (11), the mask correction unit 109 corrects the mask coefficient b(ω, τ) obtained from Equation (5) using the correction gain g(ω, τ) obtained from Equation (9) to obtain the time-frequency filter coefficient bmod(ω, τ), the result being limited so as not to exceed 1:

bmod(ω, τ) = min{1, g(ω, τ) b(ω, τ)}    (11)

Returning to FIG. 1, the masking filter unit 110 multiplies the first spectral component X1(ω, τ) on the first microphone 101 side by the time-frequency filter coefficient bmod(ω, τ) obtained from Equation (11), as in Equation (12), to calculate the spectral component Y(ω, τ), and sends the calculated Y(ω, τ) to the T/F inverse conversion unit 111. The spectral component Y(ω, τ) separated here is also referred to as the target spectral component, i.e., the spectral component containing the target sound.

Y(ω, τ) = bmod(ω, τ) X1(ω, τ)    (12)
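A minimal sketch of the correction and filtering of Equations (11) and (12) follows; the cap at 1 reflects the statement that the corrected coefficient is limited to a value of 1 or less.

```python
import numpy as np

def apply_mask(X1: np.ndarray, b: np.ndarray, g: np.ndarray) -> np.ndarray:
    """Eqs. (11)-(12): correct the mask and filter the Ch1 spectral components."""
    b_mod = np.minimum(1.0, g * b)   # Eq. (11), capped at 1 as stated in the text
    return b_mod * X1                # Eq. (12): target spectral component Y(omega, tau)
```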
The T/F inverse conversion unit 111 applies, for example, an inverse fast Fourier transform to the spectral component Y(ω, τ) to calculate the output digital signal y(t), and supplies y(t) to the D/A conversion unit 112.
The D/A conversion unit 112 generates an output signal by converting the output digital signal y(t) into an analog signal. The generated output signal is output to an external device such as a speech recognition device, a hands-free call device, or an abnormal-sound monitoring device.
FIGS. 5(A) and 5(B) are graphs for explaining the effects of Embodiment 1.
FIG. 5(A), like FIG. 4(A), is a graph showing an example of the time waveform of the observed analog signal acquired by the first microphone 101.
FIG. 5(B) is a graph showing an example of the temporal variation of the output signal output from the D/A conversion unit 112.
As is clear from FIGS. 5(A) and 5(B), the interfering sound is almost entirely removed from the output signal and only the target sound is separated.
The hardware of the sound source separation device 100 described above can be realized by a computer with a built-in CPU (Central Processing Unit), such as a tablet-type portable computer or a microcomputer for embedded use in equipment such as a car navigation system. Alternatively, it may be realized by an LSI (Large Scale Integrated circuit) such as a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), or an FPGA (Field-Programmable Gate Array).
FIG. 6 is a block diagram showing an example hardware configuration of the sound source separation device 100 implemented with an LSI such as a DSP, ASIC, or FPGA.
In the example of FIG. 6, the sound source separation device 100 comprises a signal input/output unit 131, a signal processing circuit 132, a recording medium 133, and a signal path 134 such as a bus.
The signal input/output unit 131 is an interface circuit that provides the connection to the microphone circuit 140 and the external device 141. The microphone circuit 140 corresponds to the first microphone 101 and the second microphone 102; for example, a device that captures acoustic vibration and converts it into an electrical signal can be used.
The functions of the T/F conversion unit 104, the mask generation unit 105, the masking filter unit 110, and the T/F inverse conversion unit 111 shown in FIG. 1 can be realized by the signal processing circuit 132 and the recording medium 133.
The A/D conversion unit 103 and the D/A conversion unit 112 of FIG. 1 can be realized by the signal input/output unit 131.
The recording medium 133 is used to store various data such as the various settings of the signal processing circuit 132 and signal data. As the recording medium 133, a volatile memory such as SDRAM (Synchronous Dynamic Random Access Memory) or a non-volatile memory such as an HDD (Hard Disk Drive) or SSD (Solid State Drive) can be used, for example. It can hold the initial state of the source separation processing, various setting data, constant data for control, and the like.
The output digital signal produced by the source separation processing in the signal processing circuit 132 is sent from the signal input/output unit 131 to the external device 141, which corresponds to, for example, a speech recognition device, a hands-free call device, or an abnormal-sound monitoring device.
FIG. 7 is a block diagram showing an example hardware configuration of the sound source separation device 100 implemented with a computing device such as a computer.
In the example of FIG. 7, the sound source separation device 100 comprises a signal input/output unit 131, a processor 136 incorporating a CPU 135, a memory 137, a recording medium 138, and a signal path 134 such as a bus.
The signal input/output unit 131 is an interface circuit that provides the connection to the microphone circuit 140 and the external device 141.
The memory 137 is storage means such as ROM (Read Only Memory) and RAM (Random Access Memory) used as program memory for storing the various programs that implement the source separation processing, as working memory used when the processor 136 processes data, and as memory into which signal data is expanded.
The functions of the T/F conversion unit 104, the mask generation unit 105, the masking filter unit 110, and the T/F inverse conversion unit 111 can be realized by the processor 136, the memory 137, and the recording medium 138.
The A/D conversion unit 103 and the D/A conversion unit 112 can be realized by the signal input/output unit 131.
The recording medium 138 is used to store various data such as the various settings of the processor 136 and signal data. As the recording medium 138, a volatile memory such as SDRAM or a non-volatile memory such as an HDD or SSD can be used, for example. It can store programs including the OS (Operating System), various setting data, and various data such as acoustic signal data; the data in the memory 137 can also be kept in the recording medium 138.
The processor 136 uses the memory 137 as working memory and, by operating according to the computer program read from the memory 137, can function as the T/F conversion unit 104, the mask generation unit 105, the masking filter unit 110, and the T/F inverse conversion unit 111.
The output signal generated by the source separation processing performed by the processor 136 is sent from the signal input/output unit 131 to the external device 141, which corresponds to, for example, a speech recognition device, a hands-free call device, or an abnormal-sound monitoring device.
The program executed by the processor 136 may be stored in a storage device inside the computer that executes the software, or may be distributed on a storage medium such as a CD-ROM. The program can also be obtained from another computer through a wireless or wired network such as a LAN (Local Area Network). Such a program may be provided, for example, as a program product.
Furthermore, the microphone circuit 140 and the external device 141 may also exchange various data as digital signals over a wireless or wired network, without conversion between analog and digital signals.
The program executed by the processor 136 may also be combined in software with a program executed by the external device 141, for example a program executed to cause a computer to function as a speech recognition device, a hands-free call device, or an abnormal-sound monitoring device; the combined programs may run on the same computer or may run distributed over a plurality of computers.
Note that the external device 141 may include the sound source separation device 100. That is, a speech recognition device, a hands-free call device, or an abnormal-sound monitoring device may be configured to incorporate the sound source separation device 100.
Next, the operation of the sound source separation device 100 according to Embodiment 1 will be described.
FIG. 8 is a flowchart showing the operation of the sound source separation device 100.
First, the A/D conversion unit 103 takes in, at predetermined frame intervals, the first observed analog signal and the second observed analog signal input from the first microphone 101 and the second microphone 102, A/D-converts each of them to generate the first observed digital signal x1(t) and the second observed digital signal x2(t), and supplies these to the T/F conversion unit 104 (S10).
The output from the A/D conversion unit 103 is repeated while the sample number t is smaller than a predetermined value T (No in S11).
In step S12, the T/F conversion unit 104 applies, for example, a 512-point fast Fourier transform to each of the first observed digital signal x1(t) and the second observed digital signal x2(t) to calculate the first spectral component X1(ω, τ) and the second spectral component X2(ω, τ). The T/F conversion unit 104 supplies X1(ω, τ) and X2(ω, τ) to the mask generation unit 105 and supplies X1(ω, τ) to the masking filter unit 110.
The mask generation unit 105 calculates, from the first spectral component X1(ω, τ) and the second spectral component X2(ω, τ), the time-frequency filter coefficient bmod(ω, τ) for the masking that separates the target sound (S13). The detailed processing of step S13 is described below as steps S13A to S13D.
In step S13A, the mask coefficient calculation unit 106 calculates the cross spectrum D(ω, τ) from the cross-correlation function of the first spectral component X1(ω, τ) and the second spectral component X2(ω, τ), and calculates the mask coefficient b(ω, τ) based on the obtained cross spectrum. It supplies D(ω, τ) to the utterance amount ratio calculation unit 107 and b(ω, τ) to the mask correction unit 109, and the processing proceeds to step S13B.
In step S13B, the utterance amount ratio calculation unit 107 calculates, from the first spectral component X1(ω, τ), the second spectral component X2(ω, τ), and the cross spectrum D(ω, τ), the utterance amount ratio SR(τ), i.e., the ratio between the utterance amount of the target speaker and that of the interfering speaker, and supplies SR(τ) to the gain calculation unit 108. The processing then proceeds to step S13C.
In step S13C, the gain calculation unit 108 uses the utterance amount ratio SR(τ) to calculate the correction gain g(ω, τ) for correcting the mask coefficient b(ω, τ), and supplies g(ω, τ) to the mask correction unit 109. The processing then proceeds to step S13D.
In step S13D, the mask correction unit 109 corrects the mask coefficient b(ω, τ) using the correction gain g(ω, τ) to obtain the time-frequency filter coefficient bmod(ω, τ), which it supplies to the masking filter unit 110.
The masking filter unit 110 multiplies the first spectral component X1(ω, τ) by the time-frequency filter coefficient bmod(ω, τ) to calculate the spectral component Y(ω, τ) of the output digital signal y(t) (S14), and supplies Y(ω, τ) to the T/F inverse conversion unit 111.
The T/F inverse conversion unit 111 converts the spectral component Y(ω, τ) into the time-domain output digital signal y(t) by applying an inverse fast Fourier transform (S15).
The D/A conversion unit 112 D/A-converts the output digital signal y(t) into an output signal, which is an analog signal, and outputs it to the outside (S16).
The output from the D/A conversion unit 112 is repeated while the sample number t is smaller than the predetermined value T (Yes in S17).
Next, if the source separation processing is to be continued (Yes in S18), the processing returns to step S10; if not (No in S18), the source separation processing ends.
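Tying the sketches above together, one pass of steps S12 to S15 might look like the following; tf_convert, time_difference, mask_coefficients, utterance_ratio, correction_gain, and apply_mask are the illustrative helpers defined earlier, and the inverse transform is simplified (no windowed overlap-add reconstruction).

```python
import numpy as np

def process_frame(frame1: np.ndarray, frame2: np.ndarray) -> np.ndarray:
    """One pass of steps S12-S15, using the illustrative helpers sketched above."""
    X1 = tf_convert(frame1)              # S12: T/F conversion of Ch1
    X2 = tf_convert(frame2)              #      and of Ch2
    D = X1 * np.conj(X2)                 # S13A: cross spectrum, Eq. (1)
    delta = time_difference(X1, X2)      #        per-bin time difference, Eqs. (2)-(3)
    b = mask_coefficients(delta)         #        mask coefficients, Eqs. (4)-(5)
    sr = utterance_ratio(X1, D)          # S13B: utterance amount ratio, Eqs. (6)-(8)
    g = correction_gain(sr)              # S13C: correction gain, Eq. (9)
    Y = apply_mask(X1, b, g)             # S13D-S14: Eqs. (11)-(12)
    y = np.fft.irfft(np.append(Y, 0.0), 512)   # S15: T/F inverse conversion (simplified)
    return y[:len(frame1)]
```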
As described above, the sound source separation device 100 of Embodiment 1 can create a masking filter with high separation performance at low computational cost. The target sound can therefore be acquired accurately, making it possible to provide a highly accurate speech recognition device, a high-quality hands-free call device, and an abnormal-sound monitoring device with high detection accuracy.
Embodiment 2.
Embodiment 1 illustrated a configuration for speech. Embodiment 2 describes an embodiment that can also be applied when noise other than the interfering speech is present.
FIG. 9 is a block diagram schematically showing the configuration of an information processing system 250 including a sound source separation device 200 according to Embodiment 2. The information processing system 250 shown here is an example of a car navigation system and illustrates the case where a speaker seated in the driver's seat and a speaker seated in the passenger seat talk inside a moving automobile. In Embodiment 2, the speaker seated in the driver's seat is treated as the target speaker and the speaker seated in the passenger seat as the interfering speaker.
As shown in FIG. 9, the information processing system 250 includes the first microphone 101, the second microphone 102, the sound source separation device 200, and the external device 141.
The first microphone 101 and the second microphone 102 in Embodiment 2 are the same as those in Embodiment 1, and the external device 141 is the same as the external device 141 described with reference to FIG. 6 or FIG. 7.
The inputs in Embodiment 2 include, in addition to the voices of the target speaker and the interfering speaker captured through the first microphone 101 and the second microphone 102, noise such as vehicle running noise, the received voice of the far-end speaker emitted from the loudspeaker during a hands-free call, guidance speech emitted by the car navigation system, and acoustic echo from car audio music and the like wrapping around into the microphones. Sounds other than the voices of the target speaker and the interfering speaker are treated as noise, and their signals as noise signals. In Embodiment 2, the influence of such noise is excluded by calculating the utterance amount ratio while excluding the spectral components of sound arriving from directions contained neither in the first range, which includes the first direction from which the target sound arrives, nor in the second range, which includes the second direction from which the interfering sound arrives.
The external device 141 is, for example, a speech recognition device, a hands-free call device, or an abnormal-sound monitoring device, as described above. The external device 141 performs, for example, speech recognition processing, hands-free call processing, or abnormal-sound detection processing, and obtains the output result of that processing.
 音源分離装置200は、A/D変換部103と、T/F変換部104と、マスク生成部205と、マスキングフィルタ部110と、T/F逆変換部111とを備える。
 実施の形態2に係る音源分離装置200のA/D変換部103、T/F変換部104、マスキングフィルタ部110及びT/F逆変換部111は、実施の形態1の音源分離装置100のA/D変換部103、T/F変換部104、マスキングフィルタ部110及びT/F逆変換部111と同様である。
 但し、実施の形態2に係る音源分離装置200では、T/F逆変換部111で生成された出力デジタル信号y(t)が外部装置141に与えられる。
The sound source separation device 200 includes an A/D conversion unit 103, a T/F conversion unit 104, a mask generation unit 205, a masking filter unit 110, and a T/F inverse conversion unit 111.
The A/D conversion unit 103, the T/F conversion unit 104, the masking filter unit 110, and the T/F inverse conversion unit 111 of the sound source separation device 200 according to the second embodiment are A of the sound source separation device 100 of the first embodiment. It is the same as the /D conversion unit 103, the T/F conversion unit 104, the masking filter unit 110, and the T/F inverse conversion unit 111.
However, in the sound source separation device 200 according to the second embodiment, the output digital signal y(t) generated by the T/F inverse conversion unit 111 is given to the external device 141.
As shown in FIG. 2, the mask generation unit 205 includes a mask coefficient calculation unit 106, an utterance amount ratio calculation unit 207, a gain calculation unit 108, and a mask correction unit 109.
The mask coefficient calculation unit 106, the gain calculation unit 108, and the mask correction unit 109 of the mask generation unit 205 in the second embodiment are the same as those of the mask generation unit 105 in the first embodiment.
The utterance amount ratio calculation unit 207 excludes noise signals from the calculation of the utterance amount ratio SR(τ) by using Equation (13), a modification of Equation (7) described in the first embodiment.
In the first embodiment, the arrival direction of the target sound is determined from the sign of the imaginary part Q(ω, τ) of the cross spectrum D(ω, τ) in Equation (1). In Equation (13), the conditional expression additionally incorporates the time difference δ(ω, τ) between the first channel Ch1 and the second channel Ch2, which corresponds to the angle of the arrival direction; combining the two conditions makes it possible to exclude, from the calculation of the utterance amount, the influence of noise other than the voices of the target-sound speaker and the interfering-sound speaker.
[Equation (13) appears here as an image in the source. From the surrounding description, it restricts the utterance amount ratio of Equation (7) to spectral components whose inter-channel time difference satisfies δθDT ≤ |δ(ω, τ)| ≤ δθDN.]
Here, δθDT and δθDN are thresholds on the time difference of the observation analog signals used to exclude components from the calculation of the utterance amount; they are predetermined constants obtained by converting arrival direction angles into time differences.
δθDT is a threshold for excluding the case in which the arrival time difference of the observation analog signals is so small that it is difficult to determine whether the arrival direction is the target sound direction or the interfering sound direction, or the case in which noise arrives from the front direction.
δθDN is a threshold for excluding the case in which the sound very likely deviates from the assumed arrival directions of the target sound and the interfering sound, in other words, the case in which the observation analog signal is very likely directional noise such as wind noise entering through a window, or music and the like emitted from a loudspeaker.
FIG. 10 is a schematic diagram showing an example of the method of excluding the influence of noise other than the target sound and the interfering sound in Equation (13).
In the example of FIG. 10, the exclusion ranges are drawn with reference to the first channel Ch1.
As shown in FIG. 10, by setting exclusion ranges in the calculation of the utterance amount, the influence of noise other than the target sound and the interfering sound can be excluded, so that the calculation accuracy of the utterance amount ratio improves and a sound source separation device of even higher quality can be configured.
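The following Python sketch is a minimal, non-normative illustration of this exclusion: it forms an utterance amount ratio from one frame of two-channel spectra while discarding time-frequency bins whose inter-channel time difference δ(ω, τ) falls outside [δθDT, δθDN]. Estimating δ(ω, τ) from the cross-spectrum phase, the function name, and the threshold values are assumptions made for illustration; the patent's exact Equation (13) is not reproduced here.

```python
import numpy as np

def utterance_ratio(X1, X2, freqs, d_theta_dt=1e-5, d_theta_dn=4e-4, eps=1e-12):
    """Illustrative utterance amount ratio with direction-based exclusion.

    X1, X2 : complex STFT spectra of channels Ch1 and Ch2 for one frame
    freqs  : center frequency of each bin in Hz
    """
    D = X1 * np.conj(X2)          # cross spectrum D(omega, tau)
    Q = D.imag                    # sign of Q indicates the arrival side
    # Per-bin inter-channel time difference estimated from the phase of the
    # cross spectrum: delta = angle(D) / (2 * pi * f). The DC bin drops out.
    with np.errstate(divide="ignore", invalid="ignore"):
        delta = np.angle(D) / (2.0 * np.pi * freqs)
    keep = (np.isfinite(delta)
            & (np.abs(delta) >= d_theta_dt)    # drop ambiguous frontal bins
            & (np.abs(delta) <= d_theta_dn))   # drop out-of-range directional noise

    power = np.abs(X1) ** 2                    # power spectrum of channel Ch1
    target_pow = power[keep & (Q > 0)].sum()   # bins attributed to the target
    interf_pow = power[keep & (Q < 0)].sum()   # bins attributed to the interferer
    return target_pow / (interf_pow + eps)
```

In practice the two thresholds would be derived from the microphone spacing and the assumed angular ranges of the target-sound and interfering-sound speakers, which is what the conversion from arrival direction angle to time difference described above amounts to.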
Because the sound source separation device 200 according to the second embodiment is configured as described above, it can create a masking filter with high separation performance at low computational cost even under various noise conditions. The target sound can therefore be acquired accurately even under in-vehicle noise, which makes it possible to provide a highly accurate voice recognition device, a high-quality hands-free call device, or an abnormal sound monitoring device that detects abnormal sounds inside a vehicle.
Embodiment 3.
In the first and second embodiments, only the information of the current frame is used in calculating the utterance amount ratio. However, the embodiments are not limited to such an example; the calculation can also make use of information from past frames.
As shown in FIG. 1, a sound source separation device 300 according to the third embodiment includes an A/D conversion unit 103, a T/F conversion unit 104, a mask generation unit 305, a masking filter unit 110, a T/F inverse conversion unit 111, and a D/A conversion unit 112.
The A/D conversion unit 103, the T/F conversion unit 104, the masking filter unit 110, the T/F inverse conversion unit 111, and the D/A conversion unit 112 of the sound source separation device 300 according to the third embodiment are the same as those of the sound source separation device 100 according to the first embodiment.
As shown in FIG. 2, the mask generation unit 305 in the third embodiment includes a mask coefficient calculation unit 106, an utterance amount ratio calculation unit 307, a gain calculation unit 108, and a mask correction unit 109.
The mask coefficient calculation unit 106, the gain calculation unit 108, and the mask correction unit 109 of the mask generation unit 305 in the third embodiment are the same as those of the mask generation unit 105 in the first embodiment.
The utterance amount ratio calculation unit 307 calculates the utterance amount ratio SR(τ) using Equation (8) above, and further smooths the calculated SR(τ) with the utterance amount ratio SR(τ−1) of the previous frame, using Equation (14) below.
[Equation (14) appears here as an image in the source. From the surrounding description, it is a first-order recursive smoothing of the form SR(τ) ← α · SR(τ−1) + (1 − α) · SR(τ).]
Here, α is a smoothing coefficient; in the third embodiment, α = 0.9 is a suitable example.
By thus smoothing the most recently calculated utterance amount ratio with utterance amount ratios calculated in the past, the utterance amount ratio can be calculated stably even when noise is mixed into the observation analog signals, enabling sound source separation with even higher accuracy.
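A minimal sketch of this recursion, assuming the first-order form given above for Equation (14) (the class and parameter names are illustrative, not taken from the patent):

```python
class SmoothedUtteranceRatio:
    """First-order recursive smoothing of the utterance amount ratio SR(tau)."""

    def __init__(self, alpha=0.9, initial=1.0):
        self.alpha = alpha    # smoothing coefficient (0.9 per the text)
        self.sr = initial     # last smoothed value, SR(tau - 1)

    def update(self, sr_now):
        """Blend the current frame's raw ratio into the running estimate."""
        self.sr = self.alpha * self.sr + (1.0 - self.alpha) * sr_now
        return self.sr
```

With α = 0.9, a single noise-corrupted frame shifts the smoothed ratio by only a tenth of its raw deviation, which is the stabilizing behavior described above.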
Further, in the second embodiment, the utterance amount ratio calculation unit 207 calculates the utterance amount of each signal using Equation (13). As a modification, the utterance amount ratio calculation unit 207 can extend this calculation to a predetermined frame section, in other words, calculate the integral of the power spectrum over a predetermined frame section. This makes it possible to analyze the occupancy of the target sound and the interfering sound within that frame section, specifically, which speaker has been speaking longer or which is louder. It thus becomes possible to determine which voice is dominant during double talk between the target sound and the interfering sound, enabling even more accurate sound source separation; a sketch of this windowed variant follows.
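As a hedged illustration of the frame-section extension (the window length and the running-sum bookkeeping are assumptions; the description fixes neither):

```python
from collections import deque

class WindowedOccupancy:
    """Integrate per-frame target/interferer power over a frame section."""

    def __init__(self, n_frames=50):
        self.frames = deque(maxlen=n_frames)   # (target_pow, interf_pow) per frame

    def update(self, target_pow, interf_pow):
        self.frames.append((target_pow, interf_pow))
        total_t = sum(t for t, _ in self.frames)   # integral of target power
        total_i = sum(i for _, i in self.frames)   # integral of interferer power
        # Dominant side during double talk: whichever integral is larger.
        dominant = "target" if total_t >= total_i else "interferer"
        return total_t / (total_i + 1e-12), dominant
```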
In the second embodiment described above, the case in which the information processing system 250 is an example of a car navigation system has been described, but the second embodiment is not limited to this. For example, the information processing system 250 is also applicable to a remote voice recognition system such as a smart speaker or a television installed in a home or an office, a loudspeaker call system of a video conference system, a voice recognition dialogue system of a robot, or an abnormal sound monitoring system of a factory. In such cases as well, the effects described in the second embodiment are similarly obtained with respect to the noise or acoustic echo arising in those acoustic environments.
In the first to third embodiments described above, the frequency bandwidth of the input signal is 16 kHz, but the first to third embodiments are not limited to such an example. For example, the first to third embodiments are also applicable to acoustic signals of a wider band, such as 24 kHz.
Besides the above, in the first to third embodiments, any constituent element can be modified, or any constituent element can be omitted.
As described above, the sound source separation devices 100 to 300 according to the first to third embodiments enable high-quality sound source separation at low computational cost, and can therefore be incorporated into any of a voice recognition system, a voice communication system, or an abnormal sound monitoring system. They can thereby contribute to improving the recognition rate of remote voice recognition systems such as car navigation systems and televisions, and to improving the quality of hands-free call systems such as mobile phones and intercoms, video conference systems, and abnormal sound monitoring systems.
100, 200, 300 sound source separation device, 101 first microphone, 102 second microphone, 103 A/D conversion unit, 104 T/F conversion unit, 105, 205, 305 mask generation unit, 106 mask coefficient calculation unit, 107, 207, 307 utterance amount ratio calculation unit, 108 gain calculation unit, 109 mask correction unit, 110 masking filter unit, 111 T/F inverse conversion unit, 112 D/A conversion unit, 250 information processing system.

Claims (6)

1. An information processing device comprising:
an analog/digital conversion unit that receives input of a first observation analog signal generated by a first microphone on the basis of an observation sound including a target sound arriving from a first direction, and a second observation analog signal generated by a second microphone on the basis of the observation sound, and converts each of the first observation analog signal and the second observation analog signal into a digital signal, thereby generating a first observation digital signal and a second observation digital signal;
a time/frequency conversion unit that converts each of the first observation digital signal and the second observation digital signal into a frequency-domain signal, thereby generating a first spectral component and a second spectral component;
a mask generation unit that calculates, using a cross-correlation function of the first spectral component and the second spectral component, on the basis of a time difference between a time at which the observation sound arrives at the first microphone and a time at which the observation sound arrives at the second microphone, a filtering coefficient for masking spectral components of sound arriving from directions different from the first direction;
a masking filter unit that separates a spectral component by masking the first spectral component using the filtering coefficient; and
a time/frequency inverse conversion unit that generates an output digital signal by converting the separated spectral component into a time-domain signal.
2. The information processing device according to claim 1, wherein
the observation sound includes an interfering sound arriving from a second direction different from the first direction, and
the mask generation unit includes:
a mask coefficient calculation unit that uses the cross-correlation function of the first spectral component and the second spectral component to distinguish, within the observation sound, on the basis of a first time difference between a time at which the target sound arrives at the first microphone and a time at which the target sound arrives at the second microphone and a second time difference between a time at which the interfering sound arrives at the first microphone and a time at which the interfering sound arrives at the second microphone, sound arriving from a first range including the first direction from sound arriving from a second range that includes the second direction and does not overlap the first range, and calculates a mask coefficient for separating spectral components of the sound arriving from the first range from spectral components of the sound arriving from the second range;
an utterance amount ratio calculation unit that calculates a ratio of the amount of spectral components of the sound arriving from the first range to the amount of spectral components of the sound arriving from the second range within the first spectral component;
a gain calculation unit that calculates a correction gain for correcting the mask coefficient such that the higher the ratio is, the lower the intensity at which the masking is performed becomes; and
a mask correction unit that calculates the filtering coefficient by correcting the mask coefficient with the correction gain.
3. The information processing device according to claim 1, wherein
the observation sound includes an interfering sound arriving from a second direction different from the first direction, and
the mask generation unit includes:
a mask coefficient calculation unit that uses the cross-correlation function of the first spectral component and the second spectral component to distinguish, within the observation sound, on the basis of a first time difference between a time at which the target sound arrives at the first microphone and a time at which the target sound arrives at the second microphone and a second time difference between a time at which the interfering sound arrives at the first microphone and a time at which the interfering sound arrives at the second microphone, sound arriving from a first range including the first direction from sound arriving from a second range that includes the second direction and does not overlap the first range, and calculates a mask coefficient for separating spectral components of the sound arriving from the first range from spectral components of the sound arriving from the second range;
an utterance amount ratio calculation unit that sequentially calculates, as time elapses, a ratio of the amount of spectral components of the sound arriving from the first range to the amount of spectral components of the sound arriving from the second range within the first spectral component, and smooths the most recently calculated ratio using ratios calculated in the past;
a gain calculation unit that calculates a correction gain for correcting the mask coefficient such that the higher the smoothed ratio is, the lower the intensity at which the masking is performed becomes; and
a mask correction unit that calculates the filtering coefficient by correcting the mask coefficient with the correction gain.
4. The information processing device according to claim 2 or 3, wherein the utterance amount ratio calculation unit calculates the ratio while excluding spectral components of sound arriving from directions included in neither the first range nor the second range.
5. A program for causing a computer to function as:
an analog/digital conversion unit that receives input of a first observation analog signal generated by a first microphone on the basis of an observation sound including a target sound arriving from a first direction, and a second observation analog signal generated by a second microphone on the basis of the observation sound, and converts each of the first observation analog signal and the second observation analog signal into a digital signal, thereby generating a first observation digital signal and a second observation digital signal;
a time/frequency conversion unit that converts each of the first observation digital signal and the second observation digital signal into a frequency-domain signal, thereby generating a first spectral component and a second spectral component;
a mask generation unit that calculates, using a cross-correlation function of the first spectral component and the second spectral component, on the basis of a time difference between a time at which the observation sound arrives at the first microphone and a time at which the observation sound arrives at the second microphone, a filtering coefficient for masking spectral components of sound arriving from directions different from the first direction;
a masking filter unit that separates a spectral component by masking the first spectral component using the filtering coefficient; and
a time/frequency inverse conversion unit that generates an output digital signal by converting the separated spectral component into a time-domain signal.
6. An information processing method comprising:
receiving input of a first observation analog signal generated by a first microphone on the basis of an observation sound including a target sound arriving from a first direction, and a second observation analog signal generated by a second microphone on the basis of the observation sound, and converting each of the first observation analog signal and the second observation analog signal into a digital signal, thereby generating a first observation digital signal and a second observation digital signal;
converting each of the first observation digital signal and the second observation digital signal into a frequency-domain signal, thereby generating a first spectral component and a second spectral component;
calculating, using a cross-correlation function of the first spectral component and the second spectral component, on the basis of a time difference between a time at which the observation sound arrives at the first microphone and a time at which the observation sound arrives at the second microphone, a filtering coefficient for masking spectral components of sound arriving from directions different from the first direction;
separating a spectral component by masking the first spectral component using the filtering coefficient; and
generating an output digital signal by converting the separated spectral component into a time-domain signal.
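By way of illustration only (not the claimed implementation: the STFT parameters, the binary mask standing in for the gain-corrected mask coefficients, and all function names are assumptions), a Python sketch of the claimed two-microphone flow might look like this:

```python
import numpy as np
from scipy.signal import stft, istft

def separate(x1, x2, fs=16000, nperseg=512):
    """Sketch of the claimed flow: T/F conversion -> time-difference-based
    mask -> masking filter -> inverse T/F conversion."""
    # Time/frequency conversion of both observation digital signals.
    f, t, X1 = stft(x1, fs=fs, nperseg=nperseg)
    _, _, X2 = stft(x2, fs=fs, nperseg=nperseg)

    # Cross spectrum; the sign of its imaginary part indicates on which
    # side of the microphone pair each time-frequency bin's sound arrived.
    D = X1 * np.conj(X2)
    mask = (D.imag > 0).astype(float)   # keep bins attributed to the target side

    # Masking filter applied to the first channel's spectral component.
    Y = X1 * mask

    # Inverse time/frequency conversion yields the output digital signal y(t).
    _, y = istft(Y, fs=fs, nperseg=nperseg)
    return y
```

A faithful implementation would replace the hard binary mask with mask coefficients corrected by the utterance-amount-ratio-dependent gain, as claims 2 and 3 recite.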
PCT/JP2018/043747 2018-11-28 2018-11-28 Information processing device, program and information processing method WO2020110228A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2020557460A JP6840302B2 (en) 2018-11-28 2018-11-28 Information processing equipment, programs and information processing methods
PCT/JP2018/043747 WO2020110228A1 (en) 2018-11-28 2018-11-28 Information processing device, program and information processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2018/043747 WO2020110228A1 (en) 2018-11-28 2018-11-28 Information processing device, program and information processing method

Publications (1)

Publication Number Publication Date
WO2020110228A1 (en) 2020-06-04

Family

ID=70854207

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2018/043747 WO2020110228A1 (en) 2018-11-28 2018-11-28 Information processing device, program and information processing method

Country Status (2)

Country Link
JP (1) JP6840302B2 (en)
WO (1) WO2020110228A1 (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004289762A (en) * 2003-01-29 2004-10-14 Toshiba Corp Method of processing sound signal, and system and program therefor
JP2011113044A (en) * 2009-11-30 2011-06-09 Internatl Business Mach Corp <Ibm> Method, device and program for objective voice extraction
JP2013061421A (en) * 2011-09-12 2013-04-04 Oki Electric Ind Co Ltd Device, method, and program for processing voice signals
JP2013097273A (en) * 2011-11-02 2013-05-20 Toyota Motor Corp Sound source estimation device, method, and program and moving body

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11205416B2 (en) * 2018-12-04 2021-12-21 Fujitsu Limited Non-transitory computer-readable storage medium for storing utterance detection program, utterance detection method, and utterance detection apparatus
WO2022244173A1 (en) * 2021-05-20 2022-11-24 三菱電機株式会社 Sound collection device, sound collection method, and sound collection program
JP7286057B2 (en) 2021-05-20 2023-06-02 三菱電機株式会社 SOUND COLLECTION DEVICE, SOUND COLLECTION METHOD, AND SOUND COLLECTION PROGRAM

Also Published As

Publication number Publication date
JPWO2020110228A1 (en) 2021-03-11
JP6840302B2 (en) 2021-03-10

Similar Documents

Publication Publication Date Title
US8521530B1 (en) System and method for enhancing a monaural audio signal
JP5528538B2 (en) Noise suppressor
CN108604452B (en) Sound signal enhancement device
EP2773137B1 (en) Microphone sensitivity difference correction device
US8712074B2 (en) Noise spectrum tracking in noisy acoustical signals
WO2015196729A1 (en) Microphone array speech enhancement method and device
US20170140771A1 (en) Information processing apparatus, information processing method, and computer program product
KR101475864B1 (en) Apparatus and method for eliminating noise
JP6780644B2 (en) Signal processing equipment, signal processing methods, and signal processing programs
JP5834088B2 (en) Dynamic microphone signal mixer
JP6545419B2 (en) Acoustic signal processing device, acoustic signal processing method, and hands-free communication device
JPWO2006046293A1 (en) Noise suppressor
EP1913591B1 (en) Enhancement of speech intelligibility in a mobile communication device by controlling the operation of a vibrator in dependance of the background noise
JP4448464B2 (en) Noise reduction method, apparatus, program, and recording medium
WO2020110228A1 (en) Information processing device, program and information processing method
US11380312B1 (en) Residual echo suppression for keyword detection
US11386911B1 (en) Dereverberation and noise reduction
US10951978B2 (en) Output control of sounds from sources respectively positioned in priority and nonpriority directions
US20220208206A1 (en) Noise suppression device, noise suppression method, and storage medium storing noise suppression program
JP2005514668A (en) Speech enhancement system with a spectral power ratio dependent processor
JP7013789B2 (en) Computer program for voice processing, voice processing device and voice processing method
JP6631127B2 (en) Voice determination device, method and program, and voice processing device
US20130226568A1 (en) Audio signals by estimations and use of human voice attributes
JP6221463B2 (en) Audio signal processing apparatus and program
KR20200054754A (en) Audio signal processing method and apparatus for enhancing speech recognition in noise environments

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18941765

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020557460

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18941765

Country of ref document: EP

Kind code of ref document: A1