CN110675890B - Audio signal processing device and audio signal processing method - Google Patents

Audio signal processing device and audio signal processing method

Info

Publication number: CN110675890B
Application number: CN201910070357.XA
Authority: CN (China)
Prior art keywords: target, sound, correlation matrix, sound signal, target sound
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN110675890A
Inventor: 笼岛岳彦 (Takehiko Kagoshima)
Current and original assignee: Toshiba Corp
Application filed by Toshiba Corp
Publication of CN110675890A (application), CN110675890B (grant)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02087 Noise filtering, the noise being separate speech, e.g. cocktail party
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Abstract

The present invention relates to an audio signal processing apparatus and an audio signal processing method capable of emphasizing a target sound signal with high accuracy. The audio signal processing device includes a coefficient derivation unit. The coefficient derivation unit derives a spatial filter coefficient F(f, n) for emphasizing a target sound signal included in the 1 st sound signal, from an emphasized sound signal in which the target sound signal is emphasized.

Description

Audio signal processing device and audio signal processing method
This application is based on and claims the benefit of priority from Japanese Patent Application No. 2018-125779, filed on July 2, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present invention relate to an audio signal processing apparatus and an audio signal processing method.
Background
A technique for emphasizing a target sound signal included in sound signals of sounds emitted from a plurality of sound sources is known. For example, a technique is disclosed in which an SN-ratio-maximizing beamformer calculated from feature amounts of sound signals sensed with microphones is used as a filter for emphasizing the target sound signal included in those sound signals. As the feature amount, a vector representing the speaker direction and the difference in sound arrival time between the microphones is used.
In such conventional approaches, which extract a feature amount from the sensed sound signal and calculate a filter for emphasizing the target sound signal from that feature amount, it has sometimes been difficult to emphasize the target sound signal with high accuracy.
Disclosure of Invention
The present invention addresses the problem of providing an audio signal processing device and an audio signal processing method that can accurately emphasize a target audio signal.
The audio signal processing device according to an embodiment includes a coefficient derivation unit that derives a spatial filter coefficient for emphasizing a target sound signal included in a 1 st sound signal, based on an emphasized sound signal in which the target sound signal is emphasized.
According to the audio signal processing device, the target audio signal can be emphasized with high accuracy.
Drawings
Fig. 1 is a schematic diagram of a sound signal processing system.
Fig. 2 is a schematic diagram of a functional configuration of the audio signal processing section.
Fig. 3 is a flow chart of sound signal processing.
Fig. 4 is a schematic diagram of a sound signal processing system.
Fig. 5 is a schematic diagram of a functional configuration of the audio signal processing section.
Fig. 6 is a flowchart of sound signal processing.
Fig. 7 is a schematic diagram of a sound signal processing system.
Fig. 8 is an explanatory diagram of the hardware configuration.
(symbol description)
10, 11, 13: sound signal processing device; 12: sound source; 12A, 12A1, 12A2, 12A3: target sound source; 12B: non-target sound source; 14, 14A, 14B, 14C, 14D: 1 st microphone; 16: 2 nd microphone; 20, 30: sound signal processing unit; 20C: detection unit; 20D: correlation derivation unit; 20G: coefficient derivation unit; 20H: generation unit; 24: identification unit; 30C: detection unit; 30D: correlation derivation unit; 30G, 30G1, 30G2, 30G3: coefficient derivation unit; 30H, 30H1, 30H2, 30H3: generation unit; 30J: separation unit.
Detailed Description
Hereinafter, the present embodiment will be described in detail with reference to the drawings.
(embodiment 1)
Fig. 1 is a schematic diagram showing an example of an audio signal processing system 1 according to the present embodiment.
The audio signal processing system 1 includes a sound signal processing device 10, a 1 st microphone 14, and a 2 nd microphone 16. The sound signal processing device 10 is connected to the 1 st microphone 14 and the 2 nd microphone 16 so that data and signals can be exchanged.
The sound signal processing apparatus 10 processes sound signals of sounds emitted from 1 or more sound sources 12.
The sound source 12 is a generation source of sound. The sound source 12 is, for example, a living body such as a human or an animal other than a human, or a non-living body such as a musical instrument, but is not limited thereto. In the present embodiment, a case where the sound source 12 is a human will be described as an example. Therefore, in the present embodiment, a case where the sound is a voice will be described as an example. Note that the type of sound is not limited. Hereinafter, the person is sometimes referred to as a speaker.
In the present embodiment, the sound signal processing device 10 processes sound signals including sounds emitted from a plurality of sound sources 12, and emphasizes a target sound signal included in those sound signals. The plurality of sound sources 12 are classified into a target sound source 12A and a non-target sound source 12B. The target sound source 12A is a sound source 12 that emits a target sound. The target sound is the sound to be emphasized. The target sound signal is a signal representing the target sound and is represented by, for example, a spectrum. The non-target sound source 12B is a sound source 12 that emits a non-target sound. The non-target sound is a sound other than the target sound.
In the present embodiment, an environment is assumed in which two speakers, a target sound source 12A and a non-target sound source 12B, converse across a table T. For example, the following application is assumed: the non-target sound source 12B is a store clerk, the target sound source 12A is a customer, and the target sound signal of the target sound source 12A, which is one of the speakers, is emphasized from the sound signals representing the conversation between these speakers. The number and arrangement of the sound sources 12 are not limited to these, and the assumed environment is not limited to this environment.
The 1 st microphone 14 and the 2 nd microphone 16 capture sound. In the present embodiment, the 1 st microphone 14 and the 2 nd microphone 16 collect sounds emitted from the sound source 12 and output sound signals to the sound signal processing device 10.
The 1 st microphone 14 is a microphone for picking up sound that contains at least the target sound. In other words, the 1 st microphone 14 is a microphone for picking up at least the target sound emitted from the target sound source 12A.
The 1 st microphone 14 outputs a 3 rd sound signal to the sound signal processing device 10 as a sound signal representing the collected sound. The 3 rd sound signal is a sound signal including a non-target sound signal and the target sound signal. The non-target sound signal is a signal representing the non-target sound and is represented by, for example, a spectrum. The 1 st microphone 14 is arranged in advance at a position where it can collect the sounds emitted from the sound sources 12 (the target sound source 12A and the non-target sound source 12B) and output the 3 rd sound signal to the sound signal processing device 10. In the present embodiment, the 1 st microphone 14 is assumed to be disposed on the table T.
In the present embodiment, the audio signal processing system 1 includes a plurality of 1 st microphones 14 (1 st microphone 14A to 1 st microphone 14D). Therefore, a plurality of 3 rd sound signals are output from the plurality of 1 st microphones 14 to the sound signal processing device 10. In the following description, the plurality of 3 rd sound signals are collected into one sound signal, which is referred to as the 1 st sound signal.
The number of 1 st microphones 14 may be equal to or greater than the number of sound sources 12 to be collected. As described above, the sound signal processing system 1 of the present embodiment assumes 2 sound sources 12 in total: one target sound source 12A and one non-target sound source 12B. In this case, the number of 1 st microphones 14 may be 2 or more. In the present embodiment, a case where the audio signal processing system 1 includes four 1 st microphones 14 (1 st microphone 14A to 1 st microphone 14D) will be described as an example.
With respect to the plurality of 1 st microphones 14, the sound arrival time differences from the respective sound sources of the plurality of sound sources 12 are different from each other. That is, the plurality of 1 st microphones 14 are arranged in advance so that the sound arrival time differences are different from each other.
The 2 nd microphone 16 is a microphone for picking up at least the non-target sound. In other words, the 2 nd microphone 16 is a microphone for picking up at least the non-target sound emitted from the non-target sound source 12B.
The 2 nd microphone 16 outputs a 2 nd sound signal to the sound signal processing device 10 as a sound signal representing the collected sound. The 2 nd sound signal is a sound signal in which the ratio of the power of the non-target sound signal to the power of the target sound signal is larger than in the 1 st sound signal (3 rd sound signal). The 2 nd sound signal is preferably a sound signal in which this power ratio is larger than in the 1 st sound signal (3 rd sound signal) and the power of the non-target sound signal is larger than the power of the target sound signal.
In the present embodiment, the 2 nd microphone 16 is disposed at a position closer to the non-target sound source 12B than the 1 st microphone 14 is. For example, the 2 nd microphone 16 is a headset microphone or a pin (lavalier) microphone. In the present embodiment, the 2 nd microphone 16 is attached to the non-target sound source 12B so that it can pick up sound near the mouth of the speaker who is the non-target sound source 12B.
The audio signal processing device 10 includes an AD converter 18, an audio signal processor 20, and an output unit 22. The audio signal processing device 10 may be configured to include at least the audio signal processing unit 20, and at least one of the AD conversion unit 18 and the output unit 22 may be configured independently.
The AD converter 18 receives a plurality of 3 rd audio signals from the plurality of 1 st microphones 14. The AD converter 18 receives the 2 nd audio signal from the 2 nd microphone 16. The AD converter 18 converts each of the plurality of 3 rd audio signals and 2 nd audio signals into a digital signal and outputs the digital signal to the audio signal processor 20.
The audio signal processing unit 20 uses the plurality of 3 rd sound signals and the 2 nd sound signal received from the AD conversion unit 18 to emphasize the target sound signal included in the 1 st sound signal obtained by collecting the plurality of 3 rd sound signals into one, and outputs the emphasized sound signal to the output unit 22.
The output unit 22 is a device that outputs the emphasized sound signal received from the audio signal processing unit 20. The output unit 22 is, for example, a speaker, a communication device, a display device, a sound recording device, or a recording device. The speaker outputs the sound represented by the emphasized sound signal. The communication device transmits the emphasized sound signal to an external device or the like via a network or the like. The display device displays information representing the emphasized sound signal. The sound recording device stores the emphasized sound signal and is, for example, an IC recorder or a personal computer. The recording device converts the speech represented by the emphasized sound signal into text by a known method and records the text. The output unit 22 may convert the emphasized sound signal received from the audio signal processing unit 20 into an analog signal, and output, transmit, store, or record the analog signal.
Next, the audio signal processing unit 20 will be described in detail.
Fig. 2 is a schematic diagram showing an example of the functional configuration of the audio signal processing unit 20.
The audio signal processing unit 20 includes a conversion unit 20A, a conversion unit 20B, a detection unit 20C, a correlation derivation unit 20D, a1 st correlation storage unit 20E, a2 nd correlation storage unit 20F, a coefficient derivation unit 20G, a generation unit 20H, and an inverse conversion unit 20I.
The transform unit 20A, the transform unit 20B, the detection unit 20C, the correlation derivation unit 20D, the coefficient derivation unit 20G, the generation unit 20H, and the inverse transform unit 20I are realized by, for example, 1 or more processors. For example, the above-described respective sections may be realized by software that causes a processor such as a CPU (Central Processing Unit) to execute a program. The above-described parts may be implemented by a processor such as a dedicated IC (Integrated Circuit), that is, hardware. The above-described parts may be implemented by using software together with hardware. In the case of using a plurality of processors, each processor may realize 1 of the respective sections, or may realize 2 or more of the respective sections.
The 1 st correlation storage unit 20E and the 2 nd correlation storage unit 20F store various kinds of information. They can be configured using any commonly used storage medium such as an HDD (Hard Disk Drive), an optical disk, a memory card, or a RAM (Random Access Memory). The 1 st correlation storage unit 20E and the 2 nd correlation storage unit 20F may be implemented as physically different storage media, or as different storage areas of the same physical storage medium. Furthermore, each of the 1 st correlation storage unit 20E and the 2 nd correlation storage unit 20F may be realized by a plurality of physically different storage media.
The conversion unit 20A performs a Short-Time Fourier Transform (STFT) on the 2 nd sound signal received from the 2 nd microphone 16 via the AD conversion unit 18, and outputs the 2 nd sound signal represented by the spectrum X_1(f, n) to the detection unit 20C. Here, f denotes the number of a frequency bin, and n denotes the number of a frame.
For example, the sampling frequency is 16 kHz, the frame length is 256 samples, and the frame shift is 128 samples. In this case, the conversion unit 20A applies a 256-sample Hanning window to the 2 nd sound signal and then performs a Fast Fourier Transform (FFT) to convert the 2 nd sound signal into a spectrum. Considering the symmetry between the low band and the high band of the spectrum, the complex values of the 129 points with f in the range 0 to 128 are taken as the spectrum X_1(f, n) of the n-th frame of the 2 nd sound signal. The conversion unit 20A then outputs the 2 nd sound signal represented by the spectrum X_1(f, n) to the detection unit 20C.
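The framing and transform just described can be illustrated with the following minimal sketch (NumPy assumed; the function name stft_129 is not part of the embodiment): a 256-sample Hanning window, a 128-sample frame shift, and 129 retained bins.

```python
import numpy as np

def stft_129(x, frame_len=256, frame_shift=128):
    """Short-time Fourier transform as described above: Hanning window,
    256-sample frames, 128-sample shift, keeping bins f = 0..128."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    spec = np.empty((n_frames, frame_len // 2 + 1), dtype=complex)
    for n in range(n_frames):
        frame = x[n * frame_shift : n * frame_shift + frame_len]
        # FFT of the windowed frame; conjugate symmetry lets us keep only 129 bins
        spec[n] = np.fft.rfft(window * frame)
    return spec  # spec[n, f] corresponds to X(f, n)
```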
The conversion unit 20B performs a short-time Fourier transform (STFT) on each of the 3 rd sound signals received from the 1 st microphones 14 (1 st microphone 14A to 1 st microphone 14D) via the AD conversion unit 18, and generates the 3 rd sound signals represented by the spectra X_2,1(f, n), X_2,2(f, n), X_2,3(f, n), and X_2,4(f, n), respectively.
The spectrum X_2,1(f, n) is obtained by applying the short-time Fourier transform to the 3 rd sound signal received from the 1 st microphone 14A; X_2,2(f, n), X_2,3(f, n), and X_2,4(f, n) are obtained in the same way from the 3 rd sound signals received from the 1 st microphone 14B, the 1 st microphone 14C, and the 1 st microphone 14D, respectively.
Hereinafter, the multidimensional vector (a 4-dimensional vector in the present embodiment) collecting the spectra of the respective 3 rd sound signals is referred to as the spectrum X_2(f, n) representing the 1 st sound signal. In other words, the 1 st sound signal is represented by the spectrum X_2(f, n), which is given by the following formula (1).
[Equation 1]
X_2(f, n) = [X_2,1(f, n), X_2,2(f, n), X_2,3(f, n), X_2,4(f, n)]   … formula (1)
The conversion unit 20B outputs the spectrum X_2(f, n) representing the 1 st sound signal to the correlation derivation unit 20D and the generation unit 20H.
The 1 st correlation storage unit 20E stores the 1 st spatial correlation matrix Φ_XX(f, n). The 1 st spatial correlation matrix Φ_XX(f, n) is a spatial correlation matrix of the target sound section in the 1 st sound signal. The target sound section is a section of the 1 st sound signal that contains the target sound; a section is a specific period in the time-series direction.
As described above, in the present embodiment the 1 st sound signal is represented by the 4-dimensional vector spectrum X_2(f, n). The 1 st spatial correlation matrix Φ_XX(f, n) is therefore represented by a 4 x 4 complex matrix for each frequency bin.
In the initial state, the 1 st correlation storage unit 20E stores a 1 st spatial correlation matrix Φ_XX(f, n) initialized with the zero matrix. The 1 st spatial correlation matrix Φ_XX(f, n) is updated by the correlation derivation unit 20D described later.
The 2 nd correlation storage unit 20F stores the 2 nd spatial correlation matrix Φ_NN(f, n). The 2 nd spatial correlation matrix Φ_NN(f, n) is a spatial correlation matrix of the non-target sound section in the 1 st sound signal. The non-target sound section is a section of the 1 st sound signal other than the target sound section.
As with the 1 st spatial correlation matrix Φ_XX(f, n), in the present embodiment the 2 nd spatial correlation matrix Φ_NN(f, n) is represented by a 4 x 4 complex matrix for each frequency bin.
In the initial state, the 2 nd correlation storage unit 20F stores a 2 nd spatial correlation matrix Φ_NN(f, n) initialized with the zero matrix. The 2 nd spatial correlation matrix Φ_NN(f, n) is updated by the processing of the correlation derivation unit 20D described later.
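A minimal sketch of how these two stored matrices might be held and zero-initialized, assuming one 4 x 4 complex matrix per frequency bin (129 bins, as in the STFT sketch above); the variable names Phi_XX and Phi_NN are illustrative only.

```python
import numpy as np

N_BINS = 129   # frequency bins f = 0..128
N_CH = 4       # four 1st microphones 14A-14D

# 1st spatial correlation matrix (target sound section), zero-initialized
Phi_XX = np.zeros((N_BINS, N_CH, N_CH), dtype=complex)
# 2nd spatial correlation matrix (non-target sound section), zero-initialized
Phi_NN = np.zeros((N_BINS, N_CH, N_CH), dtype=complex)
```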
Next, the detection unit 20C, the correlation derivation unit 20D, the coefficient derivation unit 20G, the generation unit 20H, and the inverse transform unit 20I will be described. In the present embodiment, the audio signal processing unit 20 performs the stabilization processing after performing the initial processing at the start of the audio signal processing. The correlation derivation unit 20D, the coefficient derivation unit 20G, and the generation unit 20H execute different processes between the initial process and the stabilization process.
First, the functions of the correlation derivation unit 20D, the coefficient derivation unit 20G, and the generation unit 20H in the initial processing will be described.
The initial processing is processing executed by the audio signal processing unit 20 at the start of the audio signal processing. In the initial processing, the audio signal processing unit 20 updates the 1 st spatial correlation matrix Φ_XX(f, n) and the 2 nd spatial correlation matrix Φ_NN(f, n), which are stored in the 1 st correlation storage unit 20E and the 2 nd correlation storage unit 20F initialized with the zero matrix, and thereby sets initial values for these spatial correlation matrices.
The coefficient derivation unit 20G derives a spatial filter coefficient F(f, n) for emphasizing the target sound signal included in the 1 st sound signal. The coefficient derivation unit 20G derives the spatial filter coefficient F(f, n) from the 1 st spatial correlation matrix Φ_XX(f, n) and the 2 nd spatial correlation matrix Φ_NN(f, n).
As described above, in the present embodiment the 1 st sound signal is represented by the 4-dimensional vector spectrum X_2(f, n). Therefore, the coefficient derivation unit 20G calculates, from the 1 st spatial correlation matrix Φ_XX(f, n) and the 2 nd spatial correlation matrix Φ_NN(f, n), a spatial filter coefficient F(f, n) that is a complex 4-dimensional vector. The spatial filter coefficient F(f, n) is expressed by the following formula (2).
[Equation 2]
F(f, n) = [F_1(f, n), F_2(f, n), F_3(f, n), F_4(f, n)]   … formula (2)
In the initial processing, the coefficient derivation unit 20G derives F(f, n) = [0, 0, 0, 1] as the spatial filter coefficient F(f, n).
The generation unit 20H uses the spatial filter coefficient F(f, n) derived by the coefficient derivation unit 20G to generate an emphasized sound signal in which the target sound signal included in the 1 st sound signal represented by the spectrum X_2(f, n) is emphasized.
Specifically, the generation unit 20H generates the emphasized sound signal represented by the output spectrum Y(f, n) using the following formula (3).
[Equation 3]
Y(f, n) = X_2(f, n) F^H(f, n)   … formula (3)
That is, the generation unit 20H generates, as the output spectrum Y(f, n) representing the emphasized sound signal, the product of the spectrum X_2(f, n) and the Hermitian transpose of the spatial filter coefficient F(f, n). In the initial processing, the generation unit 20H outputs Y(f, n) = X_2,4(f, n) as the emphasized sound signal. That is, in the initial processing, the generation unit 20H outputs the spectrum of the 3 rd sound signal picked up by the 1 st microphone 14D as the emphasized sound signal. Note that the 1 st microphone 14 used for the emphasized sound signal in the initial processing is not limited to the 1 st microphone 14D; it may be any one of the plurality of 1 st microphones 14 (1 st microphone 14A to 1 st microphone 14D).
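The following NumPy sketch illustrates formula (3), forming the output spectrum as the inner product of the 4-channel spectrum X_2(f, n) with the Hermitian-transposed filter; the function name apply_spatial_filter is an assumption for illustration.

```python
import numpy as np

def apply_spatial_filter(X2, F):
    """X2: array (n_frames, n_bins, 4) of multichannel spectra X_2(f, n).
    F:  array (n_frames, n_bins, 4) of spatial filter coefficients F(f, n).
    Returns Y(f, n) = X_2(f, n) F^H(f, n) as an (n_frames, n_bins) array."""
    return np.sum(X2 * np.conj(F), axis=-1)

# In the initial processing the filter simply selects the 4th microphone:
# F[:, :, :] = [0, 0, 0, 1], so Y(f, n) = X_2,4(f, n).
```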
The generation unit 20H outputs the emphasized audio signal represented by the output spectrum Y (f, n) to the inverse transform unit 20I and the detection unit 20C.
The detection unit 20C detects the target audio segment from the emphasized audio signal. In the present embodiment, the detection unit 20C detects the target audio section from the 2 nd audio signal and the emphasized audio signal.
Specifically, the detection unit 20C detects the target sound section from the 2 nd sound signal represented by the spectrum X_1(f, n) and from the emphasized sound signal, represented by the output spectrum Y(f, n), received from the generation unit 20H.
The target sound section is represented by a function u_2(n) that indicates, for each frame number, whether the target sound source 12A is emitting sound.
u_2(n) = 1 indicates that the target sound source 12A emits sound in the n-th frame. u_2(n) = 0 indicates that the target sound source 12A does not emit sound in the n-th frame.
Specifically, the function u_2(n) is expressed by the following formula (4).
[Equation 4]
u_2(n) = 1  (if p_Y(n) > t_1 and p_X(n) - p_Y(n) < t_2),  0  (otherwise)   … formula (4)
In formula (4), p_Y(n) and p_X(n) are given by the following formulas (5) and (6); they are the powers of the emphasized sound signal represented by the output spectrum Y(f, n) and of the 2 nd sound signal represented by the spectrum X_1(f, n), respectively.
[Equation 5]
p_Y(n) = Σ_f |Y(f, n)|^2   … formula (5)
p_X(n) = Σ_f |X_1(f, n)|^2   … formula (6)
Here, at the stage of the initial processing, p_Y(n) includes spectral power corresponding to the sounds of both the target sound source 12A and the non-target sound source 12B. Therefore, in formula (4), the threshold t_1 is set in advance so that t_1 < p_Y(n) holds whenever a sound is emitted from the target sound source 12A or the non-target sound source 12B.
When the non-target sound source 12B, rather than the target sound source 12A, emits a sound, p_X(n) - p_Y(n) is relatively large. Therefore, in formula (4), the threshold t_2 is set in advance so that p_X(n) - p_Y(n) ≥ t_2 holds when the non-target sound source 12B emits a sound.
With these settings, the function u_2(n) takes the value "1" in the n-th frame when only the target sound source 12A emits a sound, and takes the value "0" in the n-th frame when the target sound source 12A does not emit a sound.
Therefore, the detection unit 20C detects the sections with u_2(n) = 1 as the target sound section and the sections with u_2(n) = 0 as the non-target sound section.
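As an illustration of this power-based section detection, the sketch below computes p_Y(n) and p_X(n) and applies the two thresholds of formula (4); the default threshold values and the function name detect_target_sections are assumptions chosen only for illustration.

```python
import numpy as np

def detect_target_sections(Y, X1, t1=1e-3, t2=1e-3):
    """Y:  (n_frames, n_bins) emphasized spectrum Y(f, n).
    X1: (n_frames, n_bins) 2nd-microphone spectrum X_1(f, n).
    Returns u2[n] = 1 for frames judged to contain only the target sound."""
    p_Y = np.sum(np.abs(Y) ** 2, axis=1)   # formula (5)
    p_X = np.sum(np.abs(X1) ** 2, axis=1)  # formula (6)
    u2 = ((p_Y > t1) & (p_X - p_Y < t2)).astype(int)  # formula (4)
    return u2
```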
The correlation derivation unit 20D derives the 1 st spatial correlation matrix Φ_XX(f, n) and the 2 nd spatial correlation matrix Φ_NN(f, n) from the target sound section detected by the detection unit 20C and from the 1 st sound signal received from the 1 st microphones 14 via the conversion unit 20B and the AD conversion unit 18. The correlation derivation unit 20D then stores the derived 1 st spatial correlation matrix Φ_XX(f, n) in the 1 st correlation storage unit 20E, thereby updating the 1 st spatial correlation matrix Φ_XX(f, n). Similarly, the correlation derivation unit 20D stores the derived 2 nd spatial correlation matrix Φ_NN(f, n) in the 2 nd correlation storage unit 20F, thereby updating the 2 nd spatial correlation matrix Φ_NN(f, n).
More specifically, in the sections (n-th frames) with u_2(n) = 1, the correlation derivation unit 20D derives and updates the 1 st spatial correlation matrix Φ_XX(f, n) using the following formula (7), and does not update the 2 nd spatial correlation matrix Φ_NN(f, n). On the other hand, in the sections (n-th frames) with u_2(n) = 0, the correlation derivation unit 20D derives and updates the 2 nd spatial correlation matrix Φ_NN(f, n) using the following formula (8), and does not update the 1 st spatial correlation matrix Φ_XX(f, n).
[Equation 6]
Φ_XX(f, n) = α Φ_XX(f, n-1) + (1 - α) X_2^H(f, n) X_2(f, n)   … formula (7)
Φ_NN(f, n) = α Φ_NN(f, n-1) + (1 - α) X_2^H(f, n) X_2(f, n)   … formula (8)
In formulas (7) and (8), α is a value of 0 or more and less than 1. The closer α is to 1, the larger the weight of the previously derived spatial correlation matrix relative to the latest one. The value of α may be set in advance; for example, α may be 0.95.
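A minimal sketch of the recursive updates in formulas (7) and (8), assuming the matrices are held as NumPy arrays of shape (n_bins, 4, 4); the function name update_correlation is illustrative, and the default α = 0.95 follows the example value given above.

```python
import numpy as np

def update_correlation(Phi, X2_frame, alpha=0.95):
    """Phi:      (n_bins, 4, 4) spatial correlation matrix, updated in place.
    X2_frame: (n_bins, 4) multichannel spectrum X_2(f, n) of the current frame.
    Implements Phi(f, n) = alpha * Phi(f, n-1) + (1 - alpha) * X_2^H(f, n) X_2(f, n)."""
    # X_2^H X_2 per bin: entry (i, j) = conj(X_2,i) * X_2,j
    outer = np.conj(X2_frame)[:, :, None] * X2_frame[:, None, :]
    Phi *= alpha
    Phi += (1.0 - alpha) * outer
    return Phi

# Frames with u2(n) == 1 update Phi_XX and frames with u2(n) == 0 update Phi_NN
# (in the stabilization processing, overlap frames update neither).
```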
That is, the correlation derivation unit 20D derives a new 1 st spatial correlation matrix Φ_XX(f, n) by correcting the previously derived 1 st spatial correlation matrix Φ_XX(f, n-1) with the latest spatial correlation matrix expressed by the product of the 1 st sound signal of the target sound section and its Hermitian transpose. The 1 st sound signal of the target sound section means the portion of the 1 st sound signal within the target sound section.
The correlation derivation unit 20D may simply use the 1 st spatial correlation matrix Φ_XX stored in the 1 st correlation storage unit 20E as the previously derived 1 st spatial correlation matrix. The 1 st correlation storage unit 20E holds only one 1 st spatial correlation matrix Φ_XX, which is sequentially updated by the correlation derivation unit 20D.
Likewise, the correlation derivation unit 20D derives a new 2 nd spatial correlation matrix Φ_NN(f, n) by correcting the previously derived 2 nd spatial correlation matrix Φ_NN(f, n-1) with the latest spatial correlation matrix expressed by the product of the 1 st sound signal of the non-target sound section and its Hermitian transpose. The 1 st sound signal of the non-target sound section means the portion of the 1 st sound signal within the non-target sound section.
The correlation derivation unit 20D may simply use the 2 nd spatial correlation matrix Φ_NN stored in the 2 nd correlation storage unit 20F as the previously derived 2 nd spatial correlation matrix. The 2 nd correlation storage unit 20F holds only one 2 nd spatial correlation matrix Φ_NN, which is sequentially updated by the correlation derivation unit 20D.
Next, the functions of the correlation derivation unit 20D, the coefficient derivation unit 20G, and the generation unit 20H in the stabilization processing will be described. The stabilization processing is processing performed after the above-described initial processing.
Note that the audio signal processing unit 20 may shift to the stabilization processing after the initial processing has been executed for a predetermined time, or may shift to the stabilization processing when the 1 st spatial correlation matrix Φ_XX(f, n) and the 2 nd spatial correlation matrix Φ_NN(f, n) have been updated a predetermined number of times.
First, the function of the coefficient derivation unit 20G in the stabilization processing will be described. In the initial processing, the coefficient derivation unit 20G derived F(f, n) = [0, 0, 0, 1] as the spatial filter coefficient F(f, n).
In the stabilization processing, the coefficient derivation unit 20G derives the spatial filter coefficient F(f, n) for emphasizing the target sound signal included in the 1 st sound signal from the emphasized sound signal in which the target sound signal is emphasized.
As described above, the 1 st sound signal is composed of the plurality of 3 rd sound signals acquired from the plurality of 1 st microphones 14. Therefore, the coefficient derivation unit 20G derives the spatial filter coefficient F(f, n) from the emphasized sound signal in which the target sound signal included in the 1 st sound signal, composed of the plurality of 3 rd sound signals output from the plurality of 1 st microphones 14, is emphasized.
Specifically, the coefficient derivation unit 20G derives the spatial filter coefficient F(f, n) from the 1 st spatial correlation matrix Φ_XX(f, n) and the 2 nd spatial correlation matrix Φ_NN(f, n) updated by the correlation derivation unit 20D. The coefficient derivation unit 20G only needs to read the 1 st spatial correlation matrix Φ_XX(f, n) and the 2 nd spatial correlation matrix Φ_NN(f, n) from the 1 st correlation storage unit 20E and the 2 nd correlation storage unit 20F and use them to derive the spatial filter coefficient F(f, n).
Here, at the stage of the stabilization processing, the 1 st spatial correlation matrix Φ_XX(f, n) and the 2 nd spatial correlation matrix Φ_NN(f, n) stored in the 1 st correlation storage unit 20E and the 2 nd correlation storage unit 20F are the spatial correlation matrices updated by the correlation derivation unit 20D, that is, matrices updated using the target sound section detected from the emphasized sound signal. Therefore, the coefficient derivation unit 20G derives the spatial filter coefficient F(f, n) from the emphasized sound signal.
Specifically, the coefficient derivation unit 20G derives the eigenvector F_SNR(f, n) corresponding to the maximum eigenvalue of the matrix expressed by the product of the inverse matrix of the 2 nd spatial correlation matrix Φ_NN(f, n) and the 1 st spatial correlation matrix Φ_XX(f, n). The coefficient derivation unit 20G then derives this eigenvector F_SNR(f, n) as the spatial filter coefficient F(f, n) (F(f, n) = F_SNR(f, n)).
The eigenvector F_SNR(f, n) constitutes a MAX-SNR (maximum signal-to-noise ratio) beamformer, which maximizes the power ratio of the target sound to the non-target sound.
The coefficient derivation unit 20G may also derive the spatial filter coefficient F(f, n) by applying a post-filter w(f, n), which improves the sound quality by adjusting the power of each frequency bin, using the following formula (9).
[Equation 7]
F(f, n) = w(f, n) F_SNR(f, n)   … formula (9)
The post-filter w(f, n) is expressed by formula (10), which is given as an image in the original publication and is not reproduced here.
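A sketch of the max-SNR filter derivation described above, computing per frequency bin the eigenvector of inv(Φ_NN(f, n)) Φ_XX(f, n) associated with the largest eigenvalue. NumPy is assumed, the post-filter of formulas (9) and (10) is omitted, the small diagonal loading term eps is a practical assumption not in the original text, and the function name derive_max_snr_filter is illustrative.

```python
import numpy as np

def derive_max_snr_filter(Phi_XX, Phi_NN, eps=1e-6):
    """Phi_XX, Phi_NN: (n_bins, 4, 4) target / non-target spatial correlation matrices.
    Returns F_SNR of shape (n_bins, 4): for each bin, the eigenvector of
    inv(Phi_NN) @ Phi_XX with the largest eigenvalue (a max-SNR beamformer)."""
    n_bins, n_ch, _ = Phi_XX.shape
    F = np.zeros((n_bins, n_ch), dtype=complex)
    loading = eps * np.eye(n_ch)
    for f in range(n_bins):
        # Diagonal loading keeps Phi_NN invertible before it is well estimated.
        M = np.linalg.inv(Phi_NN[f] + loading) @ Phi_XX[f]
        eigvals, eigvecs = np.linalg.eig(M)
        F[f] = eigvecs[:, np.argmax(eigvals.real)]
    return F
```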
Next, the generation unit 20H will be described. In the stabilization processing, as in the initial processing, the generation unit 20H uses the spatial filter coefficient F(f, n) derived by the coefficient derivation unit 20G to generate an emphasized sound signal in which the target sound signal included in the 1 st sound signal represented by the spectrum X_2(f, n) is emphasized. That is, the generation unit 20H generates the emphasized sound signal represented by the output spectrum Y(f, n) using the above formula (3).
The generation unit 20H outputs the generated emphasized sound signal to the inverse transform unit 20I and the detection unit 20C.
The inverse transform unit 20I performs an Inverse Short-Time Fourier Transform (ISTFT) on the emphasized sound signal received from the generation unit 20H, and outputs the result to the output unit 22.
That is, the inverse transform unit 20I transforms the emphasized signal, in which the target sound signal of the target sound emitted from the target sound source 12A is emphasized and the non-target sound signal is suppressed, into a sound waveform in the time domain.
Specifically, the inverse transform unit 20I generates a 256-point spectrum from the output spectrum Y(f, n) using the symmetry of the output spectrum Y(f, n) representing the emphasized signal, and performs an inverse Fourier transform. The inverse transform unit 20I may then generate the sound waveform by applying a synthesis window function and overlap-adding the result with the output waveform of the previous frame, shifted by the frame shift amount.
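The inverse transform step can be sketched as follows: rebuild the 256-point frame from the 129 retained bins, inverse-FFT, window, and overlap-add with a 128-sample shift. NumPy and the function name istft_129 are assumptions, and the Hanning synthesis window is assumed because the exact synthesis window of the embodiment is not specified in the text.

```python
import numpy as np

def istft_129(spec, frame_len=256, frame_shift=128):
    """spec: (n_frames, 129) one-sided spectra Y(f, n).
    Returns the time-domain waveform reconstructed by overlap-add."""
    window = np.hanning(frame_len)  # synthesis window (assumed)
    n_frames = spec.shape[0]
    out = np.zeros(frame_shift * (n_frames - 1) + frame_len)
    for n in range(n_frames):
        # irfft rebuilds the 256-point frame from 129 bins via conjugate symmetry
        frame = np.fft.irfft(spec[n], n=frame_len)
        out[n * frame_shift : n * frame_shift + frame_len] += window * frame
    return out  # amplitude normalization of the window overlap is omitted here
```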
Next, the detection unit 20C will be described. In the initial processing, the detection unit 20C detects the target audio segment.
During the stabilization processing, the detection unit 20C detects the target sound section and the overlap section from the emphasized sound signal and the 2 nd sound signal. The overlap section is a section in which sound is emitted from both the target sound source 12A and the non-target sound source 12B, for example a section in which a plurality of speakers speak at the same time.
Specifically, in addition to the function u_2(n), the detection unit 20C detects a function u_1(n).
The function u_1(n) is a function representing the 2 nd non-target sound section; in detail, it indicates, for each frame number, whether the non-target sound source 12B is emitting sound. The 2 nd non-target sound section is a section in which the non-target sound source 12B emits sound.
Here, at the stage of the stabilization processing, the power of the non-target sound emitted from the non-target sound source 12B contained in the emphasized sound signal represented by the output spectrum Y(f, n) is suppressed. Thus, p_Y(n) given by formula (5) can be regarded approximately as the power of the target sound emitted from the target sound source 12A. Accordingly, at the stage of the stabilization processing, the 2 nd non-target sound section represented by u_1(n) and the target sound section represented by u_2(n) are expressed by the following formulas (11) and (12).
[Equation 9]
(Formulas (11) and (12), which define u_1(n) and u_2(n) in terms of the powers p_X(n) and p_Y(n) and the thresholds t_3 and t_4, are given as images in the original publication.)
Here, u_2(n) = 1 indicates that the target sound source 12A emits sound in the n-th frame, and u_2(n) = 0 indicates that it does not. Likewise, u_1(n) = 1 indicates that the non-target sound source 12B emits sound in the n-th frame, and u_1(n) = 0 indicates that it does not.
The thresholds t_3 and t_4 in formulas (11) and (12) may be set in advance so that u_1(n) and u_2(n) express the above conditions.
The detection unit 20C detects the sections with u_2(n) = 1 and u_1(n) = 0 as the target sound section, the sections with u_2(n) = 0 as the non-target sound section, and the sections with u_2(n) = 1 and u_1(n) = 1 as the overlap section in which sound is emitted from both the target sound source 12A and the non-target sound source 12B. The detection unit 20C then outputs the detection result to the correlation derivation unit 20D. In the present embodiment, the detection unit 20C outputs u_1(n) and u_2(n) to the correlation derivation unit 20D as the detection result.
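The frame classification used during the stabilization processing can be sketched as follows. Because formulas (11) and (12) are not reproduced above, the sketch assumes simple power thresholding of p_X(n) and p_Y(n) with the thresholds t_3 and t_4; this specific form is an assumption, and the function name classify_sections is illustrative.

```python
import numpy as np

def classify_sections(Y, X1, t3, t4):
    """Returns (u1, u2): u1[n] = 1 if the non-target source is judged active,
    u2[n] = 1 if the target source is judged active (thresholding form assumed)."""
    p_Y = np.sum(np.abs(Y) ** 2, axis=1)
    p_X = np.sum(np.abs(X1) ** 2, axis=1)
    u1 = (p_X >= t3).astype(int)  # assumed form of formula (11)
    u2 = (p_Y >= t4).astype(int)  # assumed form of formula (12)
    # target section: u2 == 1 and u1 == 0; non-target section: u2 == 0;
    # overlap section (both sources active): u2 == 1 and u1 == 1.
    return u1, u2
```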
The correlation derivation unit 20D derives the 1 st spatial correlation matrix Φ_XX(f, n) and the 2 nd spatial correlation matrix Φ_NN(f, n) from the target sound section and the overlap section detected by the detection unit 20C and from the 1 st sound signal.
The correlation derivation unit 20D treats the sections with u_2(n) = 1 and u_1(n) = 0 as the target sound section, and in those sections derives and updates the 1 st spatial correlation matrix Φ_XX(f, n) using the following formula (13). For these sections, the correlation derivation unit 20D does not derive or update the 2 nd spatial correlation matrix Φ_NN(f, n).
[Equation 10]
Φ_XX(f, n) = α Φ_XX(f, n-1) + (1 - α) X_2^H(f, n) X_2(f, n)   … formula (13)
On the other hand, the correlation derivation unit 20D treats the sections with u_2(n) = 0 as the non-target sound section, and in those sections derives and updates the 2 nd spatial correlation matrix Φ_NN(f, n) using the following formula (14). For these sections, the correlation derivation unit 20D does not derive or update the 1 st spatial correlation matrix Φ_XX(f, n).
[Equation 11]
Φ_NN(f, n) = α Φ_NN(f, n-1) + (1 - α) X_2^H(f, n) X_2(f, n)   … formula (14)
Furthermore, for the sections with u_2(n) = 1 and u_1(n) = 1, the correlation derivation unit 20D derives and updates neither the 1 st spatial correlation matrix Φ_XX(f, n) nor the 2 nd spatial correlation matrix Φ_NN(f, n). As described above, the sections with u_2(n) = 1 and u_1(n) = 1 are the overlap sections in which sound is emitted from both the target sound source 12A and the non-target sound source 12B.
That is, in the stabilization processing, the correlation derivation unit 20D derives and updates neither the 1 st spatial correlation matrix Φ_XX(f, n) nor the 2 nd spatial correlation matrix Φ_NN(f, n) for the overlap sections in which sound is emitted from both the target sound source 12A and the non-target sound source 12B.
In this way, neither spatial correlation matrix is updated in the overlap sections in which the target sound source 12A and the non-target sound source 12B emit sound at the same time. With this configuration, a decrease in the emphasis accuracy of the target sound signal caused by using sections in which both sources emit sound simultaneously can be suppressed.
Next, the procedure of the audio signal processing performed by the audio signal processing device 10 of the present embodiment will be described.
Fig. 3 is a flowchart showing an example of the procedure of the audio signal processing executed by the audio signal processing device 10 according to the present embodiment.
The conversion unit 20B performs a short-time Fourier transform on the 3 rd sound signals received from the 1 st microphones 14 to obtain the 1 st sound signal represented by the spectrum X_2(f, n) (step S100). The conversion unit 20B outputs the acquired 1 st sound signal to the correlation derivation unit 20D and the generation unit 20H (step S102).
Next, the conversion unit 20A performs a short-time Fourier transform on the 2 nd sound signal received from the 2 nd microphone 16 to obtain the 2 nd sound signal represented by the spectrum X_1(f, n) (step S104). The conversion unit 20A outputs the acquired 2 nd sound signal to the detection unit 20C (step S106).
The processing in steps S100 to S106 may be executed in parallel by the conversion unit 20A and the conversion unit 20B, and is not limited to the order shown in FIG. 3. The processing in steps S100 to S106 is repeated continuously until the sound signal processing ends.
Then, the audio signal processing apparatus 10 executes initial processing (step S108 to step S120).
Specifically, the coefficient derivation unit 20G first reads the 1 st spatial correlation matrix Φ_XX(f, n) and the 2 nd spatial correlation matrix Φ_NN(f, n) from the 1 st correlation storage unit 20E and the 2 nd correlation storage unit 20F (step S108). As described above, in the initial state the 1 st spatial correlation matrix Φ_XX(f, n) and the 2 nd spatial correlation matrix Φ_NN(f, n) are initialized with the zero matrix.
Next, the coefficient derivation unit 20G derives the spatial filter coefficient F(f, n) using the 1 st spatial correlation matrix Φ_XX(f, n) and the 2 nd spatial correlation matrix Φ_NN(f, n) read out in step S108 (step S110). As described above, in the initial state the coefficient derivation unit 20G derives F(f, n) = [0, 0, 0, 1] as the spatial filter coefficient F(f, n).
Next, the generation unit 20H uses the spatial filter coefficient F(f, n) derived in step S110 to generate the emphasized sound signal in which the target sound signal included in the 1 st sound signal represented by the spectrum X_2(f, n) is emphasized (step S112).
Next, the inverse transform unit 20I performs an inverse short-time Fourier transform on the emphasized sound signal represented by the output spectrum Y(f, n) generated in step S112, and outputs the result to the output unit 22 (step S114).
Next, the detection unit 20C detects the target sound section represented by the function u_2(n) using the emphasized sound signal generated in step S112 and the 2 nd sound signal (step S116).
Next, the correlation derivation unit 20D derives the 1 st spatial correlation matrix Φ_XX(f, n) and the 2 nd spatial correlation matrix Φ_NN(f, n) using the target sound section detected in step S116 and the 1 st sound signal (step S118).
Next, the correlation derivation unit 20D stores the 1 st spatial correlation matrix Φ_XX(f, n) and the 2 nd spatial correlation matrix Φ_NN(f, n) derived in step S118 in the 1 st correlation storage unit 20E and the 2 nd correlation storage unit 20F, respectively, thereby updating them (step S120).
Next, the audio signal processing unit 20 determines whether to shift from the initial processing to the stabilization processing (step S122). For example, the audio signal processing unit 20 determines whether to shift to the stabilization processing by determining whether the initial processing has been executed for a predetermined time. Alternatively, the audio signal processing unit 20 may determine whether to shift to the stabilization processing by determining whether the 1 st spatial correlation matrix Φ_XX(f, n) and the 2 nd spatial correlation matrix Φ_NN(f, n) have been updated a predetermined number of times.
When the determination in step S122 is negative (step S122: NO), the process returns to step S108. On the other hand, if the determination in step S122 is affirmative (step S122: yes), the audio signal processing unit 20 performs the stabilization process (step S124 to step S138).
In the stabilization processing, the coefficient derivation unit 20G reads the 1 st spatial correlation matrix Φ_XX(f, n) and the 2 nd spatial correlation matrix Φ_NN(f, n) from the 1 st correlation storage unit 20E and the 2 nd correlation storage unit 20F (step S124). That is, the coefficient derivation unit 20G reads the latest 1 st spatial correlation matrix Φ_XX(f, n) and 2 nd spatial correlation matrix Φ_NN(f, n) updated by the correlation derivation unit 20D.
Next, the coefficient derivation unit 20G derives the spatial filter coefficient F(f, n) from the 1 st spatial correlation matrix Φ_XX(f, n) and the 2 nd spatial correlation matrix Φ_NN(f, n) read out in step S124 (step S126).
Next, the generation unit 20H emphasizes the target sound signal included in the 1 st sound signal received from the conversion unit 20B using the spatial filter coefficient F(f, n) derived in step S126, and generates the emphasized sound signal (step S128).
Next, the inverse transform unit 20I performs an inverse short-time Fourier transform on the emphasized sound signal generated in step S128, and outputs the result to the output unit 22 (step S130).
Next, the detection unit 20C detects the target sound section and the overlap section using the 2 nd sound signal and the emphasized sound signal generated in step S128 (step S132).
Next, the correlation derivation unit 20D derives the 1 st spatial correlation matrix Φ_XX(f, n) and the 2 nd spatial correlation matrix Φ_NN(f, n) from the target sound section and the overlap section detected in step S132 and from the 1 st sound signal received from the 1 st microphones 14 via the conversion unit 20B and the AD conversion unit 18 (step S134). The correlation derivation unit 20D then stores the derived 1 st spatial correlation matrix Φ_XX(f, n) and 2 nd spatial correlation matrix Φ_NN(f, n) in the 1 st correlation storage unit 20E and the 2 nd correlation storage unit 20F, thereby updating them (step S136).
Next, the audio signal processing unit 20 determines whether or not to end the audio signal processing (step S138). When the determination in step S138 is negative (step S138: NO), the process returns to step S124. When the determination in step S138 is affirmative (step S138: YES), the present routine is ended.
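Putting the flowchart together, the per-frame loop might look like the following sketch. It reuses derive_max_snr_filter and update_correlation from the earlier sketches; those names, the thresholds, and the assumed form of the stabilization-stage section classification are illustrative assumptions, not part of the claimed method.

```python
import numpy as np

def process_frame(n, X2_n, X1_n, Phi_XX, Phi_NN, initial, thresholds):
    """One iteration of the loop in FIG. 3 for frame n.
    X2_n: (n_bins, 4) spectrum of the 1st sound signal; X1_n: (n_bins,) 2nd signal."""
    t1, t2, t3, t4 = thresholds
    if initial:
        F = np.zeros((X2_n.shape[0], 4), dtype=complex)
        F[:, 3] = 1.0                                      # step S110: F = [0, 0, 0, 1]
    else:
        F = derive_max_snr_filter(Phi_XX, Phi_NN)          # step S126
    Y_n = np.sum(X2_n * np.conj(F), axis=-1)               # steps S112 / S128
    p_Y = np.sum(np.abs(Y_n) ** 2)
    p_X = np.sum(np.abs(X1_n) ** 2)
    if initial:
        target = (p_Y > t1) and (p_X - p_Y < t2)           # step S116, formula (4)
        overlap = False
    else:
        u1, u2 = (p_X >= t3), (p_Y >= t4)                  # step S132 (assumed form)
        target, overlap = (u2 and not u1), (u2 and u1)
    if not overlap:                                        # steps S118-S120 / S134-S136
        update_correlation(Phi_XX if target else Phi_NN, X2_n)
    return Y_n
```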
As described above, the audio signal processing device 10 of the present embodiment includes the coefficient derivation unit 20G. The coefficient derivation unit 20G derives the spatial filter coefficient F(f, n) for emphasizing the target sound signal included in the 1 st sound signal from the emphasized sound signal in which the target sound signal is emphasized. Therefore, by generating an emphasized sound signal in which the target sound signal is emphasized using the derived spatial filter coefficient F(f, n), the target sound signal can be emphasized with high accuracy.
Here, conventionally, when a plurality of speakers speak simultaneously, the accuracy of emphasizing a target voice may be lowered. For example, a conventional method is known in which a vector representing a speaker direction and a difference in arrival time between microphones is used as a feature amount of a speech signal, and a filter for emphasizing a target speech signal included in the speech signal is generated from the feature amount.
However, in such a conventional system, when a plurality of speakers speak simultaneously, the distribution of the feature amounts differs from the feature amounts of the individual speakers, and the accuracy of the filter for emphasizing the target sound signal may therefore be lowered. In addition, even in a situation where a plurality of speakers speak in turn, sections in which the speakers speak simultaneously occur because of backchannel responses and the like, and the accuracy of the filter for emphasizing the target sound signal may be lowered.
On the other hand, the audio signal processing device 10 according to the present embodiment derives the spatial filter coefficient F(f, n) for emphasizing the target sound signal included in the 1 st sound signal from the emphasized sound signal in which the target sound signal is emphasized. Therefore, by applying the derived spatial filter coefficient F(f, n) to the 1 st sound signal, an emphasized sound signal in which the target sound signal is emphasized is generated, and the target sound signal can be emphasized with high accuracy.
Therefore, the audio signal processing device 10 can emphasize the target audio signal with high accuracy.
In the audio signal processing device 10 according to the present embodiment, the detection unit 20C detects the target sound section from the emphasized sound signal and the 2 nd sound signal, in which the ratio of the power of the non-target sound signal to the power of the target sound signal is larger than in the 1 st sound signal. Therefore, the detection unit 20C can detect the target sound section with high accuracy. The coefficient derivation unit 20G then derives the spatial filter coefficient F(f, n) from the 1 st spatial correlation matrix Φ_XX(f, n) and the 2 nd spatial correlation matrix Φ_NN(f, n) derived from the accurately detected target sound section and the 1 st sound signal.
Therefore, the audio signal processing device 10 can emphasize the target sound signal with higher accuracy.
In the present embodiment, the detection unit 20C detects the target sound section from the 2 nd sound signal and the emphasized sound signal. Therefore, in the sound signal processing device 10 of the present embodiment, the spatial filter coefficient F(f, n) can be derived so as to suppress the non-target sound of the non-target sound source 12B and emphasize the target sound signal of the target sound source 12A, regardless of the positions of the target sound source 12A and the non-target sound source 12B. The audio signal processing device 10 can therefore derive the spatial filter coefficient F(f, n) for emphasizing the target sound signal included in the 1 st sound signal with higher accuracy.
In the present embodiment, the detection unit 20C detects, from the emphasized sound signal, an overlap section in which the target sound overlaps with the non-target sound, and a target sound section. Then, the correlation derivation unit 20D derives the 1st spatial correlation matrix Φ_XX(f, n) and the 2nd spatial correlation matrix Φ_NN(f, n) from the target sound section, the overlap section, and the 1st sound signal. For the overlap section, the correlation derivation unit 20D does not derive and update the 1st spatial correlation matrix Φ_XX(f, n) and the 2nd spatial correlation matrix Φ_NN(f, n). Therefore, the coefficient derivation unit 20G does not derive the spatial filter coefficient F(f, n) for the overlap section. As a result, the audio signal processing device 10 of the present embodiment can emphasize the target sound signal with high accuracy even in a section of the 1st sound signal in which sounds are emitted simultaneously from a plurality of sound sources 12.
< modification 1>
In the above description, the detection unit 20C detects the target sound section and the overlap section from the power of the emphasized sound signal represented by the output spectrum Y(f, n) and the power of the 2nd sound signal represented by the spectrum X_1(f, n).
However, the detection unit 20C may detect the target sound section and the overlap section from the output spectrum Y(f, n) and the spectrum X_1(f, n) by another method.
For example, a model that takes the output spectrum Y(f, n) and the spectrum X_1(f, n) as input and estimates the functions u_1(n) and u_2(n) may be learned by a decision tree, the k-nearest-neighbor method, a support vector machine, a neural network, or the like.
As an example, learning of a model using a neural network is explained.
In this case, the detection unit 20C collects learning data for learning the model. For example, the audio signal processing unit 20 of the present embodiment is mounted on a learning device, and the audio signal processing unit 20 performs the above-described processing, thereby recording a plurality of pieces of learning data each including the spectrum X_1(f, n) and the output spectrum Y(f, n) derived from the spectrum X_1(f, n). At the same time, the 1st microphone 14D collects and records the target sound of the target sound source 12A. Then, the user listens to the target sound and visually checks the waveform of the target sound to determine which sound source 12 emits sound in each frame, thereby creating the correct data c_1(n) and c_2(n) for the functions u_1(n) and u_2(n).
The detection unit 20C uses a vector v (n) represented by the following expression (15) as an input feature amount.
[ equation 12 ]
v(n) = [log|Y(0, n)|, …, log|Y(128, n)|, log|X_1(0, n)|, …, log|X_1(128, n)|, log|Y(0, n−1)|, …, log|Y(128, n−1)|, log|X_1(0, n−1)|, …, log|X_1(128, n−1)|]  … formula (15)
The vector v(n) represented by formula (15) is a 516-dimensional vector obtained by concatenating the logarithms of the absolute values of the spectra of the current frame and the immediately preceding frame. In the detection of the target sound section and the overlap section, the problem is formulated as estimating, from the vector v(n), the two-dimensional vector c(n) = [c_1(n), c_2(n)] representing the correct data.
Here, the structure of the neural network model is defined by the following equations (16) to (20).
[ equation 13 ]
Input layer: h_1(n) = sigmoid(W_i v(n)^T)  … formula (16)
Intermediate layer 1: h_2(n) = sigmoid(W_1 h_1(n))  … formula (17)
Intermediate layer 2: h_3(n) = sigmoid(W_2 h_2(n))  … formula (18)
Intermediate layer 3: h_4(n) = sigmoid(W_3 h_3(n))  … formula (19)
Output layer: u(n) = [u_1(n), u_2(n)]^T = sigmoid(W_o h_4(n))  … formula (20)
When the number of nodes in each intermediate layer is set to 100, the sizes of the matrix W_i and the matrix W_o are 100 × 516 and 2 × 100, respectively. The matrices W_1, W_2, and W_3 are then all 100 × 100.
In equations (16) to (20), the function sigmoid () represents an operation of applying a sigmoid function represented by the following equation (21) to each element of the vector.
[ equation 14 ]
sigmoid(x) = 1 / (1 + e^(−x))  … formula (21)
Then, the objective function L is defined using the cross entropy represented by the following equation (22).
[ equation 15 ]
L = Σ_n Σ_{i=1}^{2} { c_i(n) log u_i(n) + (1 − c_i(n)) log(1 − u_i(n)) }  … formula (22)
Then, the detection unit 20C obtains, by learning, the parameters W_i, W_o, W_1, W_2, and W_3 that maximize the objective function L.
As a learning method, an existing method such as stochastic gradient descent may be used. The functions u_1(n) and u_2(n) derived using the model take continuous values between 0 and 1. Therefore, for example, 0.5 may be used as a threshold: values greater than the threshold are binarized to 1 (target sound section), and values smaller than the threshold are binarized to 0 (non-target sound section).
In this way, the detection unit 20C may detect the target sound section and the overlap section from the output spectrum Y(f, n) and the spectrum X_1(f, n) by a method different from that of embodiment 1.
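As a reference, the following is a minimal NumPy sketch of the forward pass of formulas (16) to (21) and the 0.5-threshold binarization described above. The weight shapes follow the text; the function names, and the assumption that trained weights are already available, are illustrative and not part of the embodiment.

```python
import numpy as np

def sigmoid(x):
    # formula (21), applied element-wise
    return 1.0 / (1.0 + np.exp(-x))

def detect_sections(v, W_i, W_1, W_2, W_3, W_o, threshold=0.5):
    """Forward pass of formulas (16)-(20) for one frame.

    v   : 516-dimensional input feature vector v(n)
    W_i : 100 x 516, W_1/W_2/W_3 : 100 x 100, W_o : 2 x 100
    Returns u_1(n), u_2(n) binarized to {0, 1} with the 0.5 threshold.
    """
    h1 = sigmoid(W_i @ v)   # input layer, formula (16)
    h2 = sigmoid(W_1 @ h1)  # intermediate layer 1, formula (17)
    h3 = sigmoid(W_2 @ h2)  # intermediate layer 2, formula (18)
    h4 = sigmoid(W_3 @ h3)  # intermediate layer 3, formula (19)
    u = sigmoid(W_o @ h4)   # output layer, formula (20)
    u1, u2 = (u > threshold).astype(int)
    return u1, u2
```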
(embodiment 2)
In the present embodiment, a description will be given of a method of performing audio signal processing using the 1 st audio signal acquired from the 1 st microphone 14 without using the 2 nd audio signal acquired from the 2 nd microphone 16.
Fig. 4 is a schematic diagram showing an example of the audio signal processing system 2 according to the present embodiment.
The audio signal processing system 2 includes an audio signal processing device 11 and a plurality of 1st microphones 14. The sound signal processing apparatus 11 and the plurality of 1st microphones 14 are connected so as to exchange data and signals.
That is, the audio signal processing system 2 is the same as the audio signal processing system 1 according to embodiment 1 except that the audio signal processing device 11 is provided instead of the audio signal processing device 10 and the 2 nd microphone 16 is not provided.
In the present embodiment, a plurality of target sound sources 12A are assumed as the sound sources 12 in the audio signal processing system 2. In fig. 4, three speakers, the target sound sources 12A1 to 12A3, are shown as an example of the plurality of target sound sources 12A. The target sound source 12A is, for example, a person (speaker). In the present embodiment, an environment is assumed in which one speaker (the target sound source 12A1, the target sound source 12A2, or the target sound source 12A3) is seated on each of three sides of a rectangular table T and the speakers converse. In the present embodiment, it is also assumed that the positions of the plurality of target sound sources 12A do not move significantly during the sound signal processing performed by the sound signal processing device 11. The number of target sound sources 12A is not limited to 3, and may be 2, or 4 or more.
As in embodiment 1, the audio signal processing system 2 includes a plurality of 1st microphones 14. In the present embodiment, four 1st microphones 14, namely the 1st microphone 14A to the 1st microphone 14D, are shown as an example.
As in embodiment 1, the sound arrival time differences of the plurality of 1 st microphones 14 from the respective plurality of target sound sources 12A are different from each other. That is, the plurality of 1 st microphones 14 are arranged in advance so that the sound arrival time differences are different from each other.
The number of the 1 st microphones 14 provided in the audio signal processing system 2 may be equal to or greater than the number of the sound sources 12 in the present embodiment. Therefore, in the present embodiment, the number of the 1 st microphones 14 may be 3 or more. The greater the number of 1 st microphones 14, the more the emphasis accuracy of the target sound can be improved.
As an example, a mode in which the audio signal processing system 2 includes four 1st microphones 14 (the 1st microphone 14A to the 1st microphone 14D) will be described.
As in embodiment 1, a 3rd sound signal is output from each of the 1st microphones 14, so a plurality of 3rd sound signals are output to the audio signal processing device 11. As in embodiment 1, the sound signal obtained by collecting the plurality of 3rd sound signals into one is referred to as the 1st sound signal.
The audio signal processing device 11 includes an AD converter 18, an audio signal processor 30, and an output unit 22. The AD converter 18 and the output unit 22 are the same as those of embodiment 1. The audio signal processing device 11 is the same as that of embodiment 1 except that an audio signal processing unit 30 is provided instead of the audio signal processing unit 20. The audio signal processing device 11 may be configured to include at least the audio signal processing unit 30, and at least one of the AD conversion unit 18 and the output unit 22 may be configured independently.
The audio signal processing unit 30 receives a plurality of 3 rd audio signals via the AD conversion unit 18. The audio signal processing unit 30 emphasizes a target audio signal included in the 1 st audio signal obtained by collecting the received 3 rd audio signals into 1, and outputs the emphasized audio signal to the output unit 22.
The audio signal processing unit 30 will be described in detail.
Fig. 5 is a schematic diagram showing an example of the functional configuration of the audio signal processing unit 30.
The audio signal processing unit 30 includes a conversion unit 30B, a separation unit 30J, a detection unit 30C, a correlation derivation unit 30D, a plurality of 3rd correlation storage units 30E, a 4th correlation storage unit 30F, a plurality of addition units 30K, a plurality of coefficient derivation units 30G, a plurality of generation units 30H, and a plurality of inverse transform units 30I.
The conversion unit 30B, the separation unit 30J, the detection unit 30C, the correlation derivation unit 30D, the plurality of coefficient derivation units 30G, the plurality of addition units 30K, the plurality of generation units 30H, and the plurality of inverse conversion units 30I are realized by, for example, 1 or more processors. For example, the above-described respective sections may be realized by software, which is a program executed by a processor such as a CPU. The above-described parts may be realized by a processor such as a dedicated IC, that is, hardware. The above-described parts may be implemented by using software together with hardware. In the case of using a plurality of processors, each processor may realize 1 of the respective sections, or may realize 2 or more of the respective sections.
The 3 rd correlation storage unit 30E and the 4 th correlation storage unit 30F store various kinds of information. The 3 rd and 4 th correlation storage units 30E and 30F can be configured by any commonly used storage medium such as an HDD, an optical disk, a memory card, and a RAM. The 3 rd and 4 th correlation storage units 30E and 30F may be different storage media or different storage areas of the same storage medium. Further, the 3 rd correlation storage unit 30E and the 4 th correlation storage unit 30F may be realized by a plurality of physically different storage media.
The audio signal processing unit 30 includes a 3rd correlation storage unit 30E, a coefficient derivation unit 30G, an addition unit 30K, a generation unit 30H, and an inverse transform unit 30I corresponding to each of the plurality of target sound sources 12A. As described above, in the present embodiment, three target sound sources 12A (the target sound source 12A1 to the target sound source 12A3) are assumed.
Therefore, in the present embodiment, the audio signal processing unit 30 includes three 3rd correlation storage units 30E (the 3rd correlation storage unit 30E1 to the 3rd correlation storage unit 30E3), three coefficient derivation units 30G (the coefficient derivation unit 30G1 to the coefficient derivation unit 30G3), three addition units 30K (the addition unit 30K1 to the addition unit 30K3), three generation units 30H (the generation unit 30H1 to the generation unit 30H3), and three inverse transform units 30I (the inverse transform unit 30I1 to the inverse transform unit 30I3).
The number of target sound sources 12A assumed in the sound signal processing system 2 is not limited to 3. For example, the number of target sound sources 12A assumed in the sound signal processing system 2 may be 1, 2, or 4 or more. The audio signal processing unit 30 may simply include the same number of 3rd correlation storage units 30E, coefficient derivation units 30G, addition units 30K, generation units 30H, and inverse transform units 30I as the number of target sound sources 12A.
The conversion unit 30B performs a short-time Fourier transform (STFT) on each of the 3rd sound signals received from the 1st microphones 14 (the 1st microphone 14A to the 1st microphone 14D) via the AD conversion unit 18, in the same manner as the conversion unit 20B of embodiment 1, and generates the spectra X_1(f, n), X_2(f, n), X_3(f, n), and X_4(f, n) representing the respective 3rd sound signals.
The spectrum X_1(f, n) is obtained by performing a short-time Fourier transform on the 3rd sound signal received from the 1st microphone 14A. The spectrum X_2(f, n) is obtained by performing a short-time Fourier transform on the 3rd sound signal received from the 1st microphone 14B. The spectrum X_3(f, n) is obtained by performing a short-time Fourier transform on the 3rd sound signal received from the 1st microphone 14C. The spectrum X_4(f, n) is obtained by performing a short-time Fourier transform on the 3rd sound signal received from the 1st microphone 14D.
In the present embodiment, a multidimensional vector (4-dimensional vector in the present embodiment) in which the plurality of frequency spectra representing the respective 3 rd audio signals are collected is referred to as a frequency spectrum X (f, n) representing the 1 st audio signal. In other words, in the present embodiment, the 1 st sound signal is represented by the frequency spectrum X (f, n). The frequency spectrum X (f, n) representing the 1 st sound signal is represented by the following equation (23).
[ equation 16 ]
X(f, n) = [X_1(f, n), X_2(f, n), X_3(f, n), X_4(f, n)]  … formula (23)
The converter 30B outputs the frequency spectrum X (f, n) indicating the 1 st audio signal to each of the separator 30J and the plurality of generators 30H (the generators 30H1 to 30H 3).
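For illustration, a minimal sketch of how the spectrum X(f, n) of formula (23) can be assembled from the four 3rd sound signals is shown below. The 256-point frame length matches the spectrum size mentioned later for the inverse transform; the 128-sample frame shift and the Hann analysis window are assumptions, not values fixed by this passage.

```python
import numpy as np

def stft_multichannel(signals, frame_len=256, frame_shift=128):
    """Stack per-channel STFTs into X(f, n) = [X_1(f,n), ..., X_4(f,n)].

    signals: list of four one-dimensional arrays (the 3rd sound signals).
    Returns an array of shape (num_frames, frame_len // 2 + 1, 4).
    """
    window = np.hanning(frame_len)
    num_frames = (len(signals[0]) - frame_len) // frame_shift + 1
    X = np.empty((num_frames, frame_len // 2 + 1, len(signals)), dtype=complex)
    for ch, x in enumerate(signals):
        for n in range(num_frames):
            frame = x[n * frame_shift : n * frame_shift + frame_len] * window
            X[n, :, ch] = np.fft.rfft(frame)  # 129 frequency bins per frame
    return X
```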
The 3 rd correlation storage section 30E stores the 3 rd spatial correlation matrix. The 3 rd spatial correlation matrix represents a spatial correlation matrix of the destination sound component in the 1 st sound signal.
As described above, the audio signal processing unit 30 includes the 3 rd correlation storage units 30E (the 3 rd correlation storage unit 30E1 to the 3 rd correlation storage unit 30E 3) corresponding to the respective sound sources of the plurality of target sound sources 12A.
The 3rd correlation storage unit 30E1 is the 3rd correlation storage unit 30E corresponding to the target sound source 12A1. The 3rd correlation storage unit 30E1 stores the 3rd spatial correlation matrix Φ_XXa(f, n). The 3rd spatial correlation matrix Φ_XXa(f, n) is a spatial correlation matrix of the target sound component of the target sound source 12A1 in the 1st sound signal. The target sound component of the target sound source 12A1 is the component (i.e., spectrum) of the target sound emitted from the target sound source 12A1 included in the 1st sound signal. The target sound component is separated from the 1st sound signal by the separation unit 30J (described in detail later).
As described above, in the present embodiment, the 1st sound signal is represented by the spectrum X(f, n), which is a 4-dimensional vector. Therefore, the 3rd spatial correlation matrix Φ_XXa(f, n) is represented by a 4 × 4 complex matrix for each frequency bin.
Similarly, the 3rd correlation storage unit 30E2 is the 3rd correlation storage unit 30E corresponding to the target sound source 12A2. The 3rd correlation storage unit 30E2 stores the 3rd spatial correlation matrix Φ_XXb(f, n). The 3rd spatial correlation matrix Φ_XXb(f, n) is a spatial correlation matrix of the target sound component of the target sound source 12A2 in the 1st sound signal. The target sound component of the target sound source 12A2 is the component of the target sound emitted from the target sound source 12A2 included in the 1st sound signal. Like the 3rd spatial correlation matrix Φ_XXa(f, n), the 3rd spatial correlation matrix Φ_XXb(f, n) is represented by a 4 × 4 complex matrix for each frequency bin.
The 3rd correlation storage unit 30E3 is the 3rd correlation storage unit 30E corresponding to the target sound source 12A3. The 3rd correlation storage unit 30E3 stores the 3rd spatial correlation matrix Φ_XXc(f, n). The 3rd spatial correlation matrix Φ_XXc(f, n) is a spatial correlation matrix of the target sound component of the target sound source 12A3 in the 1st sound signal. The target sound component of the target sound source 12A3 is the component of the target sound emitted from the target sound source 12A3 included in the 1st sound signal. The 3rd spatial correlation matrix Φ_XXc(f, n) is represented by a 4 × 4 complex matrix for each frequency bin.
The 4th correlation storage unit 30F stores the 4th spatial correlation matrix Φ_NN(f, n). The 4th spatial correlation matrix Φ_NN(f, n) is a spatial correlation matrix of the non-target sound component in the 1st sound signal. The non-target sound component is the component of the 1st sound signal other than the components of the target sounds emitted from the respective target sound sources 12A (the target sound source 12A1 to the target sound source 12A3). The non-target sound component is separated from the 1st sound signal by the separation unit 30J (described in detail later).
In the initial state, the 4th correlation storage unit 30F stores a zero matrix in advance as the initial value of the 4th spatial correlation matrix Φ_NN(f, n).
On the other hand, in the initial state, the 3rd correlation storage unit 30E1, the 3rd correlation storage unit 30E2, and the 3rd correlation storage unit 30E3 store in advance, as initial values, the 3rd spatial correlation matrices Φ_XXa(f, n), Φ_XXb(f, n), and Φ_XXc(f, n) representing the spatial correlation matrices of the target sounds emitted at the positions of the target sound source 12A1, the target sound source 12A2, and the target sound source 12A3, respectively.
The initial values of the 3rd spatial correlation matrices Φ_XXa(f, n), Φ_XXb(f, n), and Φ_XXc(f, n) may be obtained in advance by simulation, based on the arrangement of the plurality of 1st microphones 14 (the 1st microphone 14A to the 1st microphone 14D) and the positions of the plurality of target sound sources 12A (the target sound source 12A1 to the target sound source 12A3). Alternatively, the initial values of the 3rd spatial correlation matrices Φ_XXa(f, n), Φ_XXb(f, n), and Φ_XXc(f, n) may be derived in advance from target sound signals obtained by collecting, with the plurality of 1st microphones 14 (the 1st microphone 14A to the 1st microphone 14D), target sounds emitted in advance from the positions of the respective target sound sources 12A (the target sound source 12A1 to the target sound source 12A3).
Specifically, the sound signal processing unit 30 may derive the initial values of the 3 rd spatial correlation matrices in advance from target sound signals obtained by collecting target sounds emitted from the positions of the target sound sources 12A1 to 12A3 by the plurality of 1 st microphones 14 (1 st microphone 14A to 1 st microphone 14D) on the table T.
For example, assume that loudspeakers are placed at the positions of the target sound sources 12A1 to 12A3 and reproduce white noise, and let Na(f, n), Nb(f, n), and Nc(f, n) denote the 4-dimensional vectors representing the spectra of the sounds collected by the plurality of 1st microphones 14 (the 1st microphone 14A to the 1st microphone 14D) for each position. In this case, the audio signal processing unit 30 may derive the initial values of the 3rd spatial correlation matrices in advance using the following formulas (24) to (26) and store them in advance in the 3rd correlation storage unit 30E1 to the 3rd correlation storage unit 30E3.
[ equation 17 ]
Φ_XXa(f, 0) = (1/N) Σ_{n=1}^{N} Na^H(f, n) Na(f, n)  … formula (24)
Φ_XXb(f, 0) = (1/N) Σ_{n=1}^{N} Nb^H(f, n) Nb(f, n)  … formula (25)
Φ_XXc(f, 0) = (1/N) Σ_{n=1}^{N} Nc^H(f, n) Nc(f, n)  … formula (26)
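A sketch of deriving one of these initial values from a white-noise recording is shown below. Averaging the outer products N^H(f, n) N(f, n) over frames is an assumed reading of formulas (24) to (26), and the function name is illustrative.

```python
import numpy as np

def initial_spatial_correlation(N_rec):
    """Average outer products N^H N over frames for each frequency bin.

    N_rec: complex array of shape (num_frames, num_bins, 4), e.g. Na(f, n).
    Returns an array of shape (num_bins, 4, 4): one 4 x 4 matrix per bin.
    """
    num_frames, num_bins, num_ch = N_rec.shape
    phi = np.zeros((num_bins, num_ch, num_ch), dtype=complex)
    for n in range(num_frames):
        for f in range(num_bins):
            v = N_rec[n, f, :][np.newaxis, :]  # row vector (1 x 4)
            phi[f] += v.conj().T @ v           # N^H N, a 4 x 4 matrix
    return phi / num_frames

# phi_xxa = initial_spatial_correlation(Na)  # stored in 30E1, and so on
```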
Next, the adder 30K1, the coefficient derivation unit 30G1, the generation unit 30H1, and the inverse transform unit 30I1 corresponding to the target sound source 12A1 will be described.
The adder 30K1 is the adder 30K corresponding to the target sound source 12A1. The adder 30K1 outputs, to the coefficient derivation unit 30G1, the sum of the 3rd spatial correlation matrices of the target sound sources 12A other than the corresponding target sound source 12A1 (the 3rd spatial correlation matrices Φ_XXb(f, n) and Φ_XXc(f, n) of the target sound source 12A2 and the target sound source 12A3) and the 4th spatial correlation matrix Φ_NN(f, n). Specifically, the adder 30K1 derives the sum Φ_SS(f, n) of the spatial correlation matrices by the following formula (27) and outputs it to the coefficient derivation unit 30G1.
[ equation 18 ]
Φ_SS(f, n) = Φ_XXb(f, n) + Φ_XXc(f, n) + Φ_NN(f, n)  … formula (27)
The coefficient derivation unit 30G1 is the coefficient derivation unit 30G corresponding to the target sound source 12A1. The coefficient derivation unit 30G1 derives the spatial filter coefficient F_a(f, n) for emphasizing the target sound signal of the corresponding target sound source 12A1 included in the 1st sound signal. Specifically, the coefficient derivation unit 30G1 derives the spatial filter coefficient F_a(f, n) from the 3rd spatial correlation matrix Φ_XXa(f, n) and the 4th spatial correlation matrix Φ_NN(f, n).
More specifically, the coefficient derivation unit 30G1 derives the eigenvector F_SNR(f, n) corresponding to the maximum eigenvalue of the matrix given by the product of the inverse matrix of the sum Φ_SS(f, n) of the spatial correlation matrices and the 3rd spatial correlation matrix Φ_XXa(f, n).
Then, the coefficient derivation unit 30G1 derives the eigenvector F_SNR(f, n) as the spatial filter coefficient F_a(f, n) corresponding to the target sound source 12A1 (F_a(f, n) = F_SNR(f, n)). Note that the coefficient derivation unit 30G1 may derive the spatial filter coefficient F_a(f, n) by further applying a post-filter w(f, n) as in embodiment 1.
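For one frequency bin, the processing of the addition unit 30K1 and the coefficient derivation unit 30G1 can be sketched as follows. The use of scipy.linalg.eig for the generalized eigenvalue problem, and the function name, are implementation assumptions, not part of the embodiment.

```python
import numpy as np
from scipy.linalg import eig

def spatial_filter_coefficient(phi_xxa, phi_xxb, phi_xxc, phi_nn):
    """Max-SNR filter F_a(f, n) for one frequency bin (all inputs 4 x 4)."""
    phi_ss = phi_xxb + phi_xxc + phi_nn          # formula (27)
    # Generalized eigenvalue problem  phi_xxa v = lambda * phi_ss v,
    # i.e. the eigenvectors of phi_ss^{-1} phi_xxa.
    eigvals, eigvecs = eig(phi_xxa, phi_ss)
    f_snr = eigvecs[:, np.argmax(eigvals.real)]  # eigenvector of the max eigenvalue
    return f_snr                                 # F_a(f, n) = F_SNR(f, n)
```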
The generation unit 30H1 is the generation unit 30H corresponding to the target sound source 12A1. The generation unit 30H1 uses the spatial filter coefficient F_a(f, n) derived by the coefficient derivation unit 30G1 to generate an emphasized sound signal in which the target sound signal of the target sound source 12A1 included in the 1st sound signal represented by the spectrum X(f, n) is emphasized.
Specifically, the generation unit 30H1 generates the emphasized sound signal represented by the output spectrum Y_a(f, n) using the following formula (28). The emphasized sound signal represented by the output spectrum Y_a(f, n) is a sound signal in which the target sound signal of the target sound source 12A1 in the 1st sound signal is emphasized.
[ equation 19 ]
Y_a(f, n) = X(f, n) F_a(f, n)^H  … formula (28)
That is, the generation unit 30H1 generates, as the output spectrum Y_a(f, n) representing the emphasized sound signal, the product of the spectrum X(f, n) and the Hermitian transpose of the spatial filter coefficient F_a(f, n).
The generation unit 30H1 outputs the emphasized sound signal represented by the output spectrum Y_a(f, n) to the inverse transform unit 30I1 and the detection unit 30C. That is, the generation unit 30H1 outputs the emphasized sound signal in which the target sound signal of the target sound source 12A1 is emphasized to the inverse transform unit 30I1 and the detection unit 30C.
The inverse transform unit 30I1 is the inverse transform unit 30I corresponding to the target sound source 12A1. In the same manner as the inverse transform unit 20I of embodiment 1, the inverse transform unit 30I1 uses the symmetry of the output spectrum Y_a(f, n) representing the emphasized sound signal to generate a 256-point spectrum from the output spectrum Y_a(f, n) and applies an inverse Fourier transform to it. Next, the inverse transform unit 30I1 applies a synthesis window function and overlaps the result with the output waveform of the previous frame, shifted by the frame shift amount, to generate a sound waveform. Then, the inverse transform unit 30I1 outputs the emphasized sound signal of the target sound source 12A1 represented by the generated sound waveform to the output unit 22.
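A sketch of applying formula (28) and the overlap-add resynthesis of the inverse transform unit 30I1 is shown below, assuming the 256-point frame together with a 128-sample shift and a Hann synthesis window (the latter two are assumptions).

```python
import numpy as np

def apply_filter_and_resynthesize(X, F, frame_len=256, frame_shift=128):
    """X: (num_frames, 129, 4) spectra, F: (num_frames, 129, 4) filter coefficients.

    Computes Y_a(f, n) = X(f, n) F_a(f, n)^H (formula (28)) per bin, then an
    inverse FFT with a synthesis window and overlap-add.
    """
    num_frames = X.shape[0]
    window = np.hanning(frame_len)
    out = np.zeros(frame_shift * (num_frames - 1) + frame_len)
    for n in range(num_frames):
        Y = np.sum(X[n] * F[n].conj(), axis=-1)        # inner product per bin
        frame = np.fft.irfft(Y, n=frame_len) * window  # 256-point waveform
        out[n * frame_shift : n * frame_shift + frame_len] += frame
    return out
```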
Next, the adder 30K2, coefficient derivation unit 30G2, generator 30H2, and inverse transform unit 30I2 corresponding to the target sound source 12A2 will be described. The addition unit 30K3, coefficient derivation unit 30G3, generation unit 30H3, and inverse transform unit 30I3 corresponding to the target sound source 12A3 will be described.
The adder 30K2, the adder 30K3, the coefficient derivation unit 30G2, the coefficient derivation unit 30G3, the generator 30H2, the generator 30H3, the inverse transform unit 30I2, and the inverse transform unit 30I3 perform the same processing as the adder 30K1, the coefficient derivation unit 30G1, the generator 30H1, and the inverse transform unit 30I1, except that the information corresponding to the corresponding target sound source 12A is different.
Specifically, the addition unit 30K2 derives the sum Φ_SS(f, n) of the 3rd spatial correlation matrix Φ_XXa(f, n), the 3rd spatial correlation matrix Φ_XXc(f, n), and the 4th spatial correlation matrix Φ_NN(f, n), and outputs it to the coefficient derivation unit 30G2. The sum Φ_SS(f, n) is given by the following formula (29).
[ equation 20 ]
Φ_SS(f, n) = Φ_XXa(f, n) + Φ_XXc(f, n) + Φ_NN(f, n)  … formula (29)
Then, the coefficient derivation unit 30G2 derives the spatial filter coefficient F_b(f, n) from the 3rd spatial correlation matrix Φ_XXb(f, n) and the sum Φ_SS(f, n) represented by formula (29). The generation unit 30H2 then generates an emphasized sound signal (output spectrum Y_b(f, n)) in which the target sound signal of the target sound source 12A2 is emphasized, and outputs it to the inverse transform unit 30I2 and the detection unit 30C.
The adder 30K3 derives the sum Φ_SS(f, n) of the 3rd spatial correlation matrix Φ_XXa(f, n), the 3rd spatial correlation matrix Φ_XXb(f, n), and the 4th spatial correlation matrix Φ_NN(f, n), and outputs it to the coefficient derivation unit 30G3. The sum Φ_SS(f, n) is given by the following formula (30).
[ equation 21 ]
Φ_SS(f, n) = Φ_XXa(f, n) + Φ_XXb(f, n) + Φ_NN(f, n)  … formula (30)
Then, the coefficient derivation unit 30G3 derives the spatial filter coefficient F_c(f, n) from the 3rd spatial correlation matrix Φ_XXc(f, n) and the sum Φ_SS(f, n) represented by formula (30). The generation unit 30H3 then generates an emphasized sound signal (output spectrum Y_c(f, n)) in which the target sound signal of the target sound source 12A3 is emphasized, and outputs it to the inverse transform unit 30I3 and the detection unit 30C.
Next, the detection unit 30C will be described. The detection unit 30C detects a target audio segment from the emphasized audio signal. In the present embodiment, the detection unit 30C detects a target sound segment of a target sound emitted from each of the plurality of target sound sources 12A using a plurality of emphasized sound signals corresponding to each of the plurality of target sound sources 12A (target sound sources 12A1 to 12A 3).
Specifically, the detection unit 30C receives from the generation unit 30H1 the emphasized sound signal, represented by the output spectrum Y_a(f, n), in which the target sound signal of the target sound source 12A1 is emphasized. The detection unit 30C receives from the generation unit 30H2 the emphasized sound signal, represented by the output spectrum Y_b(f, n), in which the target sound signal of the target sound source 12A2 is emphasized. The detection unit 30C receives from the generation unit 30H3 the emphasized sound signal, represented by the output spectrum Y_c(f, n), in which the target sound signal of the target sound source 12A3 is emphasized.
Then, the detection unit 30C detects the target sound section of each of the target sound sources 12A1 to 12A3 from these emphasized sound signals (the output spectra Y_a(f, n), Y_b(f, n), and Y_c(f, n)).
As in embodiment 1, a target sound section is expressed by a function u(n) indicating, for each frame number, whether the target sound source 12A is emitting sound. In the present embodiment, the functions u_a(n), u_b(n), and u_c(n) indicate the target sound sections of the target sounds of the target sound sources 12A1 to 12A3, respectively. A value of "1" indicates that the corresponding target sound source 12A emits sound in the n-th frame, and a value of "0" indicates that it does not.
The detection unit 30C detects the target sound section of the target sound of each target sound source 12A by obtaining the functions u_a(n), u_b(n), and u_c(n) through the threshold processing represented by the following formulas (31) to (33).
[ equation 22 ]
u_a(n) = 1 if P_a(n) > t, and u_a(n) = 0 otherwise  … formula (31)
u_b(n) = 1 if P_b(n) > t, and u_b(n) = 0 otherwise  … formula (32)
u_c(n) = 1 if P_c(n) > t, and u_c(n) = 0 otherwise  … formula (33)
In the above formulas (31) to (33), t is a threshold representing the power at the boundary between the target sound and the non-target sound. P_a, P_b, and P_c in formulas (31) to (33) are given by the following formulas (34) to (36), respectively.
[ equation 23 ]
P_a(n) = Σ_f |Y_a(f, n)|^2  … formula (34)
P_b(n) = Σ_f |Y_b(f, n)|^2  … formula (35)
P_c(n) = Σ_f |Y_c(f, n)|^2  … formula (36)
The detection unit 30C outputs the detection result of the target sound segment of the target sound of each of the plurality of target sound sources 12A (the target sound source 12A1 to the target sound source 12A 3) to the correlation derivation unit 30D.
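A sketch of the threshold processing of formulas (31) to (36) as reconstructed above is shown below, assuming that P_a(n), P_b(n), and P_c(n) are per-frame powers summed over frequency bins and that t is a fixed power threshold.

```python
import numpy as np

def detect_target_sections(Y_a, Y_b, Y_c, t):
    """Y_*: complex arrays of shape (num_frames, num_bins).

    Returns u_a(n), u_b(n), u_c(n) as 0/1 arrays (formulas (31)-(33)).
    """
    P_a = np.sum(np.abs(Y_a) ** 2, axis=1)  # formula (34)
    P_b = np.sum(np.abs(Y_b) ** 2, axis=1)  # formula (35)
    P_c = np.sum(np.abs(Y_c) ** 2, axis=1)  # formula (36)
    u_a = (P_a > t).astype(int)             # formula (31)
    u_b = (P_b > t).astype(int)             # formula (32)
    u_c = (P_c > t).astype(int)             # formula (33)
    return u_a, u_b, u_c
```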
Next, the separation portion 30J will be explained. The separation section 30J separates the 1 st sound signal into a target sound component and a non-target sound component.
The separation unit 30J receives the spectrum X(f, n) representing the 1st sound signal from the conversion unit 30B. As described above, in the present embodiment, the spectrum X(f, n) representing the 1st sound signal is represented by the above formula (23). In the present embodiment, the spectrum X(f, n) is a 4-dimensional vector that collects the spectra of the four 3rd sound signals received from the four 1st microphones 14 (the 1st microphone 14A to the 1st microphone 14D).
The separation unit 30J separates the 1st sound signal represented by the spectrum X(f, n) into a target sound component S(f, n) and a non-target sound component N(f, n). The target sound component S(f, n) is represented by the following formula (37). The non-target sound component N(f, n) is represented by the following formula (38).
[ equation 24 ]
S(f, n) = [S_1(f, n), S_2(f, n), S_3(f, n), S_4(f, n)]  … formula (37)
N(f, n) = [N_1(f, n), N_2(f, n), N_3(f, n), N_4(f, n)]  … formula (38)
Then, using a known voice activity detection technique, the separation unit 30J sets S(f, n) = X(f, n) and N(f, n) = [0, 0, 0, 0] for all frequencies f when the n-th frame is a target sound section. The separation unit 30J sets S(f, n) = [0, 0, 0, 0] and N(f, n) = X(f, n) for all frequencies f when the n-th frame is a non-target sound section.
Then, the separation unit 30J outputs the target sound component S(f, n) and the non-target sound component N(f, n) separated from the 1st sound signal to the correlation derivation unit 30D.
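A sketch of this frame-wise separation is shown below, assuming a hypothetical helper is_target_section(n) that stands in for the known voice activity detection technique.

```python
import numpy as np

def separate(X, is_target_section):
    """X: (num_frames, num_bins, 4) spectra of the 1st sound signal.

    Returns S(f, n) and N(f, n): the target and non-target components,
    assigned frame-wise as described for the separation unit 30J.
    """
    S = np.zeros_like(X)
    N = np.zeros_like(X)
    for n in range(X.shape[0]):
        if is_target_section(n):
            S[n] = X[n]  # S(f, n) = X(f, n), N(f, n) = 0
        else:
            N[n] = X[n]  # S(f, n) = 0, N(f, n) = X(f, n)
    return S, N
```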
The correlation derivation section 30D derives a3 rd spatial correlation matrix of the target sound component in the 1 st sound signal and a 4 th spatial correlation matrix of the non-target sound component in the 1 st sound signal, from the target sound segment, the target sound component, and the non-target sound component.
Specifically, the correlation derivation unit 30D receives the target sound component S(f, n) and the non-target sound component N(f, n) from the separation unit 30J. The correlation derivation unit 30D also receives the functions u_a(n), u_b(n), and u_c(n) representing the target sound sections from the detection unit 30C.
Then, the correlation derivation unit 30D derives the 3rd spatial correlation matrices Φ_XXa(f, n), Φ_XXb(f, n), and Φ_XXc(f, n) and the 4th spatial correlation matrix Φ_NN(f, n) from the target sound component S(f, n), the non-target sound component N(f, n), and the functions u_a(n), u_b(n), and u_c(n). Then, the correlation derivation unit 30D updates the spatial correlation matrices by storing the derived 3rd spatial correlation matrices Φ_XXa(f, n), Φ_XXb(f, n), and Φ_XXc(f, n) and the 4th spatial correlation matrix Φ_NN(f, n) in the 3rd correlation storage unit 30E1, the 3rd correlation storage unit 30E2, the 3rd correlation storage unit 30E3, and the 4th correlation storage unit 30F, respectively.
In a section (n-th frame) where u_a(n) = 1, u_b(n) = 0, and u_c(n) = 0, the correlation derivation unit 30D derives and updates the 3rd spatial correlation matrix Φ_XXa(f, n) by the following formula (39), and does not update the 3rd spatial correlation matrices Φ_XXb(f, n) and Φ_XXc(f, n).
Similarly, in a section (n-th frame) where u_a(n) = 0, u_b(n) = 1, and u_c(n) = 0, the correlation derivation unit 30D derives and updates the 3rd spatial correlation matrix Φ_XXb(f, n) by the following formula (40), and does not update the 3rd spatial correlation matrices Φ_XXa(f, n) and Φ_XXc(f, n).
Likewise, in a section (n-th frame) where u_a(n) = 0, u_b(n) = 0, and u_c(n) = 1, the correlation derivation unit 30D derives and updates the 3rd spatial correlation matrix Φ_XXc(f, n) by the following formula (41), and does not update the 3rd spatial correlation matrices Φ_XXa(f, n) and Φ_XXb(f, n).
The correlation derivation unit 30D also derives and updates the 4th spatial correlation matrix Φ_NN(f, n) by the following formula (42).
[ equation 25 ]
Φ_XXa(f, n) = α Φ_XXa(f, n−1) + (1 − α) S^H(f, n) S(f, n)  … formula (39)
Φ_XXb(f, n) = α Φ_XXb(f, n−1) + (1 − α) S^H(f, n) S(f, n)  … formula (40)
Φ_XXc(f, n) = α Φ_XXc(f, n−1) + (1 − α) S^H(f, n) S(f, n)  … formula (41)
Φ_NN(f, n) = α Φ_NN(f, n−1) + (1 − α) N^H(f, n) N(f, n)  … formula (42)
In formulas (39) to (42), α is a value of 0 or more and less than 1. The closer α is to 1, the larger the weight given to the spatial correlation matrix derived in the past relative to the latest spatial correlation matrix. The value of α may be set in advance, for example to 0.95.
That is, the correlation derivation unit 30D derives a new 3rd spatial correlation matrix by correcting the 3rd spatial correlation matrix derived in the past with the latest 3rd spatial correlation matrix, which is given by the product of the Hermitian transpose of the target sound component S(f, n) and S(f, n).
As the 3rd spatial correlation matrices derived in the past, the correlation derivation unit 30D may simply use the 3rd spatial correlation matrices Φ_XXa(f, n), Φ_XXb(f, n), and Φ_XXc(f, n) stored in the 3rd correlation storage units 30E (the 3rd correlation storage unit 30E1 to the 3rd correlation storage unit 30E3). Only one 3rd spatial correlation matrix is stored in each 3rd correlation storage unit 30E, and it is updated sequentially by the correlation derivation unit 30D.
Similarly, the correlation derivation unit 30D derives a new 4th spatial correlation matrix Φ_NN(f, n) by correcting the 4th spatial correlation matrix derived in the past with the latest 4th spatial correlation matrix, which is given by the product of the Hermitian transpose of the non-target sound component N(f, n) and N(f, n). As the 4th spatial correlation matrix derived in the past, the correlation derivation unit 30D may simply use the 4th spatial correlation matrix Φ_NN(f, n) stored in the 4th correlation storage unit 30F. Only one 4th spatial correlation matrix Φ_NN(f, n) is stored in the 4th correlation storage unit 30F, and it is updated sequentially by the correlation derivation unit 30D.
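The recursive updates of formulas (39) to (42) for a single frame can be sketched as follows, with α = 0.95 as suggested above. Holding the matrices in NumPy arrays in place of the 3rd and 4th correlation storage units, and the dictionary-based bookkeeping, are implementation assumptions.

```python
import numpy as np

def update_correlations(phi_xx, phi_nn, S, N, u, alpha=0.95):
    """phi_xx: dict with keys 'a','b','c' of arrays (num_bins, 4, 4); phi_nn likewise shaped.

    S, N: (num_bins, 4) target / non-target components of the current frame.
    u: dict of 0/1 flags {'a': u_a(n), 'b': u_b(n), 'c': u_c(n)} for the current frame.
    """
    for key in ('a', 'b', 'c'):
        # update only the source that is active alone (formulas (39)-(41))
        if u[key] == 1 and sum(u.values()) == 1:
            for f in range(S.shape[0]):
                s = S[f][np.newaxis, :]  # row vector S(f, n)
                phi_xx[key][f] = alpha * phi_xx[key][f] + (1 - alpha) * s.conj().T @ s
    for f in range(N.shape[0]):          # formula (42)
        n_vec = N[f][np.newaxis, :]
        phi_nn[f] = alpha * phi_nn[f] + (1 - alpha) * n_vec.conj().T @ n_vec
    return phi_xx, phi_nn
```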
Next, the procedure of the audio signal processing executed by the audio signal processing device 11 of the present embodiment will be described.
Fig. 6 is a flowchart showing an example of the procedure of the audio signal processing executed by the audio signal processing device 11 of the present embodiment.
The conversion unit 30B performs a short-time Fourier transform on each of the 3rd sound signals received from the plurality of 1st microphones 14 via the AD conversion unit 18, and acquires the 1st sound signal represented by the spectrum X(f, n) (step S200). The conversion unit 30B outputs the acquired 1st sound signal to the separation unit 30J and each of the generation units 30H (the generation units 30H1 to 30H3) (step S202).
Next, the separation unit 30J separates the 1st sound signal into the target sound component S(f, n) and the non-target sound component N(f, n) (step S204). Then, the separation unit 30J outputs the target sound component S(f, n) and the non-target sound component N(f, n) to the correlation derivation unit 30D.
Next, in the audio signal processing unit 30, the addition unit 30K, the coefficient derivation unit 30G, the generation unit 30H, and the inverse transform unit 30I corresponding to each of the target sound source 12A1 to the target sound source 12A3 execute the processing of steps S206 to S212. The processing of steps S206 to S212 is executed in parallel among the functional units corresponding to the plurality of target sound sources 12A (the target sound source 12A1 to the target sound source 12A3).
First, the addition unit 30K derives the sum of the 3rd spatial correlation matrices of the target sound sources 12A other than the corresponding target sound source 12A and the 4th spatial correlation matrix Φ_NN(f, n), and outputs the sum to the coefficient derivation unit 30G of the corresponding target sound source 12A (step S206).
The coefficient derivation unit 30G reads the 3rd spatial correlation matrix of the corresponding target sound source 12A and the 4th spatial correlation matrix Φ_NN(f, n) from the 3rd correlation storage unit 30E and the 4th correlation storage unit 30F (step S208).
Then, the coefficient derivation unit 30G derives the spatial filter coefficient from the 3rd and 4th spatial correlation matrices read in step S208 (step S210).
Next, the generating unit 30H generates an emphasized sound signal in which the target sound signal of the corresponding target sound source 12A included in the 1 st sound signal is emphasized, using the spatial filter coefficient derived in step S210 (step S212).
Then, the inverse transform unit 30I outputs the emphasized sound signal generated in step S212 to the output unit 22 (step S214).
When the addition units 30K, the coefficient derivation units 30G, the generation units 30H, and the inverse transform units 30I corresponding to the target sound source 12A1 to the target sound source 12A3 execute the processing of steps S206 to S212, the emphasized sound signal in which the target sound signal emitted from the target sound source 12A1 is emphasized, the emphasized sound signal in which the target sound signal emitted from the target sound source 12A2 is emphasized, and the emphasized sound signal in which the target sound signal emitted from the target sound source 12A3 is emphasized are output to the detection unit 30C and the inverse transform units 30I.
Therefore, the output unit 22 that outputs the emphasized sound signals received from the inverse transform unit 30I1 to the inverse transform unit 30I3 can output a plurality of emphasized sound signals in which the target sounds of the plurality of target sound sources 12A are emphasized.
Next, the detection unit 30C detects a target sound segment of the target sound of each of the plurality of target sound sources 12A by using the plurality of emphasized sound signals received from the generation unit 30H (the generation units 30H1 to 30H 3) (step S216).
Next, the correlation derivation unit 30D derives the 3rd spatial correlation matrices Φ_XXa(f, n), Φ_XXb(f, n), and Φ_XXc(f, n) corresponding to the respective target sound sources 12A and the 4th spatial correlation matrix Φ_NN(f, n) from the target sound component S(f, n) and the non-target sound component N(f, n) separated in step S204 and from the functions u_a(n), u_b(n), and u_c(n) representing the target sound sections of the target sounds of the respective target sound sources 12A (step S218).
Then, the correlation derivation unit 30D updates the spatial correlation matrices by storing the derived 3rd spatial correlation matrices Φ_XXa(f, n), Φ_XXb(f, n), and Φ_XXc(f, n) and the 4th spatial correlation matrix Φ_NN(f, n) in the 3rd correlation storage unit 30E1, the 3rd correlation storage unit 30E2, the 3rd correlation storage unit 30E3, and the 4th correlation storage unit 30F, respectively (step S220).
Next, the audio signal processing unit 30 determines whether or not to end the audio signal processing (step S222). When the determination in step S222 is negative (step S222: NO), the process returns to step S200. On the other hand, if the determination in step S222 is positive (step S222: YES), the present flow is terminated.
As described above, in the audio signal processing apparatus 11 according to the present embodiment, the separation unit 30J separates the 1 st audio signal into the target audio component and the non-target audio component. The detection unit 30C detects a target audio segment from the emphasized audio signal. The correlation derivation unit 30D derives a3 rd spatial correlation matrix of the target sound component in the 1 st sound signal and a 4 th spatial correlation matrix of the non-target sound component in the 1 st sound signal from the target sound segment, the target sound component, and the non-target sound component. Then, the coefficient deriving unit 30G derives the spatial filter coefficient from the 3 rd spatial correlation matrix and the 4 th spatial correlation matrix.
As described above, in the audio signal processing device 11 according to the present embodiment, the spatial filter coefficient is derived using the 1 st audio signal acquired from the 1 st microphone 14 without using the 2 nd audio signal acquired from the 2 nd microphone 16. Therefore, in the present embodiment, it is not necessary to prepare the 2 nd microphone 16 for collecting the sound of the non-target sound source 12B other than the target sound source 12A, and it is possible to emphasize the target sound signal emitted from the target sound source 12A with high accuracy.
In the audio signal processing device 11 according to the present embodiment, it is possible to separate and emphasize target audio signals of target audio of each of the plurality of target audio sources 12A.
In the audio signal processing apparatus 11 according to the present embodiment, the correlation derivation unit 30D sequentially updates the 3rd spatial correlation matrices and the 4th spatial correlation matrix. Therefore, even when the actual positional relationship between the target sound sources 12A and the 1st microphones 14 deviates from the positional relationship assumed for the 3rd spatial correlation matrices stored as initial values in the 3rd correlation storage units 30E, the spatial correlation matrices gradually converge, through these updates, to matrices corresponding to the actual positional relationship.
Therefore, the audio signal processing device 11 of the present embodiment can effectively emphasize the target audio signal emitted from the target audio source 12A and suppress the undesired audio signal.
The audio signal processing apparatus 11 according to the present embodiment separates the 1 st audio signal into a target audio component and a non-target audio component, and uses them for deriving a spatial correlation matrix. Therefore, the audio signal processing device 11 can generate an emphasized audio signal in which undesired audio such as noise is effectively suppressed. Therefore, the audio signal processing device 11 can provide a highly accurate emphasized audio signal.
< modification 2>
The separation unit 30J may separate the 1 st audio signal into the target audio component and the non-target audio component by using a method different from the method described in embodiment 2.
For example, the separation unit 30J may determine whether the target sound is a target sound or a non-target sound for each frequency bin, and separate the 1 st sound signal into a target sound component and a non-target sound component using the determination result.
For example, the separation unit 30J separates the 1 st sound signal into the target sound component and the non-target sound component using the neural network.
In this case, the separation unit 30J estimates, using a neural network, a sound mask M_S(f, n) and a non-sound mask M_N(f, n) that take the value "0" or the value "1" for each frame and frequency bin. Then, the separation unit 30J derives the target sound component S_i(f, n) and the non-target sound component N_i(f, n) using the following formulas (43) and (44).
[ equation 26 ]
S_i(f, n) = M_S(f, n) X_i(f, n)   (i = 1, 2, 3, 4)  … formula (43)
N_i(f, n) = M_N(f, n) X_i(f, n)   (i = 1, 2, 3, 4)  … formula (44)
The separation unit 30J uses the spectrum X_i(f, n) of one channel as the input to the neural network, and estimates the sound mask M_S(f, n) and the non-sound mask M_N(f, n) for the input of each channel.
Then, the separation unit 30J may estimate a sound mask M_S(f, n) and a non-sound mask M_N(f, n) common to all channels, for example by taking a majority decision over the estimation results of all the channels.
The separation unit 30J may generate the learning data of the neural network in advance by simulation using a clean target sound signal not containing noise and a non-target sound signal not containing the target sound.
Let S_t(f, n) denote the spectrum of the clean target sound signal, and let N_t(f, n) denote the spectrum of a non-target sound signal not containing the target sound. Then, the spectrum X_t(f, n) of the sound on which a non-target sound such as noise is superimposed, the correct data M_tS(f, n) for the sound mask, and the correct data M_tN(f, n) for the non-sound mask are derived by the following formulas (45) to (47).
[ equation 27 ]
X_t(f, n) = S_t(f, n) + N_t(f, n)  … formula (45)
M_tS(f, n) = 1 if |S_t(f, n)|^2 / |N_t(f, n)|^2 > t_S, and M_tS(f, n) = 0 otherwise  … formula (46)
M_tN(f, n) = 1 if |N_t(f, n)|^2 / |S_t(f, n)|^2 > t_N, and M_tN(f, n) = 0 otherwise  … formula (47)
In formulas (46) and (47), t_S and t_N are thresholds on the power ratio between the target sound and the non-target sound.
As the input feature amount, a vector v (n) represented by the following formula (48) is used.
[ equation 28 ]
v(n) = [log|X_i(f, n)|, log|X_i(f, n−1)|]  (over all frequency bins f of the current and preceding frames)  … formula (48)
The vector v(n) represented by formula (48) is a 516-dimensional vector obtained by concatenating the logarithms of the absolute values of the spectra of the current frame and the immediately preceding frame. In the estimation of the sound mask M_S(f, n) and the non-sound mask M_N(f, n), the problem is formulated as estimating, from the vector v(n) representing the input feature amount, the 258-dimensional vector c(n) representing the correct data, which is given by the following formula (49).
[ equation 29 ]
c(n) = [c_1(n), c_2(n), …, c_258(n)] = [M_tS(0, n), …, M_tS(128, n), M_tN(0, n), …, M_tN(128, n)]  … formula (49)
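A sketch of generating the correct data of formulas (45) to (47) and the target vector of formula (49) for one frame of simulated data is shown below. The power-ratio comparisons follow the reconstruction of formulas (46) and (47) above and should be treated as assumptions, as should the small constant added to avoid division by zero.

```python
import numpy as np

def make_correct_data(S_t, N_t, t_S, t_N, eps=1e-12):
    """S_t, N_t: complex arrays of shape (129,) for one frame.

    Returns X_t(f, n) (formula (45)) and the 258-dimensional target
    vector c(n) of formula (49) built from M_tS and M_tN.
    """
    X_t = S_t + N_t                                    # formula (45)
    ratio = (np.abs(S_t) ** 2 + eps) / (np.abs(N_t) ** 2 + eps)
    M_tS = (ratio > t_S).astype(float)                 # formula (46), assumed form
    M_tN = ((1.0 / ratio) > t_N).astype(float)         # formula (47), assumed form
    c = np.concatenate([M_tS, M_tN])                   # formula (49)
    return X_t, c
```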
Therefore, the structure of the neural network model can be defined by the following equations (50) to (54).
[ equation 30 ]
Input layer: h_1(n) = sigmoid(W_i v(n)^T)  … formula (50)
Intermediate layer 1: h_2(n) = sigmoid(W_1 h_1(n))  … formula (51)
Intermediate layer 2: h_3(n) = sigmoid(W_2 h_2(n))  … formula (52)
Intermediate layer 3: h_4(n) = sigmoid(W_3 h_3(n))  … formula (53)
Output layer: m(n) = [m_1(n), m_2(n), …, m_258(n)]^T = sigmoid(W_o h_4(n))  … formula (54)
Here, the number of nodes in each intermediate layer is assumed to be 200. The matrix W_i therefore has a size of 200 × 516, and the matrix W_o has a size of 258 × 200. The matrices W_1, W_2, and W_3 are all 200 × 200.
Here, the objective function L is defined by the cross entropy expressed by the following equation (55).
[ equation 31 ]
L = Σ_n Σ_{i=1}^{258} { c_i(n) log m_i(n) + (1 − c_i(n)) log(1 − m_i(n)) }  … formula (55)
The separation unit 30J obtains, by learning, the parameters W_i, W_o, W_1, W_2, and W_3 that maximize the objective function L. As a learning method, a known method such as stochastic gradient descent may be used.
The values m_i(n) (i = 1, …, 258) in formula (55), estimated by the model generated as described above, are continuous values between 0 and 1. Therefore, for example, the separation unit 30J estimates the sound mask M_S(f, n) and the non-sound mask M_N(f, n) for each frame and each frequency bin by setting "0.5" as a threshold and binarizing values equal to or greater than the threshold to "1" and values smaller than the threshold to "0". The separation unit 30J then derives the target sound component S_i(f, n) and the non-target sound component N_i(f, n) using the above formulas (43) and (44).
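A sketch of this mask-based separation for one channel and one frame is shown below: the forward pass of formulas (50) to (54), the 0.5 binarization, and the application of formulas (43) and (44). The weight shapes follow the text; the function names and the assumption that trained weights are available are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def estimate_and_apply_masks(v, X_i, W_i, W_1, W_2, W_3, W_o, threshold=0.5):
    """v: 516-dim feature vector, X_i: 129-bin spectrum of channel i.

    Returns S_i(f, n) and N_i(f, n) per formulas (43) and (44).
    """
    h1 = sigmoid(W_i @ v)               # formula (50)
    h2 = sigmoid(W_1 @ h1)              # formula (51)
    h3 = sigmoid(W_2 @ h2)              # formula (52)
    h4 = sigmoid(W_3 @ h3)              # formula (53)
    m = sigmoid(W_o @ h4)               # formula (54), 258 values
    m = (m >= threshold).astype(float)  # binarize with the 0.5 threshold
    M_S, M_N = m[:129], m[129:]         # sound mask / non-sound mask
    S_i = M_S * X_i                     # formula (43)
    N_i = M_N * X_i                     # formula (44)
    return S_i, N_i
```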
In addition, in modification 1 and modification 2, a case where the constituent element of the neural network is a fully-connected network having 3 intermediate layers is described as an example. However, the constituent elements of the neural network are not limited thereto.
For example, when sufficient learning data can be prepared, accuracy can be improved by further increasing the number of intermediate layers and the number of nodes. Bias terms may also be used. Various functions such as ReLU and tanh may be used as the activation function instead of sigmoid. In addition to fully connected layers, various structures such as convolutional neural networks and recurrent neural networks can be used. Furthermore, although the FFT power spectrum is used here as the feature input to the neural network, various features such as mel filter bank outputs and mel cepstra, and combinations thereof, may also be used.
(embodiment 3)
The audio signal processing device 10 and the audio signal processing device 11 may be configured to include a recognition unit in place of the output unit 22.
Fig. 7 is a schematic diagram showing an example of the audio signal processing system 3 according to the present embodiment.
The audio signal processing system 3 includes an audio signal processing device 13 and a plurality of 1st microphones 14. The sound signal processing device 13 and the plurality of 1st microphones 14 are connected so as to exchange data and signals. That is, the audio signal processing system 3 includes the audio signal processing device 13 instead of the audio signal processing device 10.
The audio signal processing device 13 includes an AD converter 18, an audio signal processor 20, and a recognizer 24. The AD converter 18 and the audio signal processor 20 are the same as those of embodiment 1. That is, the audio signal processing device 13 is the same as the audio signal processing device 10 except that the output unit 22 is replaced with the recognition unit 24.
The recognition unit 24 recognizes the emphasized audio signal received from the audio signal processing unit 20.
Specifically, the recognition unit 24 is a device that analyzes the emphasized sound signal. The recognition unit 24 recognizes the emphasized sound signal represented by the output spectrum Y (f, n) by a known analysis method, for example, and outputs a recognition result. The output may be text data or symbolized information such as the recognized word ID. The recognition unit 24 may be a known recognition device.
(scope of application)
The audio signal processing device 10, the audio signal processing device 11, and the audio signal processing device 13 described in the above embodiments and modifications can be applied to various devices and systems for emphasizing a target audio signal.
Specifically, the audio signal processing device 10, the audio signal processing device 11, and the audio signal processing device 13 can be applied to various systems and devices that capture and process speech in an environment where 1 or more speakers output speech.
For example, the audio signal processing device 10, the audio signal processing device 11, and the audio signal processing device 13 can be applied to a conference system, a lecture system, a guest response system, a smart speaker, an in-vehicle system, and the like.
A conference system is a system that processes sound collected by microphones arranged in a space where 1 or more speakers speak. The lecture system is a system that processes sound collected by a microphone disposed in a space where at least one of a lecturer and a student speaks. The guest response system is a system that processes sound collected by microphones provided in a space where a clerk and a customer converse. The smart speaker is a speaker capable of using an AI (Artificial Intelligence) assistant that supports interactive voice operation. An in-vehicle system collects and processes sounds produced by occupants in a vehicle and uses the processing results for drive control of the vehicle and the like.
Next, hardware configurations of the audio signal processing apparatus 10, the audio signal processing apparatus 11, and the audio signal processing apparatus 13 according to the above-described embodiments and modifications will be described.
Fig. 8 is an explanatory diagram showing an example of the hardware configuration of the audio signal processing device 10, the audio signal processing device 11, and the audio signal processing device 13 according to the above-described embodiment and modification.
The audio signal processing device 10, the audio signal processing device 11, and the audio signal processing device 13 according to the above-described embodiments and modifications include a control device such as a CPU (Central Processing Unit) 51, storage devices such as a ROM (Read Only Memory) 52 and a RAM (Random Access Memory) 53, a communication I/F 54 connected to a network for communication, and a bus 61 connecting the respective units.
The programs executed by the audio signal processing device 10, the audio signal processing device 11, and the audio signal processing device 13 according to the above-described embodiments and modifications are provided by being incorporated in the ROM 52 or the like in advance.
The programs executed by the audio signal processing device 10, the audio signal processing device 11, and the audio signal processing device 13 according to the above-described embodiments and modifications may also be recorded as files in an installable or executable format on a computer-readable recording medium such as a CD-ROM (Compact Disk Read Only Memory), a flexible disk (FD), a CD-R (Compact Disk Recordable), or a DVD (Digital Versatile Disk) and provided as a computer program product.
Further, the programs executed by the audio signal processing device 10, the audio signal processing device 11, and the audio signal processing device 13 according to the above-described embodiments and modifications may be stored on a computer connected to a network such as the Internet and provided by being downloaded via the network. The programs may also be provided or distributed via a network such as the Internet.
The programs executed by the audio signal processing device 10, the audio signal processing device 11, and the audio signal processing device 13 of the above-described embodiments and modifications cause a computer to function as the respective units of those devices. In this computer, the CPU 51 can read the program from a computer-readable storage medium into a main storage device and execute it.
Although the embodiments and modifications of the present invention have been described, these embodiments and modifications are merely provided as examples and are not intended to limit the scope of the present invention. These new embodiments and modifications can be implemented in other various ways, and various omissions, substitutions, and changes can be made without departing from the spirit of the invention. These embodiments and modifications thereof are included in the scope and gist of the invention, and are included in the invention described in the claims and the equivalent scope thereof.
The above embodiments can be summarized as the following embodiments.
Technical solution 1
An audio signal processing device is provided with a coefficient derivation unit that derives a spatial filter coefficient for emphasizing a target sound signal included in a 1st sound signal, from an emphasized sound signal in which the target sound signal is emphasized.
Technical solution 2
In the audio signal processing device according to technical solution 1,
the coefficient derivation unit derives the spatial filter coefficient from the emphasized sound signal in which the target sound signal included in the 1st sound signal acquired from the plurality of microphones is emphasized.
Technical solution 3
The audio signal processing device according to technical solution 1 or technical solution 2 includes:
a detection unit that detects a target sound section from the emphasized sound signal; and
a correlation derivation unit that derives, from the target sound section and the 1st sound signal, a 1st spatial correlation matrix for the target sound section in the 1st sound signal and a 2nd spatial correlation matrix for a non-target sound section other than the target sound section in the 1st sound signal,
and the coefficient derivation unit derives the spatial filter coefficient from the 1st spatial correlation matrix and the 2nd spatial correlation matrix.
Technical solution 4
In the audio signal processing device according to technical solution 3,
the detection unit detects the target sound section based on a 2nd sound signal and the emphasized sound signal, the 2nd sound signal having a larger ratio of the power of the non-target sound signal to the power of the target sound signal than the 1st sound signal.
Technical solution 5
In the audio signal processing device according to technical solution 3 or technical solution 4,
the detection unit detects, based on the emphasized sound signal, the target sound section and an overlap section in which the target sound overlaps with the non-target sound, and
the correlation derivation unit derives the 1st spatial correlation matrix and the 2nd spatial correlation matrix from the target sound section, the overlap section, and the 1st sound signal.
Technical solution 6
In the audio signal processing device according to any one of technical solution 3 to technical solution 5,
the correlation derivation unit derives a new 1st spatial correlation matrix by correcting the 1st spatial correlation matrix derived in the past with the latest 1st spatial correlation matrix, which is represented, for the 1st sound signal in the target sound section, by the product of the 1st sound signal and a transposed signal obtained by Hermitian transposition of the 1st sound signal, and
the correlation derivation unit derives a new 2nd spatial correlation matrix by correcting the 2nd spatial correlation matrix derived in the past with the latest 2nd spatial correlation matrix, which is represented, for the 1st sound signal in the non-target sound section, by the product of the 1st sound signal and a transposed signal obtained by Hermitian transposition of the 1st sound signal.
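One possible realization of this correction is a recursive, exponentially weighted update of each spatial correlation matrix per frame and frequency bin. The sketch below is an illustration under stated assumptions (the forgetting factor `alpha` is assumed, not taken from the document), not the document's exact update rule:

```python
import numpy as np

def update_correlation(R_prev, x, alpha=0.95):
    """Correct a previously derived spatial correlation matrix R_prev with the
    latest matrix x x^H (the product of the multichannel observation x and its
    Hermitian transpose) for one frame and one frequency bin. The forgetting
    factor alpha is an assumed smoothing choice."""
    latest = np.outer(x, np.conjugate(x))   # x x^H
    return alpha * R_prev + (1.0 - alpha) * latest

# Illustrative update for M = 4 microphones at one frequency bin.
M = 4
x = np.random.randn(M) + 1j * np.random.randn(M)   # 1st sound signal for one frame/bin
R1 = np.eye(M, dtype=complex)                      # previously derived 1st spatial correlation matrix
R1 = update_correlation(R1, x)                     # new 1st spatial correlation matrix
```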
Technical solution 7
In the audio signal processing device according to technical solution 5 or technical solution 6,
the coefficient derivation unit derives, as the spatial filter coefficient, an eigenvector corresponding to the maximum eigenvalue of the product of the 1st spatial correlation matrix and the inverse matrix of the 2nd spatial correlation matrix.
Technical solution 8
The audio signal processing device according to technical solution 1 or technical solution 2 includes:
a detection unit that detects a target sound section from the emphasized sound signal;
a separation unit that separates the 1st sound signal into a target sound component and a non-target sound component; and
a correlation derivation unit that derives, from the target sound section, the target sound component, and the non-target sound component, a 3rd spatial correlation matrix of the target sound component in the 1st sound signal and a 4th spatial correlation matrix of the non-target sound component in the 1st sound signal,
and the coefficient derivation unit derives the spatial filter coefficient from the 3rd spatial correlation matrix and the 4th spatial correlation matrix.
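A minimal sketch of deriving such a correlation matrix from a separated component, assuming it is obtained by averaging outer products over the frames of the target (or non-target) sound section at one frequency bin; the shapes and frame ranges below are placeholders, not values from the document:

```python
import numpy as np

def spatial_correlation(component, frames):
    """Average the outer products c c^H of a separated component over the given
    frames to obtain a spatial correlation matrix for one frequency bin.
    `component` has shape (frames, microphones); `frames` selects the frames of
    the target (or non-target) sound section."""
    selected = component[frames]                                   # (T, M) complex
    outer_sum = np.einsum('tm,tn->mn', selected, selected.conj())  # sum of c c^H
    return outer_sum / max(len(selected), 1)

# Illustrative values: M = 4 microphones, 100 frames, target sound section = frames 20..59.
M, T = 4, 100
S = np.random.randn(T, M) + 1j * np.random.randn(T, M)   # separated target sound component (one bin)
R3 = spatial_correlation(S, np.arange(20, 60))           # stand-in 3rd spatial correlation matrix
```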
Technical solution 9
In the audio signal processing device according to technical solution 8,
the correlation derivation unit derives a new 3rd spatial correlation matrix by correcting the 3rd spatial correlation matrix derived in the past with the latest 3rd spatial correlation matrix, which is represented by the product of the target sound component and a transposed component obtained by Hermitian transposition of the target sound component, and
the correlation derivation unit derives a new 4th spatial correlation matrix by correcting the 4th spatial correlation matrix derived in the past with the latest 4th spatial correlation matrix, which is represented by the product of the non-target sound component and a transposed component obtained by Hermitian transposition of the non-target sound component.
Technical solution 10
In the audio signal processing device according to technical solution 9,
the coefficient derivation unit derives, as the spatial filter coefficient, an eigenvector corresponding to the maximum eigenvalue of the product of the 3rd spatial correlation matrix and the inverse matrix of the 4th spatial correlation matrix.
Technical solution 11
An audio signal processing method comprising the step of deriving a spatial filter coefficient for emphasizing a target sound signal included in a 1st sound signal, from an emphasized sound signal in which the target sound signal is emphasized.
Technical solution 12
An audio signal processing device is provided with:
a coefficient derivation unit that derives a spatial filter coefficient for emphasizing a target sound signal included in a 1st sound signal, from an emphasized sound signal in which the target sound signal is emphasized;
a generation unit that generates the emphasized sound signal in which the target sound included in the 1st sound signal is emphasized, using the spatial filter coefficient; and
a recognition unit that recognizes the emphasized sound signal.

Claims (12)

1. An audio signal processing device is provided with:
a coefficient deriving unit that derives, from an emphasized sound signal in which a target sound signal is emphasized, a spatial filter coefficient for emphasizing the target sound signal included in a 1st sound signal acquired by a 1st microphone for acquiring at least a target sound emitted from a target sound source;
a detection unit that detects a target sound section from the emphasized sound signal; and
a correlation derivation unit that derives, from the target sound section and the 1st sound signal, a 1st spatial correlation matrix of the target sound section in the 1st sound signal and a 2nd spatial correlation matrix of a non-target sound section other than the target sound section in the 1st sound signal,
the coefficient deriving unit derives the spatial filter coefficient from the 1st spatial correlation matrix and the 2nd spatial correlation matrix,
the detection unit detects the target sound section based on a 2nd sound signal and the emphasized sound signal, the 2nd sound signal having a larger ratio of the power of a non-target sound signal to the power of the target sound signal than the 1st sound signal and being collected by a 2nd microphone for collecting at least a non-target sound emitted from a non-target sound source.
2. The audio signal processing device according to claim 1,
the coefficient deriving unit derives the spatial filter coefficient from the emphasized sound signal in which the target sound signal included in the 1st sound signal acquired from the plurality of microphones is emphasized.
3. The audio signal processing device according to claim 1,
the detection unit detects, based on the emphasized sound signal, the target sound section and an overlap section that is a section in which sounds are emitted from both the target sound source and the non-target sound source, and
the correlation derivation unit derives the 1st spatial correlation matrix and the 2nd spatial correlation matrix from the target sound section, the overlap section, and the 1st sound signal.
4. The audio signal processing device according to claim 1,
the correlation derivation unit derives a new 1st spatial correlation matrix by correcting the 1st spatial correlation matrix derived in the past with the latest 1st spatial correlation matrix, which is represented, for the 1st sound signal in the target sound section, by the product of the 1st sound signal and a transposed signal obtained by Hermitian transposition of the 1st sound signal, and
the correlation derivation unit derives a new 2nd spatial correlation matrix by correcting the 2nd spatial correlation matrix derived in the past with the latest 2nd spatial correlation matrix, which is represented, for the 1st sound signal in the non-target sound section, by the product of the 1st sound signal and a transposed signal obtained by Hermitian transposition of the 1st sound signal.
5. The audio signal processing device according to claim 3,
the coefficient deriving unit derives, as the spatial filter coefficient, an eigenvector corresponding to the maximum eigenvalue of the product of the 1st spatial correlation matrix and the inverse matrix of the 2nd spatial correlation matrix.
6. An audio signal processing device is provided with:
a coefficient derivation unit that derives, from an emphasized sound signal in which a target sound signal is emphasized, a spatial filter coefficient for emphasizing the target sound signal included in a 1st sound signal acquired by a 1st microphone for acquiring at least a target sound emitted from a target sound source;
a detection unit that detects a target sound section from the emphasized sound signal;
a separation unit that separates the 1st sound signal into a target sound component and a non-target sound component; and
a correlation derivation unit that derives, from the target sound section, the target sound component, and the non-target sound component, a 3rd spatial correlation matrix of the target sound component in the 1st sound signal and a 4th spatial correlation matrix of the non-target sound component in the 1st sound signal,
the coefficient derivation unit derives the spatial filter coefficient from the 3rd spatial correlation matrix and the 4th spatial correlation matrix.
7. The audio signal processing device according to claim 6,
the correlation derivation unit derives a new 3rd spatial correlation matrix by correcting the 3rd spatial correlation matrix derived in the past with the latest 3rd spatial correlation matrix, which is represented by the product of the target sound component and a transposed component obtained by Hermitian transposition of the target sound component, and
the correlation derivation unit derives a new 4th spatial correlation matrix by correcting the 4th spatial correlation matrix derived in the past with the latest 4th spatial correlation matrix, which is represented by the product of the non-target sound component and a transposed component obtained by Hermitian transposition of the non-target sound component.
8. The audio signal processing device according to claim 7,
the coefficient derivation unit derives, as the spatial filter coefficient, an eigenvector corresponding to the maximum eigenvalue of the product of the 3rd spatial correlation matrix and the inverse matrix of the 4th spatial correlation matrix.
9. An audio signal processing method comprising the steps of:
a coefficient derivation step of deriving, from an emphasized sound signal in which a target sound signal is emphasized, a spatial filter coefficient for emphasizing the target sound signal included in a 1st sound signal acquired by a 1st microphone for acquiring at least a target sound emitted from a target sound source;
a detection step of detecting a target sound section from the emphasized sound signal; and
a correlation derivation step of deriving, from the target sound section and the 1st sound signal, a 1st spatial correlation matrix of the target sound section in the 1st sound signal and a 2nd spatial correlation matrix of a non-target sound section other than the target sound section in the 1st sound signal,
in the coefficient derivation step, the spatial filter coefficient is derived from the 1st spatial correlation matrix and the 2nd spatial correlation matrix,
in the detection step, the target sound section is detected based on a 2nd sound signal and the emphasized sound signal, the 2nd sound signal having a larger ratio of the power of a non-target sound signal to the power of the target sound signal than the 1st sound signal and being picked up by a 2nd microphone for picking up at least a non-target sound emitted from a non-target sound source.
10. An audio signal processing device is provided with:
a coefficient derivation unit that derives, from an emphasized sound signal in which a target sound signal is emphasized, a spatial filter coefficient for emphasizing the target sound signal included in a 1st sound signal acquired by a 1st microphone for acquiring at least a target sound emitted from a target sound source;
a generation unit that generates the emphasized sound signal in which a target sound included in the 1st sound signal is emphasized, using the spatial filter coefficient;
a recognition unit that recognizes the emphasized sound signal;
a detection unit that detects a target sound section from the emphasized sound signal; and
a correlation derivation unit that derives, from the target sound section and the 1st sound signal, a 1st spatial correlation matrix of the target sound section in the 1st sound signal and a 2nd spatial correlation matrix of a non-target sound section other than the target sound section in the 1st sound signal,
the coefficient derivation unit derives the spatial filter coefficient from the 1st spatial correlation matrix and the 2nd spatial correlation matrix,
the detection unit detects the target sound section based on a 2nd sound signal and the emphasized sound signal, the 2nd sound signal having a larger ratio of the power of the non-target sound signal to the power of the target sound signal than the 1st sound signal and being collected by a 2nd microphone for collecting at least a non-target sound emitted from a non-target sound source.
11. An audio signal processing method comprising:
a coefficient derivation step of deriving, from an emphasized sound signal in which a target sound signal is emphasized, a spatial filter coefficient for emphasizing the target sound signal included in a 1st sound signal acquired by a 1st microphone for acquiring at least a target sound emitted from a target sound source;
a detection step of detecting a target sound section from the emphasized sound signal;
a separation step of separating the 1st sound signal into a target sound component and a non-target sound component; and
a correlation derivation step of deriving, from the target sound section, the target sound component, and the non-target sound component, a 3rd spatial correlation matrix of the target sound component in the 1st sound signal and a 4th spatial correlation matrix of the non-target sound component in the 1st sound signal,
in the coefficient derivation step, the spatial filter coefficient is derived from the 3rd spatial correlation matrix and the 4th spatial correlation matrix.
12. An audio signal processing device is provided with:
a coefficient deriving unit that derives, from an emphasized sound signal in which a target sound signal is emphasized, a spatial filter coefficient for emphasizing the target sound signal included in a 1st sound signal acquired by a 1st microphone for acquiring at least a target sound emitted from a target sound source;
a generation unit that generates the emphasized sound signal in which a target sound included in the 1st sound signal is emphasized, using the spatial filter coefficient;
a recognition unit that recognizes the emphasized sound signal;
a detection unit that detects a target sound section from the emphasized sound signal;
a separation unit that separates the 1st sound signal into a target sound component and a non-target sound component; and
a correlation derivation unit that derives, from the target sound section, the target sound component, and the non-target sound component, a 3rd spatial correlation matrix of the target sound component in the 1st sound signal and a 4th spatial correlation matrix of the non-target sound component in the 1st sound signal,
the coefficient deriving unit derives the spatial filter coefficient from the 3rd spatial correlation matrix and the 4th spatial correlation matrix.
CN201910070357.XA 2018-07-02 2019-01-25 Audio signal processing device and audio signal processing method Active CN110675890B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2018-125779 2018-07-02
JP2018125779A JP6961545B2 (en) 2018-07-02 2018-07-02 Sound signal processor, sound signal processing method, and program

Publications (2)

Publication Number Publication Date
CN110675890A CN110675890A (en) 2020-01-10
CN110675890B true CN110675890B (en) 2023-03-14

Family

ID=69065594

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910070357.XA Active CN110675890B (en) 2018-07-02 2019-01-25 Audio signal processing device and audio signal processing method

Country Status (2)

Country Link
JP (1) JP6961545B2 (en)
CN (1) CN110675890B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11798533B2 (en) * 2021-04-02 2023-10-24 Google Llc Context aware beamforming of audio data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010091912A (en) * 2008-10-10 2010-04-22 Equos Research Co Ltd Voice emphasis system
CN102610227A (en) * 2011-01-18 2012-07-25 索尼公司 Sound signal processing apparatus, sound signal processing method, and program
CN102750952A (en) * 2011-04-18 2012-10-24 索尼公司 Sound signal processing device, method, and program
JP2012215606A (en) * 2011-03-31 2012-11-08 Oki Electric Ind Co Ltd Sound source separating device, program, and method
JP2017090853A (en) * 2015-11-17 2017-05-25 株式会社東芝 Information processing device, information processing method, and program

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4849404B2 (en) * 2006-11-27 2012-01-11 株式会社メガチップス Signal processing apparatus, signal processing method, and program
JP4891801B2 (en) * 2007-02-20 2012-03-07 日本電信電話株式会社 Multi-signal enhancement apparatus, method, program, and recording medium thereof
JP5044581B2 (en) * 2009-02-03 2012-10-10 日本電信電話株式会社 Multiple signal emphasis apparatus, method and program
JP5815489B2 (en) * 2012-08-28 2015-11-17 日本電信電話株式会社 Sound enhancement device, method, and program for each sound source
JP6201949B2 (en) * 2014-10-08 2017-09-27 株式会社Jvcケンウッド Echo cancel device, echo cancel program and echo cancel method
WO2017108097A1 (en) * 2015-12-22 2017-06-29 Huawei Technologies Duesseldorf Gmbh Localization algorithm for sound sources with known statistics


Also Published As

Publication number Publication date
JP6961545B2 (en) 2021-11-05
CN110675890A (en) 2020-01-10
JP2020003751A (en) 2020-01-09

Similar Documents

Publication Publication Date Title
CN112447191B (en) Signal processing device and signal processing method
EP3707716B1 (en) Multi-channel speech separation
US9881631B2 (en) Method for enhancing audio signal using phase information
JP6261043B2 (en) Audio processing apparatus, audio processing method, and audio processing program
JP5649488B2 (en) Voice discrimination device, voice discrimination method, and voice discrimination program
Zmolikova et al. Neural target speech extraction: An overview
KR101720514B1 (en) Asr apparatus and method of executing feature enhancement based on dnn using dcica
JP6501259B2 (en) Speech processing apparatus and speech processing method
JP6543848B2 (en) Voice processing apparatus, voice processing method and program
Ito et al. Probabilistic spatial dictionary based online adaptive beamforming for meeting recognition in noisy and reverberant environments
CN112331218B (en) Single-channel voice separation method and device for multiple speakers
Tóth et al. A perceptually inspired data augmentation method for noise robust cnn acoustic models
WO2019171457A1 (en) Sound source separation device, sound source separation method, and non-transitory computer-readable medium storing program
Xiong et al. Blind estimation of reverberation time based on spectro-temporal modulation filtering
CN110675890B (en) Audio signal processing device and audio signal processing method
Salvati et al. End-to-End Speaker Identification in Noisy and Reverberant Environments Using Raw Waveform Convolutional Neural Networks.
Moritz et al. A CHiME-3 challenge system: Long-term acoustic features for noise robust automatic speech recognition
KR101361034B1 (en) Robust speech recognition method based on independent vector analysis using harmonic frequency dependency and system using the method
Sose et al. Sound Source Separation Using Neural Network
Prasanna Kumar et al. Supervised and unsupervised separation of convolutive speech mixtures using f 0 and formant frequencies
Zaw et al. Speaker identification using power spectral subtraction method
Nakatani Speaker-aware neural network based beamformer for speaker extraction in speech mixtures
Yang et al. Multi-channel speech separation using deep embedding model with multilayer bootstrap networks
Zhang A study on speech signal processing for noise robust speaker and speech recognition
張兆峰 A study on speech signal processing for noise robust speaker and speech recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant