WO2019102585A1 - Acoustic signal separation device and method for separating acoustic signal - Google Patents

Acoustic signal separation device and method for separating acoustic signal

Info

Publication number
WO2019102585A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
classification
acoustic signal
unit
signal
Prior art date
Application number
PCT/JP2017/042222
Other languages
French (fr)
Japanese (ja)
Inventor
啓吾 川島
石井 純
岡登 洋平
Original Assignee
Mitsubishi Electric Corporation (三菱電機株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mitsubishi Electric Corporation
Priority to PCT/JP2017/042222
Publication of WO2019102585A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • G10L21/0308Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • the present invention relates to an acoustic signal separation device and an acoustic signal separation method for separating an acoustic signal in which one or more components are mixed into an acoustic signal for each component.
  • DNN: deep neural network
  • In the method of Patent Document 1, the DNN separates the acoustic signal without regard to the number of components of the acoustic signal or the type of those components. Consequently, when the number of components was not considered at design time, or when the number of components or the type of components changes dynamically, the acoustic signal cannot be separated with high accuracy.
  • The present invention solves this problem, and aims to provide an acoustic signal separation device and an acoustic signal separation method that can accurately separate an acoustic signal into its components.
  • An acoustic signal separation device includes a feature extraction unit, a data estimation unit, a data classification unit, and a signal regeneration unit.
  • the feature amount extraction unit extracts a feature amount from an input signal including an acoustic signal in which one or more components are mixed.
  • The data estimation unit estimates first data, which associates components of the acoustic signal output from the same sound source with one another, based on the feature amount extracted by the feature amount extraction unit, using a DNN whose network parameters have been trained in advance for this estimation.
  • The data classification unit classifies the first data estimated by the data estimation unit into components, based on second data in which a classification condition including information on at least one of the number of components of the acoustic signal and the type of those components is set.
  • the signal regeneration unit regenerates an acoustic signal for each component based on the first data classified for each component by the data classification unit and the feature amount extracted by the feature amount extraction unit.
  • According to the present invention, the feature extraction unit extracts the feature amount from the input signal, and the data estimation unit estimates, using the DNN, the first data that associates components of the acoustic signal output from the same sound source, based on that feature amount.
  • The data classification unit classifies the first data into components based on the second data, in which a classification condition including information on at least one of the number of components of the acoustic signal and the type of those components is set, and the signal regeneration unit regenerates an acoustic signal for each component based on the classified first data and the feature amount.
  • As a result, even when the number of components of the acoustic signal was not considered at design time, or when the number or type of components changes dynamically, the acoustic signal separation device can separate the acoustic signal into its components with high accuracy.
  • FIG. 1 is a block diagram showing the configuration of an acoustic signal separation device according to Embodiment 1 of the present invention.
  • FIG. 2A is a block diagram showing a hardware configuration that realizes the functions of the acoustic signal separation device according to Embodiment 1. FIG. 2B is a block diagram showing a hardware configuration that executes software realizing those functions.
  • FIG. 3 is a flowchart showing the acoustic signal separation method according to Embodiment 1.
  • FIG. 4A is a diagram showing classification data corresponding to two different components mapped into a two-dimensional space. FIG. 4B is a diagram showing the classification data of FIG. 4A classified by component.
  • FIG. 5A is a diagram showing classification data corresponding to the same component mapped into a two-dimensional space. FIG. 5B is a diagram showing the classification data of FIG. 5A incorrectly classified into two components. FIG. 5C is a diagram showing the classification data of FIG. 5A classified into one component.
  • FIG. 6A is a diagram showing classification data and classification condition data corresponding to two components arranged in time series. FIG. 6B is a diagram showing the classification data of FIG. 6A classified by component, together with smoothed classification condition data.
  • FIG. 7 is a block diagram showing the configuration of an acoustic signal separation device according to Embodiment 2 of the present invention.
  • FIG. 8A is a block diagram showing a hardware configuration that realizes the functions of the acoustic signal separation device according to Embodiment 2. FIG. 8B is a block diagram showing a hardware configuration that executes software realizing those functions.
  • FIG. 9 is a flowchart showing the acoustic signal separation method according to Embodiment 2.
  • FIG. 10 is a block diagram showing the configuration of an acoustic signal separation device according to Embodiment 3 of the present invention.
  • FIG. 11 is a flowchart showing the acoustic signal separation method according to Embodiment 3.
  • FIG. 12 is a block diagram showing the configuration of an acoustic signal separation device according to Embodiment 4 of the present invention.
  • FIG. 13 is a flowchart showing the acoustic signal separation method according to Embodiment 4.
  • FIG. 1 is a block diagram showing a configuration of an acoustic signal separation device 1 according to Embodiment 1 of the present invention.
  • The acoustic signal separation device 1 includes a feature extraction unit 2, a data estimation unit 3, a data acquisition unit 4, a data classification unit 5, and a signal regeneration unit 6; it separates the acoustic signal included in the input signal a into acoustic signals for each component and outputs an output signal g including the acoustic signal for each component.
  • the feature amount extraction unit 2 extracts a feature amount from the input signal a.
  • The input signal a may consist only of an acoustic signal in which one or more components are mixed, or it may include such an acoustic signal together with other signals.
  • the input signal a may be a signal including, in addition to the acoustic signal, an image signal or text data associated with the acoustic signal.
  • the feature quantities extracted from the input signal a by the feature quantity extraction unit 2 are the classification feature quantity b and the signal regeneration feature quantity c.
  • the classification feature amount b is a feature amount used for estimation of the classification data d by the data estimation unit 3.
  • the feature amount extraction unit 2 performs short-time Fourier transformation on the acoustic signal included in the input signal a to obtain the amplitude on the frequency axis, and calculates the feature amount based on the amplitude on the frequency axis.
  • Data in which the feature quantities calculated from the acoustic signal in this manner are arranged in time series may be used as the classification feature quantity b.
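  • As an illustrative sketch only (not taken from the publication), the classification feature amount b described above could be computed along the following lines; the frame length, hop size, and use of a log-magnitude are assumptions of this sketch.

```python
import numpy as np
from scipy.signal import stft

def extract_classification_features(audio, sample_rate, frame_len=512, hop=128):
    """Sketch of the feature amount extraction unit 2: short-time Fourier transform
    of the acoustic signal, amplitude on the frequency axis, arranged in time series."""
    _, _, spectrum = stft(audio, fs=sample_rate, nperseg=frame_len,
                          noverlap=frame_len - hop)
    magnitude = np.abs(spectrum)             # amplitude on the frequency axis
    features_b = np.log(magnitude + 1e-8).T  # classification feature b: (frames, bins)
    features_c = spectrum.T                  # complex spectrum kept for regeneration
    return features_b, features_c
```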
  • the signal regeneration feature quantity c is a feature quantity used for regeneration of the signal for each component by the signal regeneration unit 6.
  • The signal regeneration feature amount c may be spectral coefficients calculated by the feature amount extraction unit 2 performing the short-time Fourier transform on the acoustic signal included in the input signal a, and may also include image information or text data.
  • the data estimation unit 3 estimates classification data d based on the classification feature amount b extracted from the input signal a by the feature amount extraction unit 2 using the DNN 3a.
  • the classification data d is first data for correlating the components of the acoustic signal output from the same sound source.
  • For example, the classification data d may be values assigned to the components of the acoustic signal, transformed so that the distance between time-frequency components of the acoustic signal output from the same sound source becomes small.
  • In the DNN 3a, a network parameter 3b that has been learned in advance so as to estimate the classification data d from the classification feature b is set.
  • the DNN 3a in which the network parameter 3b is set estimates the classification data d by hierarchically calculating the classification feature amount b.
  • As the DNN 3a, for example, a Recurrent Neural Network (RNN) or a Convolutional Neural Network (CNN) may be used.
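  • For illustration, a recurrent network that maps the classification feature b to classification data d (one embedding vector per time-frequency component, so that components from the same sound source lie close together) might look like the following sketch; the layer sizes, the bidirectional LSTM, and the embedding dimension are assumptions, not taken from the publication.

```python
import torch
import torch.nn as nn

class ClassificationDataEstimator(nn.Module):
    """Sketch of the data estimation unit 3 with DNN 3a: estimates classification
    data d (per time-frequency embeddings) from the classification feature b."""
    def __init__(self, n_freq_bins, embed_dim=20, hidden=300):
        super().__init__()
        self.rnn = nn.LSTM(n_freq_bins, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_freq_bins * embed_dim)
        self.embed_dim = embed_dim

    def forward(self, features_b):                 # (batch, frames, bins)
        h, _ = self.rnn(features_b)
        e = self.proj(h)                           # (batch, frames, bins * embed_dim)
        e = e.view(e.shape[0], e.shape[1], -1, self.embed_dim)
        return nn.functional.normalize(e, dim=-1)  # unit-norm classification data d
```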
  • the data acquisition unit 4 receives and acquires the input of the classification condition data e.
  • the classification condition data e acquired by the data acquisition unit 4 is output to the data classification unit 5.
  • Alternatively, the data classification unit 5 may obtain the classification condition data e directly, and the data acquisition unit 4 may be provided in a device separate from the acoustic signal separation device 1. That is, in the acoustic signal separation device 1, the data classification unit 5 may itself have the function of acquiring the classification condition data e, in which case the data acquisition unit 4 need not be provided.
  • the classification condition data e is second data in which the classification condition of the classification data d is set.
  • the classification condition set in the classification condition data e includes information on at least one of the number of components of the acoustic signal and the type of the component of the acoustic signal.
  • the information on the number of components of the sound signal may be data indicating the number of dynamically changing sound sources, and may be, for example, sound source sequence data in which the number of sound sources is arranged in time series.
  • The information on the type of component of the acoustic signal may be any information that can specify the sound source, such as the gender of the speaker, the type of language, or the type of the output sound. For example, if the types of the components of the acoustic signal are a siren and an animal call, the acoustic signal is a mixture of the siren component output from an alarm and the component of the call uttered by the animal.
  • the data classification unit 5 classifies the classification data d estimated by the data estimation unit 3 for each component based on the classification condition data e acquired by the data acquisition unit 4.
  • Classification methods such as k-means clustering or GMM (Gaussian Mixture Models) may be used to classify the classification data d.
  • Classification result information f that is classification data d classified by the data classification unit 5 is output to the signal regeneration unit 6.
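  • A minimal sketch of this classification step, assuming the classification condition data e simply carries the number of sound sources and using k-means (one of the methods the description names):

```python
from sklearn.cluster import KMeans

def classify_classification_data(classification_data_d, num_sources):
    """Sketch of the data classification unit 5: cluster the classification data d
    into as many groups as the number of sound sources in the classification
    condition data e, yielding a component label per time-frequency bin."""
    frames, bins, embed_dim = classification_data_d.shape
    flat = classification_data_d.reshape(-1, embed_dim)
    labels = KMeans(n_clusters=num_sources, n_init=10).fit_predict(flat)
    return labels.reshape(frames, bins)  # classification result information f
```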
  • The signal regeneration unit 6 receives the classification result information f from the data classification unit 5 and, based on the classification data d for each component in the classification result information f, regenerates an acoustic signal for each component from the signal regeneration feature amount c.
  • the signal regeneration unit 6 outputs an output signal g which is an acoustic signal for each regenerated component.
  • the output signal g may include an image signal and text information corresponding to the acoustic signal for each regenerated component.
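  • One hedged way the signal regeneration unit 6 could turn the per-component classification result into component signals is binary time-frequency masking of the signal regeneration feature amount c followed by an inverse short-time Fourier transform; the publication does not prescribe this exact procedure.

```python
from scipy.signal import istft

def regenerate_components(features_c, labels, num_sources, sample_rate,
                          frame_len=512, hop=128):
    """Sketch of the signal regeneration unit 6: mask the complex spectrum
    (signal regeneration feature c) with the per-component labels and invert it."""
    outputs = []
    for component in range(num_sources):
        mask = (labels == component).astype(float)   # binary mask, (frames, bins)
        masked = (features_c * mask).T               # back to (bins, frames) for istft
        _, audio = istft(masked, fs=sample_rate, nperseg=frame_len,
                         noverlap=frame_len - hop)
        outputs.append(audio)                        # acoustic signal of one component
    return outputs                                   # output signal g, per component
```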
  • FIG. 2A is a block diagram showing a hardware configuration for realizing the function of the acoustic signal separation device 1.
  • FIG. 2B is a block diagram showing a hardware configuration for executing software for realizing the function of the acoustic signal separation device 1.
  • the acoustic interface 100 is an interface that receives an acoustic signal included in an input signal a and outputs an acoustic signal included in an output signal g.
  • the acoustic interface 100 is connected to a microphone that collects an acoustic signal, and is connected to a speaker that outputs the acoustic signal.
  • the image interface 101 is an interface that receives an image signal included in an input signal a and outputs an image signal included in an output signal g.
  • the image interface 101 is connected to a camera for capturing an image signal and connected to a display for displaying the image signal.
  • the text input interface 102 is an interface for inputting text information included in the input signal a.
  • the text input interface 102 is connected to a keyboard or mouse for inputting text information.
  • The memory (not shown) included in the processing circuit 103 shown in FIG. 2A, or the memory 105 shown in FIG. 2B, temporarily stores the input signal a, the classification feature b, the signal regeneration feature c, the classification data d, the classification condition data e, the classification result information f, and the output signal g.
  • the processing circuit 103 or the processor 104 appropriately reads these data to separate the acoustic signal.
  • The acoustic signal separation device 1 includes a processing circuit for executing the processing from step ST1 to step ST4 described later with reference to FIG. 3.
  • the processing circuit may be dedicated hardware or may be a CPU (Central Processing Unit) that executes a program stored in a memory.
  • The processing circuit 103 may be, for example, a single circuit, a composite circuit, a programmed processor, a parallel-programmed processor, an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or a combination thereof.
  • The respective functions of the feature amount extraction unit 2, the data estimation unit 3, the data acquisition unit 4, the data classification unit 5, and the signal regeneration unit 6 may be realized by separate processing circuits, or may be realized collectively by a single processing circuit.
  • When the processing circuit is the processor 104 shown in FIG. 2B, the functions of the feature amount extraction unit 2, the data estimation unit 3, the data acquisition unit 4, the data classification unit 5, and the signal regeneration unit 6 are realized by software, firmware, or a combination of software and firmware.
  • the software or firmware is written as a program and stored in the memory 105.
  • the processor 104 reads out and executes the program stored in the memory 105 to thereby perform the respective functions of the feature amount extraction unit 2, the data estimation unit 3, the data acquisition unit 4, the data classification unit 5 and the signal regeneration unit 6.
  • The acoustic signal separation device 1 includes the memory 105 for storing programs which, when executed by the processor 104, result in the processing from step ST1 to step ST4 shown in FIG. 3 being performed.
  • These programs cause a computer to execute the procedure or method of the feature extraction unit 2, the data estimation unit 3, the data acquisition unit 4, the data classification unit 5, and the signal regeneration unit 6.
  • The memory 105 may also be regarded as a computer-readable storage medium storing programs for causing a computer to function as the feature extraction unit 2, the data estimation unit 3, the data acquisition unit 4, the data classification unit 5, and the signal regeneration unit 6.
  • The memory 105 may be, for example, a nonvolatile or volatile semiconductor memory such as a random access memory (RAM), a read only memory (ROM), a flash memory, an erasable programmable ROM (EPROM), or an electrically erasable programmable ROM (EEPROM).
  • The memory 105 may also be a magnetic disk, a flexible disk, an optical disc, a compact disc, a mini disc, a DVD, or the like.
  • the memory 105 may be an external memory such as a USB (Universal Serial Bus) memory.
  • The functions of the feature quantity extraction unit 2, the data estimation unit 3, the data acquisition unit 4, the data classification unit 5, and the signal regeneration unit 6 may be realized partly by dedicated hardware and partly by software or firmware. For example, the functions of the feature amount extraction unit 2, the data estimation unit 3, and the data acquisition unit 4 may be realized by a processing circuit that is dedicated hardware, while the functions of the data classification unit 5 and the signal regeneration unit 6 are realized by the processor 104 reading and executing a program stored in the memory 105. Thus, the processing circuit can realize each of the above functions by hardware, software, firmware, or a combination thereof.
  • FIG. 3 is a flowchart showing an acoustic signal separation method according to the first embodiment.
  • the feature quantity extraction unit 2 extracts the classification feature quantity b and the signal regeneration feature quantity c from the input signal a (step ST1).
  • the classification feature amount b is output from the feature extraction unit 2 to the data estimation unit 3
  • the signal regeneration feature amount c is output from the feature extraction unit 2 to the signal regeneration unit 6.
  • The input signal a may include, in addition to the acoustic signal input through the acoustic interface 100, an image signal input through the image interface 101 or text information input through the text input interface 102. The feature quantity extraction unit 2 may also extract the feature quantities by reading the input signal a from a memory (not shown) included in the processing circuit 103 or from the memory 105. Furthermore, the input signal a may be stream data.
  • the data estimation unit 3 estimates classification data d based on the classification feature b using DNN 3a (step ST2).
  • the classification data d is output from the data estimation unit 3 to the data classification unit 5.
  • the data classification unit 5 classifies the classification data d estimated by the data estimation unit 3 for each component based on the classification condition data e (step ST3).
  • the data classification unit 5 outputs, to the signal regeneration unit 6, classification result information f which is classification data d classified for each component.
  • FIG. 4A is a view showing classification data d1 and d2 corresponding to components of two different acoustic signals mapped in a two-dimensional space.
  • Here, there are two sound sources, sound source A and sound source B, and it is assumed that the input signal a contains a mixture of the component of the acoustic signal output from sound source A and the component of the acoustic signal output from sound source B.
  • Classification data d1, indicated by circle symbols, is data that associates the components of the acoustic signal output from sound source A, and classification data d2, indicated by triangle symbols, is data that associates the components of the acoustic signal output from sound source B.
  • When the acoustic signal changes, the classification feature b changes accordingly. Therefore, when the data estimation unit 3 estimates the classification data d from the classification feature b using the DNN 3a, the value of the classification data d may vary with changes in the classification feature b, even for classification data d corresponding to components of the acoustic signal output from the same sound source.
  • The classification data d, dispersed over a plurality of values, is input to the data classification unit 5 without any indication of whether it is classification data d1 belonging to sound source A or classification data d2 belonging to sound source B.
  • FIG. 4B is a view showing classification data d1 and d2 classified by component.
  • Here, the classification condition that the number of sound sources is two is set in the classification condition data e. Since the number of sound sources is two, the data classification unit 5 classifies the classification data d1 into a first group A1 corresponding to sound source A, and the classification data d2 into a second group A2 corresponding to sound source B.
  • FIG. 5A is a diagram showing classification data d1 corresponding to the same component mapped in a two-dimensional space.
  • FIG. 5B is a diagram showing classification data d1 incorrectly classified into two components.
  • FIG. 5C is a diagram showing classification data d1 classified into one component.
  • Here, the only sound source is sound source A, and the acoustic signal contained in the input signal a consists solely of the component output from sound source A.
  • The classification data d, dispersed over a plurality of values, is input to the data classification unit 5 without any indication of which sound source it belongs to.
  • If the classification data d were simply divided into a fixed number of groups, it could be incorrectly classified into two components as shown in FIG. 5B. In contrast, the data classification unit 5 classifies the classification data d based on, for example, the number of sound sources set in the classification condition data e.
  • the data classification unit 5 correctly classifies the plurality of classification data d into the group C corresponding to the sound source A, as shown in FIG. 5C.
  • the acoustic signal separation device 1 can prevent separation errors of the acoustic signal due to the mismatch between the number of separations and the number of sound sources.
  • FIG. 6A is a view showing classification data d1 and d2 and classification condition data e corresponding to two components arranged in time series. For example, when a plurality of speakers are sound sources and a plurality of speakers speak, the number of sound sources changes dynamically. The example shown in FIG. 6A shows the case where the sound signal is also output from the sound source B after the sound signal is output from the sound source A.
  • the classification data d is classified with high accuracy by using the classification condition including the information on at least one of the number of acoustic signal components and the type of acoustic signal components.
  • The data classification unit 5 uses the classification condition data e, which is a sound source number sequence indicating the number of sound sources over time, to classify the classification data d1 into a first group D1 corresponding to sound source A and the classification data d2 into a second group D2 corresponding to sound source B.
  • the data classification unit 5 can classify the classification data d with high accuracy even if the number of sound sources dynamically changes.
  • FIG. 6B is a view showing classification data d1 and d2 classified according to components and smoothed classification condition data.
  • the number of sound sources at time e1 is “1” although the number of sound sources at times before and after time e1 is “2”.
  • The data classification unit 5 smoothes the time-series changes of the sound source number sequence. For example, the data classification unit 5 replaces the number of sound sources at time e1 with the average of the values "2" at the times before and after time e1. As a result, even if the number of sound sources in the sound source number sequence changes abruptly, the data classification unit 5 can classify the classification data d with high accuracy.
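  • As a minimal sketch of this smoothing, assuming the sound source number sequence is a per-frame integer array, a short median filter removes an isolated jump such as the one at time e1 (the window length is an assumption of the sketch):

```python
import numpy as np
from scipy.ndimage import median_filter

def smooth_source_count(source_count_sequence, window=3):
    """Sketch: smooth sudden changes in the sound source number sequence so that
    an isolated count of 1 surrounded by counts of 2 is replaced by 2."""
    counts = np.asarray(source_count_sequence)
    return median_filter(counts, size=window, mode='nearest')

# e.g. smooth_source_count([2, 2, 1, 2, 2]) -> array([2, 2, 2, 2, 2])
```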
  • The classification condition may also be time-series data on the type of component of the acoustic signal. For example, while only sound source A is outputting, information on the type of the component of the acoustic signal output from sound source A is set as the classification condition of the classification condition data e, and while both sound sources A and B are outputting, information on the types of the components of the acoustic signals output from each of sound sources A and B is set.
  • the data classification unit 5 can classify the classification data d with high accuracy even if the type of the component of the acoustic signal changes dynamically by referring to the time-series data of the type of the component of the acoustic signal.
  • The signal regeneration unit 6 receives the signal regeneration feature amount c from the feature amount extraction unit 2 and the classification result information f from the data classification unit 5, and regenerates an acoustic signal for each component based on the classification data d for each component in the classification result information f and the signal regeneration feature amount c (step ST4).
  • For example, the signal regeneration unit 6 identifies the signal regeneration feature amount c corresponding to sound source A using the classification data d1 classified into the first group A1 shown in FIG. 4B, and regenerates the acoustic signal of that component based on the identified signal regeneration feature amount c and the classification data d1. Similarly, the signal regeneration unit 6 identifies the signal regeneration feature amount c corresponding to sound source B using the classification data d2 classified into the second group A2 shown in FIG. 4B, and regenerates the component signal output from sound source B based on the identified signal regeneration feature amount c and the classification data d2. In this way, the acoustic signal included in the input signal a is separated into the acoustic signal of the component output from sound source A and the acoustic signal of the component output from sound source B.
  • The signal regeneration unit 6 may output the classification condition data e in association with the acoustic signal for each regenerated component. For example, when information indicating that the sound sources are sound source A and sound source B is set in the classification condition data e as information on the types of the components of the acoustic signal, the signal regeneration unit 6 outputs the regenerated signal corresponding to sound source A in association with information indicating sound source A, and the regenerated signal corresponding to sound source B in association with information indicating sound source B. In this way, the signal regeneration unit 6 can provide an output signal g from which the sound source that output the acoustic signal of each component can be identified.
  • The signal regeneration unit 6 may also determine the type of component based on the distance between a teacher signal related to the type of the component of the acoustic signal and the acoustic signal of the regenerated component, and associate that type with the regenerated acoustic signal. Further, the signal regeneration unit 6 may compare the utterance timing of a speaker, analyzed from an image signal extracted from the input signal a, with the output timing of the acoustic signal of the regenerated component, and associate the speaker with the acoustic signal whose timing overlaps. For example, the signal regeneration unit 6 may identify the speaker's utterance timing from the speaker's lip information included in the image signal.
  • the classification condition set in the classification condition data e may be the number of components of the acoustic signal, but may be at least one of the lower limit value, the upper limit value, and the range of the number of components of the acoustic signal.
  • The data classification unit 5 may stop classifying the classification data d when the number of components of the acoustic signal falls below the lower limit, exceeds the upper limit, or is outside the specified range. In these cases, the separation of the acoustic signal is stopped.
  • In this way, a stop criterion for classification of the classification data d can be set as a classification condition, and the data classification unit 5 can classify the classification data d with high accuracy within the range designated by the classification condition.
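  • A hedged sketch of how such lower-limit, upper-limit, and range conditions could gate the classification step (the parameter names are illustrative assumptions):

```python
def classification_allowed(num_components, lower=None, upper=None, valid_range=None):
    """Sketch: decide whether the data classification unit 5 should classify,
    based on the classification condition for the number of acoustic signal
    components; returning False corresponds to stopping the separation."""
    if lower is not None and num_components < lower:
        return False           # below the lower limit
    if upper is not None and num_components > upper:
        return False           # above the upper limit
    if valid_range is not None and num_components not in valid_range:
        return False           # outside the designated range
    return True
```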
  • As the classification condition set in the classification condition data e, feature information indicating a feature of the sound source may also be set.
  • the feature information indicating the feature of the sound source may be an output aspect of an acoustic signal by the sound source, or may be a physical feature of the sound source.
  • the output mode of the sound signal may be an average output time at which the sound signal is output from the sound source.
  • the physical feature of the sound source may be information on the physical constitution of the person.
  • the data classification unit 5 can classify the classification data d with high accuracy according to the feature information indicating the feature of the sound source.
  • the feature quantity extraction unit 2 extracts the feature quantity from the input signal a.
  • the data estimation unit 3 estimates classification data d based on the classification feature b using DNN 3a.
  • The data classification unit 5 classifies the classification data d into components based on the classification condition data e, in which a classification condition including information on at least one of the number of components of the acoustic signal and the type of those components is set.
  • the signal regeneration unit 6 regenerates an acoustic signal for each component based on the classification data d classified for each component and the feature value c for signal regeneration.
  • As a result, even when the number of components of the acoustic signal was not considered at design time, or when the number or type of components changes dynamically, the acoustic signal separation device 1 can accurately separate the acoustic signal into its components.
  • The classification condition set in the classification condition data e may also be at least one of a lower limit value, an upper limit value, and a range of the number of components of the acoustic signal.
  • the data classification unit 5 can classify the classification data d with high accuracy within the range designated by the classification condition.
  • the classification condition set in the classification condition data e is a sound source number sequence indicating a time-series change in the number of sound sources.
  • the data classification unit 5 can classify the classification data d with high accuracy even if the number of sound sources changes dynamically.
  • the data classification unit 5 smoothes the time-series change of the sound source sequence. Thereby, even if the number of sound sources in the sound source sequence changes suddenly, the data classification unit 5 can classify the classification data d with high accuracy.
  • the classification condition set in the classification condition data e is feature information indicating the feature of the sound source.
  • the data classification unit 5 can classify the classification data d with high accuracy according to the feature information indicating the feature of the sound source.
  • the signal regeneration unit 6 outputs the information on the type of the component of the acoustic signal in association with the signal of each component. Thereby, the signal regeneration unit 6 can provide the output signal g capable of specifying the sound source that has output the regenerated acoustic signal.
  • FIG. 7 is a block diagram showing a configuration of an acoustic signal separation device 1A according to Embodiment 2 of the present invention.
  • The acoustic signal separation device 1A includes a feature extraction unit 2, a data estimation unit 3, a data classification unit 5, a signal regeneration unit 6, and a data calculation unit 7; it separates the acoustic signal included in the input signal a into acoustic signals for each component and outputs an output signal g including the acoustic signal for each component.
  • the data calculation unit 7 calculates classification condition data e from the input signal a.
  • the input signal a includes, in addition to the acoustic signal, sensor information for specifying the number of components of the acoustic signal and the type of the component of the acoustic signal.
  • The sensor information includes, for example, biological information of a person who can be a sound source, such as brain waves or heart rate, image information of a person who can be a sound source, and physical information such as vibration or temperature changes caused by the person's speech.
  • The data calculation unit 7 uses the sensor information included in the input signal a to calculate classification condition data e in which a classification condition including information on at least one of the number of components of the acoustic signal and the type of those components is set.
  • FIG. 8A is a block diagram showing a hardware configuration for realizing the function of the acoustic signal separation device 1A.
  • FIG. 8B is a block diagram showing a hardware configuration for executing software for realizing the function of the acoustic signal separation device 1A.
  • the sensor interface 106 is an interface for inputting the sensor information described above.
  • The sensor interface 106 is connected to a biological sensor that detects biological information of a person who can be a sound source, a camera that captures a person who can be a sound source, or a physical sensor that detects vibration or temperature changes caused by a person's speech.
  • The acoustic signal separation device 1A includes a processing circuit for executing the processing from step ST1a to step ST5a described later with reference to FIG. 9.
  • the processing circuit may be the processing circuit 103 that is dedicated hardware, or may be the processor 104 that executes a program stored in the memory 105.
  • FIG. 9 is a flowchart showing an acoustic signal separation method according to the second embodiment.
  • The processes of step ST1a, step ST3a, step ST4a, and step ST5a are the same as those of step ST1, step ST2, step ST3, and step ST4 shown in FIG. 3.
  • In step ST2a, the data calculation unit 7 calculates the classification condition data e based on the input signal a including the sensor information. For example, the data calculation unit 7 identifies the presence or absence of persons and the number of persons based on the sensor information, and calculates classification condition data e in which a classification condition including the identified information is set.
  • The data calculation unit 7 may also detect a person's speech from the person's lip information and thereby specify the number of components of the acoustic signal. Furthermore, the data calculation unit 7 may specify the number of components of the acoustic signal and the type of those components using a DNN machine-learned to output at least one of them from the sensor information included in the input signal a. The data calculation unit 7 may also detect the number of sound sources from the acoustic signal itself; for example, the number of speakers detected using a speaker verification technique for detecting specific speakers may be used as the number of components of the acoustic signal.
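  • As an illustrative sketch only (the sensor layout and the per-person activity detection are assumptions), the data calculation unit 7 could derive a sound source number sequence from per-frame activity flags for each person who can be a sound source, for example obtained from lip information:

```python
import numpy as np

def calculate_classification_condition(person_activity):
    """Sketch of the data calculation unit 7: given activity flags of shape
    (num_persons, frames), e.g. derived from lip movement, produce a sound
    source number sequence to be set in the classification condition data e."""
    activity = np.asarray(person_activity, dtype=bool)
    return activity.sum(axis=0)   # number of active sound sources per frame
```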
  • the data classification unit 5 classifies the classification data d estimated by the data estimation unit 3 for each component based on the classification condition data e calculated by the data calculation unit 7.
  • the data classification unit 5 generates, from the classification data d classified for each component, classification result information f in which the classification data d belonging to the same component are associated with each other and outputs the generated classification result information f to the signal regeneration unit 6.
  • the acoustic signal separation device 1A includes the data calculation unit 7 that calculates the classification condition data e from the input signal a.
  • the acoustic signal separation device 1A can obtain classification condition data e from the input signal a.
  • the acoustic signal separation device 1A can accurately separate the acoustic signal even if the number of components of the acoustic signal or the type of the component of the acoustic signal is unknown.
  • FIG. 10 is a block diagram showing a configuration of an acoustic signal separation device 1B in accordance with Embodiment 3 of the present invention.
  • The acoustic signal separation device 1B includes a feature extraction unit 2, a data estimation unit 3, a data classification unit 5, a signal regeneration unit 6, a data calculation unit 7, and a parameter switching unit 8; it separates the acoustic signal included in the input signal a into acoustic signals for each component and outputs an output signal g including the acoustic signal for each component.
  • the parameter switching unit 8 switches the network parameter 3b of the DNN 3a based on the classification condition data e.
  • The network parameters 3b are learned in advance according to the characteristics of various sound sources. For example, the parameter switching unit 8 refers to the classification condition data e, selects a network parameter 3b corresponding to the characteristics of the sound source to be processed from among the various network parameters 3b learned in advance, and sets this network parameter 3b in the DNN 3a.
  • The acoustic signal separation device 1B includes a processing circuit for executing the processing from step ST1b to step ST6b described later with reference to FIG. 11.
  • the processing circuit may be the processing circuit 103 that is dedicated hardware shown in FIG. 2A, or may be the processor 104 that executes a program stored in the memory 105 shown in FIG. 2B.
  • FIG. 11 is a flowchart showing an acoustic signal separation method according to the third embodiment.
  • the processes of step ST1b, step ST2b, and step ST4b to step ST6b are the same as the processes of step ST1a, step ST2a, and step ST3a to step ST5a shown in FIG.
  • In step ST3b, the parameter switching unit 8 refers to the classification condition data e, selects the network parameter 3b corresponding to the sound source to be processed from among the various network parameters 3b learned in advance, and sets the selected network parameter 3b in the DNN 3a.
  • In the classification condition data e, a classification condition including information on at least one of the number of components of the acoustic signal and the type of those components is set. The information on the type of components is information that can specify the sound source, such as the gender of the speaker, the type of language, or the type of output sound. As the classification condition, in addition to the type of the components of the acoustic signal, information on characteristics of the sound source, such as the number of sound sources, the average output time of the acoustic signal from the sound source, and the physique of the person who is the sound source, may also be set.
  • the parameter switching unit 8 may select the network parameter 3b using a decision tree based on a predetermined selection rule.
  • the selection rule may be, for example, a rule that defines the type of acoustic signal component and the number of acoustic signal components, such as “two males” and “three or more females”.
  • Alternatively, the parameter switching unit 8 may select the network parameter 3b using a neural network or a GMM machine-learned to select, with the classification condition data e as input, the network parameter 3b corresponding to the sound source to be processed.
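  • A minimal rule-based sketch of the parameter switching unit 8; the rule keys and parameter file names are purely illustrative assumptions and are not taken from the publication.

```python
def select_network_parameters(classification_condition, parameter_sets):
    """Sketch of the parameter switching unit 8: pick the pre-learned network
    parameter 3b whose selection rule matches the classification condition data e
    (here reduced to a speaker type and a number of sources)."""
    speaker_type = classification_condition.get("speaker_type")  # e.g. "male"
    num_sources = classification_condition.get("num_sources")    # e.g. 2
    for rule, params in parameter_sets:
        if rule == {"speaker_type": speaker_type, "num_sources": num_sources}:
            return params                 # matching pre-learned network parameter 3b
    return parameter_sets[-1][1]          # assumed default parameter set

# Hypothetical usage:
# parameter_sets = [({"speaker_type": "male", "num_sources": 2}, "dnn_male_two.params"),
#                   ({"speaker_type": "female", "num_sources": 3}, "dnn_female_three.params"),
#                   ({"speaker_type": None, "num_sources": None}, "dnn_default.params")]
```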
  • the acoustic signal separation device 1B includes the parameter switching unit 8 that switches the network parameter 3b set in the DNN 3a based on the classification condition data e.
  • the data estimation unit 3 can estimate the classification data d corresponding to the sound source to be processed using the DNN 3a in which the network parameter 3b is switched. Accordingly, the acoustic signal separation device 1B can separate the acoustic signal with high accuracy.
  • FIG. 12 is a block diagram showing a configuration of an acoustic signal separation device 1C according to Embodiment 4 of the present invention.
  • The acoustic signal separation device 1C includes a feature extraction unit 2A, a data estimation unit 3A, a data classification unit 5A, a signal regeneration unit 6A, a data calculation unit 7A, and a section information acquisition unit 9; it separates the acoustic signal included in the input signal a into acoustic signals for each component and outputs an output signal g including the acoustic signal for each component.
  • the feature quantity extraction unit 2A extracts the feature quantity b for classification and the feature quantity c for signal regeneration from the input signal a for each section based on the section information h.
  • A section is a segment of the input signal a; the input signal a is divided into sections according to changes in the acoustic signal.
  • the section information h is information indicating the position of the section separating the input signal a.
  • the section information h may be information that can specify the position of the section in the input signal a, such as time information of the input signal a or a change value of a feature of the input signal a.
  • the data estimation unit 3A estimates the classification data d based on the classification feature b extracted from the input signal a for each section using the DNN 3a.
  • a network parameter 3b learned in advance so as to estimate classification data d based on the classification feature value b in section units is set.
  • the DNN 3a in which the network parameter 3b is set estimates the classification data d for each section by hierarchically operating the classification feature amount b.
  • As the DNN 3a, for example, an RNN or a CNN may be used.
  • The feature quantity extraction unit 2A performs the short-time Fourier transform on the acoustic signal included in the input signal a for each section to obtain the amplitude on the frequency axis, and calculates the feature quantities based on the amplitude on the frequency axis.
  • Data in which the feature quantities calculated from the acoustic signal in this manner are arranged in time series may be used as the classification feature quantity b.
  • The data classification unit 5A classifies the classification data d for each section into components based on the classification condition data e calculated by the data calculation unit 7A.
  • Classification result information f which is classification data d classified for each section and each component is output to the signal regeneration unit 6A.
  • Classification methods such as k-means or GMM may be used to classify the classification data d.
  • The signal regeneration unit 6A regenerates an acoustic signal for each component based on the classification data d for each section and each component in the classification result information f, the section information h, and the signal regeneration feature amount c of the input signal a for each section.
  • the signal regeneration unit 6A outputs an output signal g which is an acoustic signal for each component.
  • the output signal g may include an image signal and text information corresponding to the acoustic signal for each component.
  • the data calculation unit 7A calculates the classification condition data e for each section based on the section information h input from the section information acquisition unit 9 and the input signal a for each section.
  • the input signal a for each section includes, in addition to the acoustic signal, sensor information used to identify the number of components of the acoustic signal corresponding to the section and the type of the component of the acoustic signal.
  • The sensor information includes, for example, biological information of a person who can be a sound source, image information in which a person who can be a sound source is captured, and physical information such as vibration or temperature changes caused by the person's speech.
  • the section information acquisition unit 9 acquires the section information h, and outputs the section information h to each of the feature extraction unit 2A, the signal regeneration unit 6A, and the data calculation unit 7A.
  • The section information acquisition unit 9 may acquire section information h created by an external device, or may acquire section information h input by the user of the acoustic signal separation device 1C.
  • The acoustic signal separation device 1C includes a processing circuit for executing the processing from step ST1c to step ST5c described later with reference to FIG. 13.
  • the processing circuit may be the processing circuit 103 that is dedicated hardware shown in FIG. 2A, or may be the processor 104 that executes a program stored in the memory 105 shown in FIG. 2B.
  • FIG. 13 is a flowchart showing an acoustic signal separation method according to the fourth embodiment.
  • the feature extraction unit 2A extracts the classification feature b and the signal regeneration feature c from the input signal a divided into sections based on the section information h input from the section information acquisition unit 9 (step ST1c).
  • the feature extraction unit 2A outputs the classification feature b extracted from the input signal a for each section to the data estimation unit 3A, and regenerates the signal regeneration feature c extracted from the input signal a for each section. Output to section 6A.
  • the data estimation unit 3A estimates classification data d for each section of the input signal a from the classification feature b extracted from the input signal a for each section using the DNN 3a (step ST2c).
  • the data estimation unit 3A outputs classification data d for each section to the data classification unit 5A.
  • the data classification unit 5A classifies classification data d estimated for each section of the input signal a for each component based on the classification condition data e calculated by the data calculation unit 7A (step ST3c). Classification result information f that is classification data d classified for each component is output from the data classification unit 5A to the signal regeneration unit 6A.
  • The signal regeneration unit 6A regenerates an acoustic signal for each component based on the classification data d classified for each component in the classification result information f, the section information h, and the signal regeneration feature amount c of the input signal a for each section (step ST4c).
  • the feature quantity extraction unit 2A confirms whether or not an unprocessed section remains in the input signal a for each section (step ST5c).
  • If an unprocessed section remains (step ST5c; YES), the process returns to step ST1c, and the series of processes described above is performed on the input signal a of the remaining section. If no unprocessed section remains (step ST5c; NO), the acoustic signal separation device 1C ends the processing of FIG. 13.
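  • A hedged sketch of the per-section loop of Embodiment 4, assuming the section information h is a list of (start, end) sample indices and abstracting each unit as a callable; it mirrors the flowchart of FIG. 13 rather than reproducing the actual implementation.

```python
def separate_by_section(input_signal_a, section_info_h, extract_fn, estimate_fn,
                        condition_fn, classify_fn, regenerate_fn):
    """Sketch of Embodiment 4: the input signal a is divided into sections by the
    section information h, and each section is separated independently."""
    outputs = []
    for start, end in section_info_h:                 # one section of the input signal a
        section = input_signal_a[start:end]
        feature_b, feature_c = extract_fn(section)    # feature extraction unit 2A (ST1c)
        data_d = estimate_fn(feature_b)               # data estimation unit 3A, DNN 3a (ST2c)
        condition_e = condition_fn(section)           # data calculation unit 7A
        result_f = classify_fn(data_d, condition_e)   # data classification unit 5A (ST3c)
        outputs.append(regenerate_fn(result_f, feature_c))  # signal regeneration 6A (ST4c)
    return outputs                                    # output signal g per section
```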
  • the feature quantity extraction unit 2A extracts the feature quantity from the input signal a for each section based on the section information h.
  • the data estimation unit 3A estimates classification data d for each section based on the classification feature b using DNN 3a.
  • the data classification unit 5A classifies classification data d for each section for each component based on the classification condition data e.
  • The signal regeneration unit 6A regenerates an acoustic signal for each component for each section of the input signal a.
  • The present invention is not limited to the above embodiments; within the scope of the invention, the embodiments may be freely combined, any component of any embodiment may be modified, and any optional component may be omitted from any embodiment.
  • The acoustic signal separation device according to the present invention can separate an acoustic signal into its components with high accuracy, and can therefore be used in, for example, a conference system in which a plurality of sound sources are present.
  • Reference signs: 1, 1A, 1B, 1C: acoustic signal separation device; 2, 2A: feature quantity extraction unit; 3, 3A: data estimation unit; 3a: DNN; 3b: network parameter; 4: data acquisition unit; 5, 5A: data classification unit; 6, 6A: signal regeneration unit; 7, 7A: data calculation unit; 8: parameter switching unit; 9: section information acquisition unit; 100: acoustic interface; 101: image interface; 102: text input interface; 103: processing circuit; 104: processor; 105: memory; 106: sensor interface.

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A feature extraction unit (2) extracts a feature (b) for classification and a feature (c) for signal regeneration from an input signal (a). A data estimation unit (3) estimates data (d) for classification for inter-component mapping of acoustic signals output from the same sound source using DNN (3a) on the basis of the feature (b) for classification. A data classification unit (5) classifies the data (d) for classification for each component on the basis of classification parameter data (e) in which a classification condition is established, the classification parameter including information associated with at least either the number of components in the acoustic signals or the types of components in the acoustic signals. A signal regeneration unit (6) regenerates a signal for each component on the basis of data (f) for classification and the feature (c) for signal regeneration.

Description

Acoustic signal separation apparatus and acoustic signal separation method
The present invention relates to an acoustic signal separation device and an acoustic signal separation method for separating an acoustic signal in which one or more components are mixed into an acoustic signal for each component.
As a conventional technique for separating an acoustic signal in which one or more components are mixed into an acoustic signal for each component, there is, for example, the method described in Patent Document 1. In this method, a deep neural network (hereinafter referred to as a DNN) is used to separate an acoustic signal in which one or more components are mixed into an acoustic signal for each component.
International Publication No. WO 2017/007035
In the method described in Patent Document 1, the DNN separates the acoustic signal without regard to the number of components of the acoustic signal or the type of those components. Consequently, when the number of components was not considered at design time, or when the number of components or the type of components changes dynamically, the acoustic signal cannot be separated with high accuracy.
The present invention solves this problem, and aims to provide an acoustic signal separation device and an acoustic signal separation method that can accurately separate an acoustic signal into its components.
An acoustic signal separation device according to the present invention includes a feature extraction unit, a data estimation unit, a data classification unit, and a signal regeneration unit. The feature extraction unit extracts a feature amount from an input signal including an acoustic signal in which one or more components are mixed. The data estimation unit estimates first data, which associates components of the acoustic signal output from the same sound source with one another, based on the extracted feature amount, using a DNN whose network parameters have been trained in advance for this estimation. The data classification unit classifies the first data into components based on second data in which a classification condition including information on at least one of the number of components of the acoustic signal and the type of those components is set. The signal regeneration unit regenerates an acoustic signal for each component based on the first data classified for each component and the extracted feature amount.
 According to the present invention, the feature extraction unit extracts feature amounts from the input signal, the data estimation unit estimates, using the DNN and on the basis of the feature amount, first data that associates components of acoustic signals output from the same sound source with one another, the data classification unit classifies the first data for each component on the basis of second data in which a classification condition including information on at least one of the number of components of the acoustic signal and the types of the components is set, and the signal regeneration unit regenerates an acoustic signal for each component on the basis of the first data classified for each component and the feature amount. Consequently, even when the number of components of the acoustic signal was not considered at design time, or when the number of components of the acoustic signal or the types of the components change dynamically, the acoustic signal separation device can accurately separate the acoustic signal into its components.
FIG. 1 is a block diagram showing the configuration of an acoustic signal separation device according to Embodiment 1 of the present invention.
FIG. 2A is a block diagram showing a hardware configuration that implements the functions of the acoustic signal separation device according to Embodiment 1. FIG. 2B is a block diagram showing a hardware configuration that executes software implementing the functions of the acoustic signal separation device according to Embodiment 1.
FIG. 3 is a flowchart showing an acoustic signal separation method according to Embodiment 1.
FIG. 4A is a diagram showing classification data corresponding to two different components mapped into a two-dimensional space. FIG. 4B is a diagram showing the classification data of FIG. 4A classified by component.
FIG. 5A is a diagram showing classification data corresponding to the same component mapped into a two-dimensional space. FIG. 5B is a diagram showing the classification data of FIG. 5A incorrectly classified into two components. FIG. 5C is a diagram showing the classification data of FIG. 5A classified into one component.
FIG. 6A is a diagram showing classification data and classification condition data corresponding to two components arranged in time series. FIG. 6B is a diagram showing the classification data of FIG. 6A classified by component and the smoothed classification condition data.
FIG. 7 is a block diagram showing the configuration of an acoustic signal separation device according to Embodiment 2 of the present invention.
FIG. 8A is a block diagram showing a hardware configuration that implements the functions of the acoustic signal separation device according to Embodiment 2. FIG. 8B is a block diagram showing a hardware configuration that executes software implementing the functions of the acoustic signal separation device according to Embodiment 2.
FIG. 9 is a flowchart showing an acoustic signal separation method according to Embodiment 2.
FIG. 10 is a block diagram showing the configuration of an acoustic signal separation device according to Embodiment 3 of the present invention.
FIG. 11 is a flowchart showing an acoustic signal separation method according to Embodiment 3.
FIG. 12 is a block diagram showing the configuration of an acoustic signal separation device according to Embodiment 4 of the present invention.
FIG. 13 is a flowchart showing an acoustic signal separation method according to Embodiment 4.
 Hereinafter, in order to explain the present invention in more detail, embodiments for carrying out the present invention will be described with reference to the accompanying drawings.
Embodiment 1.
 FIG. 1 is a block diagram showing the configuration of an acoustic signal separation device 1 according to Embodiment 1 of the present invention. The acoustic signal separation device 1 includes a feature extraction unit 2, a data estimation unit 3, a data acquisition unit 4, a data classification unit 5, and a signal regeneration unit 6. It separates the acoustic signal included in an input signal a into acoustic signals for each component, and outputs an output signal g including the acoustic signal for each component.
 The feature extraction unit 2 extracts feature amounts from the input signal a. The input signal a may be an acoustic signal in which one or more components are mixed, or may be a signal that includes such an acoustic signal together with other signals. For example, the input signal a may include, in addition to the acoustic signal, an image signal or text data associated with the acoustic signal.
 The feature amounts extracted from the input signal a by the feature extraction unit 2 are a classification feature amount b and a signal regeneration feature amount c. The classification feature amount b is a feature amount used by the data estimation unit 3 to estimate classification data d. For example, the feature extraction unit 2 applies a short-time Fourier transform to the acoustic signal included in the input signal a to obtain amplitudes on the frequency axis, and calculates feature amounts based on those amplitudes. Data in which the feature amounts calculated from the acoustic signal in this way are arranged in time series may be used as the classification feature amount b.
 The signal regeneration feature amount c is a feature amount used by the signal regeneration unit 6 to regenerate the signal for each component. For example, the signal regeneration feature amount c may be spectral coefficients calculated by the feature extraction unit 2 by applying a short-time Fourier transform to the acoustic signal included in the input signal a, or may be image information or text data included in the input signal a.
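 The patent does not prescribe a specific implementation, but as a rough illustration of the feature extraction described above, the following sketch computes a short-time Fourier transform of the acoustic signal and derives log-magnitude frames as a stand-in for the classification feature amount b, keeping the complex coefficients as a stand-in for the signal regeneration feature amount c. The frame length, hop size, window, and log compression are assumptions for illustration only.

```python
import numpy as np

def stft(signal, frame_len=512, hop=128):
    """Short-time Fourier transform: (num_frames, frame_len // 2 + 1) complex matrix."""
    window = np.hanning(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(num_frames)])
    return np.fft.rfft(frames, axis=1)

def extract_features(signal):
    """Return (classification feature b, signal regeneration feature c) stand-ins."""
    coeffs_c = stft(signal)                       # complex spectral coefficients
    features_b = np.log(np.abs(coeffs_c) + 1e-8)  # log-magnitudes arranged in time series
    return features_b, coeffs_c
```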
 The data estimation unit 3 uses a DNN 3a to estimate the classification data d based on the classification feature amount b extracted from the input signal a by the feature extraction unit 2. The classification data d is first data that associates components of acoustic signals output from the same sound source with one another. For example, the classification data d may be costs between components of the acoustic signal that have been transformed so that the distance between time-frequency components of acoustic signals output from the same sound source becomes small.
 In the DNN 3a, network parameters 3b trained in advance to estimate the classification data d based on the classification feature amount b are set. The DNN 3a in which the network parameters 3b are set estimates the classification data d by hierarchically applying operations to the classification feature amount b. For example, an RNN (Recurrent Neural Network) or a CNN (Convolutional Neural Network) may be used as the DNN 3a.
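 As a loose illustration only, the sketch below stands in for the data estimation unit 3: a single dense layer with placeholder weights (playing the role of the trained network parameters 3b) maps each frame of the classification feature amount b to one embedding vector per time-frequency bin, so that bins from the same source can end up close together. An actual implementation would use a trained RNN or CNN as described above; the class name, shapes, and random weights are assumptions.

```python
import numpy as np

class ToyEmbeddingNetwork:
    """Hypothetical stand-in for DNN 3a; a real system would use a trained RNN or CNN."""

    def __init__(self, num_bins, embed_dim=20, seed=0):
        rng = np.random.default_rng(seed)
        self.num_bins = num_bins
        self.embed_dim = embed_dim
        # Placeholder for the pre-trained network parameters 3b.
        self.weights = 0.01 * rng.standard_normal((num_bins, num_bins * embed_dim))

    def estimate_classification_data(self, features_b):
        """Map (frames, bins) features to (frames, bins, embed_dim) classification data d."""
        hidden = np.tanh(features_b @ self.weights)
        data_d = hidden.reshape(features_b.shape[0], self.num_bins, self.embed_dim)
        # Normalise so that distances between bins from the same source stay comparable.
        norms = np.linalg.norm(data_d, axis=-1, keepdims=True) + 1e-8
        return data_d / norms
```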
 The data acquisition unit 4 receives and acquires classification condition data e. The classification condition data e acquired by the data acquisition unit 4 is output to the data classification unit 5.
 In the acoustic signal separation device 1, the data classification unit 5 may acquire the classification condition data e directly, and the data acquisition unit 4 may be provided in a device separate from the acoustic signal separation device 1.
 That is, in the acoustic signal separation device 1, it is sufficient that the data classification unit 5 has a function of acquiring the classification condition data e, and the data acquisition unit 4 need not be provided.
 The classification condition data e is second data in which a classification condition for the classification data d is set. The classification condition set in the classification condition data e includes information on at least one of the number of components of the acoustic signal and the types of the components of the acoustic signal. The information on the number of components of the acoustic signal may be data indicating a dynamically changing number of sound sources, for example, a sound source count sequence in which the number of sound sources is arranged in time series.
 The information on the types of the components of the acoustic signal may be any information that can identify the sound source, such as the gender of a speaker, the type of language, or the type of emitted sound. For example, when the types of the components of the acoustic signal are a siren and an animal cry, the acoustic signal is a mixture of a siren component output from an alarm device and a cry component emitted by an animal.
 The data classification unit 5 classifies the classification data d estimated by the data estimation unit 3 for each component, based on the classification condition data e acquired by the data acquisition unit 4.
 A classification method such as k-means clustering or GMM (Gaussian Mixture Models) may be used to classify the classification data d.
 Classification result information f, which is the classification data d classified by the data classification unit 5, is output to the signal regeneration unit 6.
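 As one hedged sketch of this step, assuming the classification data d is an array of embedding-like vectors per time-frequency bin (as in the earlier stand-in) and that the classification condition simply supplies the number of sound sources, k-means clustering could be applied as follows. scikit-learn's KMeans is one possible implementation choice, not something prescribed by the patent.

```python
import numpy as np
from sklearn.cluster import KMeans

def classify_components(classification_data_d, num_sources):
    """Cluster (frames, bins, dim) classification data into per-bin component labels."""
    frames, bins, dim = classification_data_d.shape
    flat = classification_data_d.reshape(-1, dim)
    labels = KMeans(n_clusters=num_sources, n_init=10).fit_predict(flat)
    return labels.reshape(frames, bins)   # classification result information f
```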
 The signal regeneration unit 6 receives the classification result information f from the data classification unit 5, and regenerates an acoustic signal for each component from the signal regeneration feature amount c based on the classification data d for each component in the classification result information f. The signal regeneration unit 6 outputs an output signal g, which contains the regenerated acoustic signal for each component. The output signal g may also include an image signal and text information corresponding to the regenerated acoustic signal of each component.
 FIG. 2A is a block diagram showing a hardware configuration that implements the functions of the acoustic signal separation device 1. FIG. 2B is a block diagram showing a hardware configuration that executes software implementing the functions of the acoustic signal separation device 1. In FIG. 2A and FIG. 2B, an acoustic interface 100 is an interface that receives the acoustic signal included in the input signal a and outputs the acoustic signal included in the output signal g. For example, the acoustic interface 100 is connected to a microphone that collects the acoustic signal and to a speaker that outputs the acoustic signal.
 An image interface 101 is an interface that receives the image signal included in the input signal a and outputs the image signal included in the output signal g. For example, the image interface 101 is connected to a camera that captures the image signal and to a display that displays the image signal.
 A text input interface 102 is an interface for inputting the text information included in the input signal a. For example, the text input interface 102 is connected to a keyboard or a mouse for inputting text information.
 A memory (not shown) included in the processing circuit 103 shown in FIG. 2A, or the memory 105 shown in FIG. 2B, temporarily stores the input signal a, the classification feature amount b, the signal regeneration feature amount c, the classification data d, the classification condition data e, the classification result information f, and the output signal g. The processing circuit 103 or the processor 104 reads these data as needed and performs the acoustic signal separation processing.
 The functions of the feature extraction unit 2, the data estimation unit 3, the data acquisition unit 4, the data classification unit 5, and the signal regeneration unit 6 in the acoustic signal separation device 1 are implemented by a processing circuit.
 That is, the acoustic signal separation device 1 includes a processing circuit for executing the processing from step ST1 to step ST4 described later with reference to FIG. 3. The processing circuit may be dedicated hardware, or may be a CPU (Central Processing Unit) that executes a program stored in a memory.
 When the processing circuit is the dedicated hardware processing circuit 103 shown in FIG. 2A, the processing circuit 103 corresponds to, for example, a single circuit, a composite circuit, a programmed processor, a parallel-programmed processor, an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or a combination thereof. The functions of the feature extraction unit 2, the data estimation unit 3, the data acquisition unit 4, the data classification unit 5, and the signal regeneration unit 6 may be implemented by separate processing circuits, or these functions may be implemented collectively by a single processing circuit.
 When the processing circuit is the processor 104 shown in FIG. 2B, the functions of the feature extraction unit 2, the data estimation unit 3, the data acquisition unit 4, the data classification unit 5, and the signal regeneration unit 6 are implemented by software, firmware, or a combination of software and firmware. The software or firmware is written as a program and stored in the memory 105.
 The processor 104 reads and executes the program stored in the memory 105, thereby implementing the functions of the feature extraction unit 2, the data estimation unit 3, the data acquisition unit 4, the data classification unit 5, and the signal regeneration unit 6. That is, the acoustic signal separation device 1 includes the memory 105 for storing a program which, when executed by the processor 104, results in the execution of the processing from step ST1 to step ST4 shown in FIG. 3. These programs cause a computer to execute the procedures or methods of the feature extraction unit 2, the data estimation unit 3, the data acquisition unit 4, the data classification unit 5, and the signal regeneration unit 6.
 The memory 105 may be a computer-readable storage medium storing a program for causing a computer to function as the feature extraction unit 2, the data estimation unit 3, the data acquisition unit 4, the data classification unit 5, and the signal regeneration unit 6.
 The memory 105 corresponds to, for example, a nonvolatile or volatile semiconductor memory such as a RAM (Random Access Memory), a ROM (Read Only Memory), a flash memory, an EPROM (Erasable Programmable Read Only Memory), or an EEPROM (Electrically-EPROM), a magnetic disk, a flexible disk, an optical disc, a compact disc, a mini disc, or a DVD. The memory 105 may also be an external memory such as a USB (Universal Serial Bus) memory.
 Some of the functions of the feature extraction unit 2, the data estimation unit 3, the data acquisition unit 4, the data classification unit 5, and the signal regeneration unit 6 may be implemented by dedicated hardware, and some by software or firmware. For example, the functions of the feature extraction unit 2, the data estimation unit 3, and the data acquisition unit 4 may be implemented by a processing circuit that is dedicated hardware, while the functions of the data classification unit 5 and the signal regeneration unit 6 may be implemented by the processor 104 reading and executing a program stored in the memory 105. In this way, the processing circuit can implement each of the above functions by hardware, software, firmware, or a combination thereof.
 Next, the operation will be described.
 FIG. 3 is a flowchart showing the acoustic signal separation method according to Embodiment 1.
 The feature extraction unit 2 extracts the classification feature amount b and the signal regeneration feature amount c from the input signal a (step ST1). The classification feature amount b is output from the feature extraction unit 2 to the data estimation unit 3, and the signal regeneration feature amount c is output from the feature extraction unit 2 to the signal regeneration unit 6.
 The input signal a may include, in addition to the acoustic signal received by the acoustic interface 100, an image signal input through the image interface 101 or text information input through the text input interface 102.
 The feature extraction unit 2 may also read the input signal a from the memory (not shown) included in the processing circuit 103 or from the memory 105 and extract the feature amounts from it.
 Furthermore, the input signal a may be stream data.
 Next, the data estimation unit 3 estimates the classification data d based on the classification feature amount b using the DNN 3a (step ST2). The classification data d is output from the data estimation unit 3 to the data classification unit 5.
 The data classification unit 5 classifies the classification data d estimated by the data estimation unit 3 for each component, based on the classification condition data e (step ST3). The data classification unit 5 outputs to the signal regeneration unit 6 the classification result information f, which is the classification data d classified for each component.
 FIG. 4A is a diagram showing classification data d1 and d2 corresponding to components of two different acoustic signals mapped into a two-dimensional space. In the example shown in FIG. 4A, there are two sound sources, sound source A and sound source B, and the input signal a contains a mixture of the component of the acoustic signal output from sound source A and the component of the acoustic signal output from sound source B.
 The classification data d1 indicated by circles is data that associates the components of the acoustic signal output from sound source A, and the classification data d2 indicated by triangles is data that associates the components of the acoustic signal output from sound source B.
 However, when, for example, the output state of the acoustic signal from a sound source changes, the classification feature amount b also changes accordingly. Therefore, when the data estimation unit 3 estimates the classification data d based on the classification feature amount b using the DNN 3a, the values of the classification data d may vary with the change of the classification feature amount b even for classification data d corresponding to components of the acoustic signal output from the same sound source.
 The classification data d that has spread over a plurality of values is input to the data classification unit 5 without any indication of whether it is classification data d1 belonging to sound source A or classification data d2 belonging to sound source B.
 FIG. 4B is a diagram showing the classification data d1 and d2 classified by component. In FIG. 4A and FIG. 4B, it is assumed that a classification condition stating that the number of sound sources is two is set in the classification condition data e. Since the number of sound sources is two, the data classification unit 5 classifies the classification data d1 into a first group A1 corresponding to sound source A, and classifies the classification data d2 into a second group A2 corresponding to sound source B.
 FIG. 5A is a diagram showing classification data d1 corresponding to the same component mapped into a two-dimensional space. FIG. 5B is a diagram showing the classification data d1 incorrectly classified into two components. FIG. 5C is a diagram showing the classification data d1 classified into one component.
 In the example shown in FIG. 5A, the only sound source is sound source A, and the acoustic signal included in the input signal a contains only the component of the acoustic signal output from sound source A.
 However, in FIG. 5A, the values of the classification data d1 vary, so that classification data d1 with a plurality of values arise. Furthermore, the classification data d that has spread over a plurality of values is input to the data classification unit 5 without any indication of which sound source it belongs to.
 In the method described in Patent Document 1, the acoustic signal is separated into a separation number preset in the DNN (for example, a separation number of 2). Therefore, when such a DNN is applied to the classification of the classification data d, as shown in FIG. 5B, the plurality of classification data d1 belonging to the single sound source A are highly likely to be erroneously separated into a first group B1 and a second group B2 that differ from each other.
 In contrast, in the acoustic signal separation device 1, the data classification unit 5 classifies the classification data d based on, for example, the number of sound sources set in the classification condition data e. In FIG. 5A, the number of sound sources is one (sound source A only), so the data classification unit 5 correctly classifies the plurality of classification data d into a group C corresponding to sound source A, as shown in FIG. 5C. In this way, the acoustic signal separation device 1 can prevent separation errors of the acoustic signal caused by a mismatch between the separation number and the number of sound sources.
 FIG. 6A is a diagram showing classification data d1 and d2 and classification condition data e corresponding to two components arranged in time series. For example, when a plurality of speakers are the sound sources and the speakers speak, the number of sound sources changes dynamically. The example shown in FIG. 6A shows a case in which an acoustic signal is output from sound source A, and an acoustic signal is subsequently also output from sound source B.
 As described above, the method described in Patent Document 1 cannot accurately separate the acoustic signal when the number of sound sources changes dynamically.
 In contrast, in the acoustic signal separation device 1, the classification data d can be classified accurately by using a classification condition that includes information on at least one of the number of components of the acoustic signal and the types of the components. For example, as shown in FIG. 6A, the data classification unit 5 uses the classification condition data e, which is a sound source count sequence indicating the number of sound sources in time series, to classify the classification data d1 into a first group D1 corresponding to sound source A and the classification data d2 into a second group D2 corresponding to sound source B.
 In this way, by referring to the sound source count sequence indicated by the classification condition data e, the data classification unit 5 can classify the classification data d accurately even when the number of sound sources changes dynamically.
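 The following is a hedged sketch, under assumed data shapes, of how such a time-varying sound source count sequence could drive the classification: each frame of classification data is clustered with the number of clusters taken from the corresponding entry of the sequence. Frame-wise clustering is one possible realisation, not the only one consistent with the description.

```python
import numpy as np
from sklearn.cluster import KMeans

def classify_with_source_count_sequence(classification_data_d, source_counts_e):
    """classification_data_d: (frames, bins, dim); source_counts_e: (frames,) integers."""
    labels_per_frame = []
    for frame_vectors, k in zip(classification_data_d, source_counts_e):
        if k <= 1:
            labels = np.zeros(frame_vectors.shape[0], dtype=int)
        else:
            labels = KMeans(n_clusters=int(k), n_init=10).fit_predict(frame_vectors)
        labels_per_frame.append(labels)
    return np.stack(labels_per_frame)   # (frames, bins) component labels
```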
 FIG. 6B is a diagram showing the classification data d1 and d2 classified by component and the smoothed classification condition data. In the classification condition data e shown in FIG. 6A, the number of sound sources at time e1 is "1" even though the number of sound sources in the times before and after time e1 continues to be "2". When the number of sound sources is thus expected to be erroneous in view of the trend in the surrounding times, the data classification unit 5 smooths the time-series change of the sound source count sequence. For example, the data classification unit 5 converts the number of sound sources at time e1 to the average of the value "2" in the times before and after time e1. As a result, even if the number of sound sources in the sound source count sequence changes abruptly, the data classification unit 5 can classify the classification data d accurately.
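 As a small illustration of this smoothing, under the assumption that the sound source count sequence is simply an integer array, a rounded moving average over neighbouring frames removes an isolated outlier such as the value at time e1. The window length and the choice of a moving average are illustrative assumptions.

```python
import numpy as np

def smooth_source_counts(counts, window=3):
    """Smooth a time series of sound source counts and round back to integers."""
    kernel = np.ones(window) / window
    padded = np.pad(counts.astype(float), window // 2, mode="edge")
    averaged = np.convolve(padded, kernel, mode="valid")
    return np.rint(averaged).astype(int)

# An isolated dip from 2 sources to 1 is smoothed back to 2.
print(smooth_source_counts(np.array([2, 2, 1, 2, 2])))   # -> [2 2 2 2 2]
```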
 Although the case where the classification condition is time-series data of the number of sound sources has been described, the classification condition may be time-series data of the types of the components of the acoustic signal. For example, in FIG. 6A, during the period in which sound source A outputs an acoustic signal, information on the type of the component of the acoustic signal output from sound source A is set as the classification condition of the classification condition data e. During the period in which both sound source A and sound source B output acoustic signals, information on the types of the components of the acoustic signals output from sound source A and sound source B is set as the classification condition of the classification condition data e. By referring to the time-series data of the types of the components of the acoustic signal, the data classification unit 5 can classify the classification data d accurately even when the types of the components of the acoustic signal change dynamically.
 Returning to the description of FIG. 3, the signal regeneration unit 6 receives the signal regeneration feature amount c from the feature extraction unit 2 and the classification result information f from the data classification unit 5, and regenerates an acoustic signal for each component based on the classification data d for each component in the classification result information f and the signal regeneration feature amount c (step ST4).
 For example, the signal regeneration unit 6 identifies the signal regeneration feature amount c corresponding to sound source A using the classification data d1 classified into the first group A1 shown in FIG. 4B, and regenerates the acoustic signal of that component based on the identified signal regeneration feature amount c and the classification data d1.
 Similarly, the signal regeneration unit 6 identifies the signal regeneration feature amount c corresponding to sound source B using the classification data d2 classified into the second group A2 shown in FIG. 4B, and regenerates the signal of the component output from sound source B based on the identified signal regeneration feature amount c and the classification data d2. The acoustic signal included in the input signal a is thereby separated into the acoustic signal of the component output from sound source A and the acoustic signal of the component output from sound source B.
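 One common way of realising this regeneration, offered here only as a hedged sketch rather than the patent's prescribed method, is to turn the per-bin component labels into a binary time-frequency mask for each component, apply the mask to the complex spectral coefficients (the signal regeneration feature amount c), and resynthesise each component with an inverse STFT. The use of scipy.signal.istft and the STFT parameters are assumptions.

```python
import numpy as np
from scipy.signal import istft

def regenerate_components(coeffs_c, labels_f, num_sources, fs, nperseg=512, noverlap=384):
    """coeffs_c: (frames, bins) complex STFT; labels_f: (frames, bins) component labels."""
    separated = []
    for source in range(num_sources):
        mask = (labels_f == source).astype(float)     # binary time-frequency mask
        masked = (coeffs_c * mask).T                  # istft expects (bins, frames)
        _, waveform = istft(masked, fs=fs, nperseg=nperseg, noverlap=noverlap)
        separated.append(waveform)
    return separated                                  # one time-domain signal per component
```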
 The signal regeneration unit 6 may also output the regenerated acoustic signal of each component in association with the classification condition data e. For example, when information indicating that the sound sources are sound source A and sound source B is set in the classification condition data e as information on the types of the components of the acoustic signal, the signal regeneration unit 6 associates the regenerated signal corresponding to sound source A with information indicating that its sound source is sound source A, associates the signal corresponding to sound source B with information indicating that its sound source is sound source B, and outputs them. In this way, the signal regeneration unit 6 can provide an output signal g from which the sound source that output each component-wise acoustic signal can be identified.
 As a method of performing the above association, the signal regeneration unit 6 may associate the type of the component of the acoustic signal with the regenerated component acoustic signal based on the distance between a teacher signal related to the type of the component and the regenerated component acoustic signal.
 The signal regeneration unit 6 may also compare the utterance timing of a speaker analyzed from the image signal extracted from the input signal a with the output timing of the regenerated component acoustic signal, and associate a speaker and an acoustic signal whose timings overlap. For example, the signal regeneration unit 6 may identify the utterance timing of the speaker from the lip information of the speaker included in the image signal.
 The classification condition set in the classification condition data e may be the number of components of the acoustic signal, but may also be at least one of a lower limit, an upper limit, and a range of the number of components of the acoustic signal. For example, when separating the speech of a meeting into an acoustic signal for each attendee, the data classification unit 5 may stop classifying the classification data d when the number of components of the acoustic signal falls below the lower limit. The data classification unit 5 may also stop classifying the classification data d when the number of components of the acoustic signal exceeds the upper limit, or when the number of components of the acoustic signal goes outside the range of the classification condition. In these cases, the separation of the acoustic signal is stopped. In this way, a stop criterion for the classification of the classification data d may be set in the classification condition, and the data classification unit 5 can classify the classification data d accurately within the range specified by the classification condition.
 Feature information indicating a feature of the sound source may also be set in the classification condition of the classification condition data e. The feature information indicating a feature of the sound source may be the manner in which the sound source outputs the acoustic signal, or a physical feature of the sound source. For example, the output manner of the acoustic signal may be the average output time over which the acoustic signal is output from the sound source. When the sound source is a person, the physical feature of the sound source may be information on the person's physique. The data classification unit 5 can classify the classification data d accurately in accordance with the feature information indicating the feature of the sound source.
 As described above, in the acoustic signal separation device 1 according to Embodiment 1, the feature extraction unit 2 extracts feature amounts from the input signal a. The data estimation unit 3 estimates the classification data d based on the classification feature amount b using the DNN 3a. The data classification unit 5 classifies the classification data d for each component based on the classification condition data e in which a classification condition including information on at least one of the number of components of the acoustic signal and the types of the components is set. The signal regeneration unit 6 regenerates an acoustic signal for each component based on the classification data d classified for each component and the signal regeneration feature amount c. Consequently, even when the number of components of the acoustic signal was not considered at design time, or when the number of components of the acoustic signal or the types of the components change dynamically, the acoustic signal separation device 1 can accurately separate the acoustic signal into its components.
 In the acoustic signal separation device 1 according to Embodiment 1, the classification condition set in the classification condition data e is at least one of a lower limit of the number of components of the acoustic signal, an upper limit of the number of sound sources, and a range. This enables the data classification unit 5 to classify the classification data d accurately within the range specified by the classification condition.
 In the acoustic signal separation device 1 according to Embodiment 1, the classification condition set in the classification condition data e is a sound source count sequence indicating a time-series change in the number of sound sources. By referring to the sound source count sequence, the data classification unit 5 can classify the classification data d accurately even when the number of sound sources changes dynamically.
 In the acoustic signal separation device 1 according to Embodiment 1, the data classification unit 5 smooths the time-series change of the sound source count sequence. This enables the data classification unit 5 to classify the classification data d accurately even when the number of sound sources in the sound source count sequence changes abruptly.
 In the acoustic signal separation device 1 according to Embodiment 1, the classification condition set in the classification condition data e is feature information indicating a feature of the sound source. This enables the data classification unit 5 to classify the classification data d accurately in accordance with the feature information indicating the feature of the sound source.
 In the acoustic signal separation device 1 according to Embodiment 1, the signal regeneration unit 6 outputs information on the types of the components of the acoustic signal in association with the signal of each component. This enables the signal regeneration unit 6 to provide an output signal g from which the sound source that output the regenerated acoustic signal can be identified.
Embodiment 2.
 FIG. 7 is a block diagram showing the configuration of an acoustic signal separation device 1A according to Embodiment 2 of the present invention. In FIG. 7, the same components as in FIG. 1 are given the same reference numerals and their description is omitted. The acoustic signal separation device 1A includes a feature extraction unit 2, a data estimation unit 3, a data classification unit 5, a signal regeneration unit 6, and a data calculation unit 7. It separates the acoustic signal included in the input signal a into acoustic signals for each component, and outputs an output signal g including the acoustic signal for each component.
 The data calculation unit 7 calculates the classification condition data e from the input signal a. The input signal a includes, in addition to the acoustic signal, sensor information for identifying the number of components of the acoustic signal and the types of the components. The sensor information includes, for example, biological information of a person who can be a sound source, such as brain waves or a heart rate, image information in which a person who can be a sound source is captured, and physical information such as vibration or temperature changes caused by the person's speech. Using the sensor information included in the input signal a, the data calculation unit 7 calculates the classification condition data e in which a classification condition including information on at least one of the number of components of the acoustic signal and the types of the components is set.
 FIG. 8A is a block diagram showing a hardware configuration that implements the functions of the acoustic signal separation device 1A. FIG. 8B is a block diagram showing a hardware configuration that executes software implementing the functions of the acoustic signal separation device 1A. In FIG. 8A and FIG. 8B, the same components as in FIG. 2A and FIG. 2B are given the same reference numerals and their description is omitted. A sensor interface 106 is an interface for inputting the sensor information described above. For example, the sensor interface 106 is connected to a biological sensor that detects biological information of a person who can be a sound source, a camera that captures a person who can be a sound source, or a physical sensor that detects vibration or temperature changes caused by a person's speech.
 The functions of the feature extraction unit 2, the data estimation unit 3, the data classification unit 5, the signal regeneration unit 6, and the data calculation unit 7 in the acoustic signal separation device 1A are implemented by a processing circuit. That is, the acoustic signal separation device 1A includes a processing circuit for executing the processing from step ST1a to step ST5a described later with reference to FIG. 9. The processing circuit may be the processing circuit 103, which is dedicated hardware, or may be the processor 104 that executes a program stored in the memory 105.
 Next, the operation will be described.
 FIG. 9 is a flowchart showing the acoustic signal separation method according to Embodiment 2. In FIG. 9, the processing in steps ST1a, ST3a, ST4a, and ST5a is the same as that in steps ST1, ST2, ST3, and ST4 shown in FIG. 3, and its description is therefore omitted.
 In step ST2a, the data calculation unit 7 calculates the classification condition data e based on the input signal a including the sensor information. For example, the data calculation unit 7 identifies the presence or absence of persons and the number of persons based on the sensor information, and calculates the classification condition data e in which a classification condition including the identified information is set.
 The data calculation unit 7 may also detect a person's speech based on the person's lip information and thereby identify the number of components of the acoustic signal.
 Furthermore, the data calculation unit 7 may identify the number of components of the acoustic signal and the types of the components using a DNN that has been machine-learned to output at least one of the number of components of the acoustic signal and the types of the components from the sensor information included in the input signal a.
 The data calculation unit 7 may also detect the number of acoustic emission sources based on the acoustic signal and identify the number of components of the acoustic signal. For example, the number of speakers detected using a speaker verification technique that detects specific speakers may be used as the number of components of the acoustic signal.
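 As a small, hedged illustration of one such calculation, assume that per-frame speaking-activity flags (for example, derived from lip movement of persons detected in the camera image) are already available; the number of simultaneously active persons then gives the time series of source counts in the classification condition data e. The activity detection itself is assumed to be provided elsewhere.

```python
import numpy as np

def source_counts_from_activity(activity):
    """activity: (frames, persons) boolean array -> (frames,) sound source counts."""
    return activity.astype(int).sum(axis=1)

activity = np.array([[True, False],
                     [True, True],
                     [False, True]])
print(source_counts_from_activity(activity))   # -> [1 2 1]
```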
 The data classification unit 5 classifies the classification data d estimated by the data estimation unit 3 for each component based on the classification condition data e calculated by the data calculation unit 7. From the classification data d classified for each component, the data classification unit 5 generates classification result information f in which pieces of classification data d belonging to the same component are associated with one another, and outputs it to the signal regeneration unit 6.
 As described above, the acoustic signal separation device 1A according to Embodiment 2 includes the data calculation unit 7 that calculates the classification condition data e from the input signal a. With this configuration, the acoustic signal separation device 1A can obtain the classification condition data e from the input signal a. As a result, the acoustic signal separation device 1A can accurately separate the acoustic signal even when the number of components of the acoustic signal or the types of the components are unknown.
Embodiment 3.
 FIG. 10 is a block diagram showing the configuration of an acoustic signal separation device 1B according to Embodiment 3 of the present invention. In FIG. 10, the same components as in FIG. 1 and FIG. 7 are given the same reference numerals and their description is omitted. The acoustic signal separation device 1B includes a feature extraction unit 2, a data estimation unit 3, a data classification unit 5, a signal regeneration unit 6, a data calculation unit 7, and a parameter switching unit 8. It separates the acoustic signal included in the input signal a into acoustic signals for each component, and outputs an output signal g including the acoustic signal for each component.
 The parameter switching unit 8 switches the network parameters 3b of the DNN 3a based on the classification condition data e. The network parameters 3b are learned in advance in accordance with the features of various sound sources. For example, the parameter switching unit 8 refers to the classification condition data e, selects, from among the various network parameters 3b learned in advance, the network parameters 3b corresponding to the features of the sound source to be processed, and sets these network parameters 3b in the DNN 3a.
 The functions of the feature extraction unit 2, the data estimation unit 3, the data classification unit 5, the signal regeneration unit 6, the data calculation unit 7, and the parameter switching unit 8 in the acoustic signal separation device 1B are implemented by a processing circuit. That is, the acoustic signal separation device 1B includes a processing circuit for executing the processing from step ST1b to step ST6b described later with reference to FIG. 11. The processing circuit may be the processing circuit 103, which is the dedicated hardware shown in FIG. 2A, or may be the processor 104 that executes a program stored in the memory 105 shown in FIG. 2B.
 Next, the operation will be described.
 FIG. 11 is a flowchart showing the acoustic signal separation method according to Embodiment 3. In FIG. 11, the processing in steps ST1b, ST2b, and ST4b to ST6b is the same as that in steps ST1a, ST2a, and ST3a to ST5a shown in FIG. 9, and its description is therefore omitted.
 In step ST3b, the parameter switching unit 8 refers to the classification condition data e, selects the network parameters 3b corresponding to the sound source to be processed from among the various network parameters 3b learned in advance, and sets the selected network parameters 3b in the DNN 3a.
 In the classification condition data e, a classification condition including information on at least one of the number of components of the acoustic signal and the types of the components of the acoustic signal is set. The information on the types of the components of the acoustic signal contains information that can identify the sound source, such as the gender of a speaker, the type of language, and the type of sound. In addition to the types of the components of the acoustic signal, information on the features of the sound sources, such as the number of sound sources, the average output time of the acoustic signal from a sound source, and the physique of a person who is a sound source, may be set in the classification condition.
 The parameter switching unit 8 may select the network parameters 3b using a decision tree based on predetermined selection rules. The selection rules may be, for example, rules that specify the types of the components of the acoustic signal and the number of components, such as "two males" or "three or more females".
 The parameter switching unit 8 may also select the network parameters 3b using a neural network or a GMM that has been machine-learned to select the network parameters 3b corresponding to the sound source to be processed, with the classification condition data e as input.
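 A hedged sketch of the rule-based variant is given below: pre-learned parameter sets are kept per condition, and a simple rule over the classification condition data e (here represented as a dictionary with the number of sources and speaker gender) decides which set is loaded into the DNN 3a. The keys, rule, and file names are purely illustrative; the patent equally allows a decision tree or a learned selector.

```python
def select_network_parameters(condition_e, pretrained_parameters):
    """Pick a pre-learned parameter set 3b from a simple rule on the condition data e."""
    if condition_e.get("num_sources", 0) >= 3:
        key = "three_or_more_speakers"
    elif condition_e.get("gender") == "female":
        key = "female_speakers"
    else:
        key = "default"
    return pretrained_parameters[key]

pretrained = {"default": "params_default.npz",
              "female_speakers": "params_female.npz",
              "three_or_more_speakers": "params_many.npz"}
print(select_network_parameters({"num_sources": 2, "gender": "male"}, pretrained))
```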
 As described above, the acoustic signal separation device 1B according to Embodiment 3 includes the parameter switching unit 8 that switches the network parameters 3b set in the DNN 3a based on the classification condition data e. With this configuration, the data estimation unit 3 can estimate the classification data d corresponding to the sound source to be processed using the DNN 3a with the switched network parameters 3b. As a result, the acoustic signal separation device 1B can separate the acoustic signal with high accuracy.
Fourth Embodiment.
 FIG. 12 is a block diagram showing a configuration of an acoustic signal separation device 1C according to a fourth embodiment of the present invention. In FIG. 12, the same components as those in FIG. 1 and FIG. 7 are denoted by the same reference numerals, and their description is omitted. The acoustic signal separation device 1C includes a feature quantity extraction unit 2A, a data estimation unit 3A, a data classification unit 5A, a signal regeneration unit 6A, a data calculation unit 7A, and a section information acquisition unit 9, separates the acoustic signal included in the input signal a into an acoustic signal for each component, and outputs an output signal g containing the acoustic signal for each component.
 The feature quantity extraction unit 2A extracts the classification feature quantity b and the signal regeneration feature quantity c from the input signal a for each section on the basis of section information h. Here, a section is a segment of the input signal a, and the input signal a is divided into sections in accordance with changes in the acoustic signal. The section information h is information indicating the positions of the sections into which the input signal a is divided. For example, the section information h may be any information that can specify the position of a section in the input signal a, such as time information of the input signal a or a change value of a feature quantity of the input signal a.
 The data estimation unit 3A estimates the classification data d on the basis of the classification feature quantity b extracted from the input signal a for each section, by using the DNN 3a. In the DNN 3a, a network parameter 3b learned in advance so as to estimate the classification data d on the basis of the classification feature quantity b in units of sections is set. The DNN 3a in which the network parameter 3b is set estimates the classification data d for each section by hierarchically performing operations on the classification feature quantity b. For example, an RNN or a CNN may be used as the DNN 3a.
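 The embodiment does not fix the internal architecture beyond naming a DNN (for example an RNN or a CNN). The following is a minimal PyTorch sketch, assuming a bidirectional LSTM that maps each time-frequency bin of the classification feature quantity b to an embedding vector, in the spirit of the deep-clustering literature cited in the search report; all layer sizes and the embedding dimension are assumptions.

```python
import torch
import torch.nn as nn

class EmbeddingDNN(nn.Module):
    """Sketch: map classification features b (a spectrogram) to per-bin embeddings d."""
    def __init__(self, n_freq=129, emb_dim=20, hidden=300):
        super().__init__()
        self.blstm = nn.LSTM(n_freq, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_freq * emb_dim)
        self.emb_dim = emb_dim

    def forward(self, features):                       # features: (batch, time, n_freq)
        h, _ = self.blstm(features)
        emb = self.proj(h)                             # (batch, time, n_freq * emb_dim)
        emb = emb.view(features.size(0), features.size(1), -1, self.emb_dim)
        return torch.nn.functional.normalize(emb, dim=-1)  # unit-norm embedding per bin

# Example: 100 frames of a 129-bin spectrogram for one section
d = EmbeddingDNN()(torch.randn(1, 100, 129))
print(d.shape)                                         # torch.Size([1, 100, 129, 20])
```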
 Note that the feature quantity extraction unit 2A may perform a short-time Fourier transform on the acoustic signal included in the input signal a for each section to obtain the amplitude on the frequency axis, and calculate the feature quantity on the basis of the amplitude on the frequency axis. Data in which the feature quantities calculated from the acoustic signal in this way are arranged in time series may be used as the classification feature quantity b.
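 For illustration, a minimal sketch of such a short-time Fourier transform feature extraction for one section might look as follows; the use of log amplitude, the sampling rate, and the FFT length are assumptions.

```python
import numpy as np
from scipy.signal import stft

def extract_classification_features(section_signal, fs=16000, n_fft=256):
    """Sketch: STFT of one section of the input signal a, amplitude on the
    frequency axis, arranged in time series as the classification feature b."""
    _, _, spec = stft(section_signal, fs=fs, nperseg=n_fft)
    amplitude = np.abs(spec)                 # amplitude on the frequency axis
    return np.log(amplitude + 1e-8).T        # shape: (time_frames, freq_bins)

features = extract_classification_features(np.random.randn(16000))
print(features.shape)                        # (number of frames, 129)
```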
 The data classification unit 5A classifies the classification data d of each section for each component on the basis of the classification condition data e calculated by the data calculation unit 7A. The classification result information f, which is the classification data d classified for each section and for each component, is output to the signal regeneration unit 6A. A classification method such as the k-means method or a GMM may be used to classify the classification data d.
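 A minimal sketch of the k-means option is given below, assuming the classification data d is a set of embedding vectors (one per time-frequency bin) and that the number of components is taken from the classification condition data e.

```python
import numpy as np
from sklearn.cluster import KMeans

def classify_embeddings(classification_data, num_sources):
    """Sketch: cluster the per-bin classification data d into num_sources groups."""
    kmeans = KMeans(n_clusters=num_sources, n_init=10, random_state=0)
    return kmeans.fit_predict(classification_data)   # component label for every bin

# 1000 time-frequency bins with 20-dimensional embeddings, two assumed components
labels = classify_embeddings(np.random.randn(1000, 20), num_sources=2)
print(np.bincount(labels))                            # number of bins per component
```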
 The signal regeneration unit 6A regenerates the acoustic signal for each component on the basis of the classification data d for each section and for each component in the classification result information f, the section information h, and the signal regeneration feature quantity c of the input signal a for each section. The signal regeneration unit 6A outputs the output signal g, which is the acoustic signal for each component. Note that the output signal g may include an image signal and text information corresponding to the acoustic signal for each component.
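 One common way to realize such regeneration, sketched below only as an assumption (the embodiment does not prescribe the method), is to build a binary mask per component from the classification result information f and apply an inverse short-time Fourier transform to the masked mixture spectrogram.

```python
import numpy as np
from scipy.signal import stft, istft

def regenerate_components(section_signal, labels, num_sources, fs=16000, n_fft=256):
    """Sketch: binary-mask resynthesis of the per-component acoustic signals."""
    _, _, spec = stft(section_signal, fs=fs, nperseg=n_fft)   # complex mixture spectrogram
    freq_bins, frames = spec.shape
    label_map = labels.reshape(frames, freq_bins).T           # one component label per bin
    outputs = []
    for k in range(num_sources):
        mask = (label_map == k).astype(float)
        _, signal_k = istft(spec * mask, fs=fs, nperseg=n_fft)
        outputs.append(signal_k)                               # acoustic signal of component k
    return outputs

# Example with random labels for two components (shape illustration only)
sig = np.random.randn(16000)
n_bins = stft(sig, fs=16000, nperseg=256)[2].size
parts = regenerate_components(sig, np.random.randint(0, 2, n_bins), num_sources=2)
print(len(parts), parts[0].shape)
```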
 The data calculation unit 7A calculates the classification condition data e for each section on the basis of the section information h input from the section information acquisition unit 9 and the input signal a for each section. In addition to the acoustic signal, the input signal a for each section includes sensor information used to specify the number of components of the acoustic signal and the type of the components of the acoustic signal corresponding to the section. The sensor information includes, for example, biological information of a person who may be a sound source, image information in which a person who may be a sound source is photographed, and physical information such as vibration or a temperature change caused by a person's speech.
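 As one conceivable illustration, the number of components for a section might be derived from per-person vibration sensor channels; the threshold, the channel layout, and the function below are assumptions and are not part of the embodiment.

```python
import numpy as np

def estimate_condition_from_sensors(vibration_channels, threshold=0.1):
    """Sketch: count active persons from per-person vibration channels of one section."""
    energy = np.mean(vibration_channels ** 2, axis=1)    # mean energy per person
    num_active = int(np.sum(energy > threshold))
    return {"num_sources": max(num_active, 1)}           # part of classification condition data e

# Three sensor channels for one section; two carry speech-induced vibration
sensors = np.vstack([np.random.randn(2, 1600), np.zeros((1, 1600))])
print(estimate_condition_from_sensors(sensors))          # {'num_sources': 2}
```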
 The section information acquisition unit 9 acquires the section information h and outputs the section information h to each of the feature quantity extraction unit 2A, the signal regeneration unit 6A, and the data calculation unit 7A. For example, the section information acquisition unit 9 may acquire section information h created by an external device, or may acquire section information h input by a user of the acoustic signal separation device 1C.
 Each of the functions of the feature quantity extraction unit 2A, the data estimation unit 3A, the data classification unit 5A, the signal regeneration unit 6A, the data calculation unit 7A, and the section information acquisition unit 9 in the acoustic signal separation device 1C is realized by a processing circuit. That is, the acoustic signal separation device 1C includes a processing circuit for executing the processing from step ST1c to step ST5c described later with reference to FIG. 13. The processing circuit may be the processing circuit 103 that is the dedicated hardware shown in FIG. 2A, or may be the processor 104 that executes a program stored in the memory 105 shown in FIG. 2B.
 Next, the operation will be described.
 FIG. 13 is a flowchart showing an acoustic signal separation method according to the fourth embodiment.
 First, the feature quantity extraction unit 2A extracts the classification feature quantity b and the signal regeneration feature quantity c from the input signal a divided into sections, on the basis of the section information h input from the section information acquisition unit 9 (step ST1c). The feature quantity extraction unit 2A outputs the classification feature quantity b extracted from the input signal a for each section to the data estimation unit 3A, and outputs the signal regeneration feature quantity c extracted from the input signal a for each section to the signal regeneration unit 6A.
 Subsequently, the data estimation unit 3A estimates, by using the DNN 3a, the classification data d for each section of the input signal a from the classification feature quantity b extracted from the input signal a for each section (step ST2c). The data estimation unit 3A outputs the classification data d for each section to the data classification unit 5A.
 The data classification unit 5A classifies the classification data d estimated for each section of the input signal a for each component, on the basis of the classification condition data e calculated by the data calculation unit 7A (step ST3c). The classification result information f, which is the classification data d classified for each component, is output from the data classification unit 5A to the signal regeneration unit 6A.
 The signal regeneration unit 6A regenerates the acoustic signal for each component on the basis of the classification data d classified for each component in the classification result information f, the section information h, and the signal regeneration feature quantity c of the input signal a for each section (step ST4c).
 Next, the feature quantity extraction unit 2A checks whether an unprocessed section remains in the input signal a for each section (step ST5c). If an unprocessed section remains (step ST5c; YES), the process returns to step ST1c, and the above-described series of processes is performed on the input signal a of the remaining sections.
 If no unprocessed section remains (step ST5c; NO), the acoustic signal separation device 1C ends the processing of FIG. 13.
 As described above, in the acoustic signal separation device 1C according to the fourth embodiment, the feature quantity extraction unit 2A extracts the feature quantities from the input signal a for each section on the basis of the section information h. The data estimation unit 3A estimates the classification data d for each section on the basis of the classification feature quantity b by using the DNN 3a. The data classification unit 5A classifies the classification data d of each section for each component on the basis of the classification condition data e. The signal regeneration unit 6A regenerates the acoustic signal for each component for each section of the input signal a, on the basis of the classification data d classified by the data classification unit 5A, the section information h, and the signal regeneration feature quantity c of the input signal a for each section. As a result, even when the number of components of the acoustic signal or the type of the components of the acoustic signal differs from section to section, the acoustic signal separation device 1C can separate the acoustic signal with high accuracy.
 The present invention is not limited to the above embodiments, and within the scope of the present invention, any free combination of the embodiments, any modification of any component of the embodiments, or omission of any component in each of the embodiments is possible.
 Since the acoustic signal separation device according to the present invention can accurately separate an acoustic signal for each component, it can be used, for example, in a conference system in which a plurality of sound sources exist.
 1, 1A, 1B, 1C: acoustic signal separation device; 2, 2A: feature quantity extraction unit; 3, 3A: data estimation unit; 3a: DNN; 3b: network parameter; 4: data acquisition unit; 5, 5A: data classification unit; 6, 6A: signal regeneration unit; 7, 7A: data calculation unit; 8: parameter switching unit; 9: section information acquisition unit; 100: acoustic interface; 101: image interface; 102: text input interface; 103: processing circuit; 104: processor; 105: memory; 106: sensor interface.

Claims (10)

  1.  An acoustic signal separation device comprising:
     a feature quantity extraction unit that extracts a feature quantity from an input signal including an acoustic signal in which one or more components are mixed;
     a data estimation unit that estimates first data, which associates components of acoustic signals output from the same sound source with one another, on the basis of the feature quantity extracted by the feature quantity extraction unit, by using a deep neural network whose network parameters have been learned so as to estimate the first data;
     a data classification unit that classifies the first data estimated by the data estimation unit for each component on the basis of second data in which a classification condition including information on at least one of a number of components of the acoustic signal and a type of the components of the acoustic signal is set; and
     a signal regeneration unit that regenerates an acoustic signal for each component on the basis of the first data classified for each component by the data classification unit and the feature quantity extracted by the feature quantity extraction unit.
  2.  The acoustic signal separation device according to claim 1, wherein
     the feature quantity extraction unit extracts a feature quantity from the input signal for each section on the basis of section information indicating positions of sections into which the input signal is divided,
     the data estimation unit estimates the first data for each section on the basis of the feature quantity extracted by the feature quantity extraction unit, by using the deep neural network,
     the data classification unit classifies the first data of each section for each component on the basis of the second data, and
     the signal regeneration unit regenerates an acoustic signal for each component on the basis of the first data classified for each component by the data classification unit, the section information, and the feature quantity of the input signal for each section extracted by the feature quantity extraction unit.
  3.  The acoustic signal separation device according to claim 1 or 2, further comprising a data calculation unit that calculates the second data from the input signal.
  4.  The acoustic signal separation device according to any one of claims 1 to 3, further comprising a parameter switching unit that switches the network parameters set in the deep neural network on the basis of the second data.
  5.  The acoustic signal separation device according to any one of claims 1 to 3, wherein the classification condition set in the second data is at least one of a lower limit value, an upper limit value, and a range of the number of components of the acoustic signal.
  6.  The acoustic signal separation device according to any one of claims 1 to 3, wherein the classification condition set in the second data is a sound source number sequence indicating a time-series change in the number of sound sources.
  7.  The acoustic signal separation device according to claim 6, wherein the data classification unit smooths the time-series change of the sound source number sequence.
  8.  The acoustic signal separation device according to any one of claims 1 to 3, wherein the classification condition set in the second data is feature information indicating a feature of a sound source.
  9.  The acoustic signal separation device according to any one of claims 1 to 3, wherein the signal regeneration unit outputs the second data in association with the signal of each component.
  10.  An acoustic signal separation method comprising:
     a step in which a feature quantity extraction unit extracts a feature quantity from an input signal including an acoustic signal in which one or more components are mixed;
     a step in which a data estimation unit estimates first data, which associates components of acoustic signals output from the same sound source with one another, on the basis of the feature quantity extracted by the feature quantity extraction unit, by using a deep neural network whose network parameters have been learned so as to estimate the first data;
     a step in which a data classification unit classifies the first data estimated by the data estimation unit for each component on the basis of second data in which a classification condition including information on at least one of a number of components of the acoustic signal and a type of the components of the acoustic signal is set; and
     a step in which a signal regeneration unit regenerates an acoustic signal for each component on the basis of the first data classified for each component by the data classification unit and the feature quantity extracted by the feature quantity extraction unit.
PCT/JP2017/042222 2017-11-24 2017-11-24 Acoustic signal separation device and method for separating acoustic signal WO2019102585A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2017/042222 WO2019102585A1 (en) 2017-11-24 2017-11-24 Acoustic signal separation device and method for separating acoustic signal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2017/042222 WO2019102585A1 (en) 2017-11-24 2017-11-24 Acoustic signal separation device and method for separating acoustic signal

Publications (1)

Publication Number Publication Date
WO2019102585A1 true WO2019102585A1 (en) 2019-05-31

Family

ID=66631855

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2017/042222 WO2019102585A1 (en) 2017-11-24 2017-11-24 Acoustic signal separation device and method for separating acoustic signal

Country Status (1)

Country Link
WO (1) WO2019102585A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017007035A1 (en) * 2015-07-07 2017-01-12 Mitsubishi Electric Corporation Method for distinguishing one or more components of signal

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017007035A1 (en) * 2015-07-07 2017-01-12 Mitsubishi Electric Corporation Method for distinguishing one or more components of signal

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HERSHEY, JOHN R. ET AL.: "DEEP CLUSTERING: DISCRIMINATIVE EMBEDDINGS FOR SEGMENTATION AND SEPARATION", PROC. ICASSP 2016, 20 March 2016 (2016-03-20), pages 31 - 35, XP032900557, doi:10.1109/ICASSP.2016.7471631 *
HIGUCHI, TAKUYA ET AL.: "Deep clustering-based beamforming for separation with unknown number of sources", PROC. INTERSPEECH 2017, 20 August 2017 (2017-08-20) - 24 August 2017 (2017-08-24), pages 1183 - 1187, XP055618559 *

Similar Documents

Publication Publication Date Title
JP6596376B2 (en) Speaker identification method and speaker identification apparatus
CN107077860B (en) Method for converting a noisy audio signal into an enhanced audio signal
US9666183B2 (en) Deep neural net based filter prediction for audio event classification and extraction
US20110125496A1 (en) Speech recognition device, speech recognition method, and program
US9224392B2 (en) Audio signal processing apparatus and audio signal processing method
JP2021500616A (en) Object identification method and its computer equipment and computer equipment readable storage medium
US9478232B2 (en) Signal processing apparatus, signal processing method and computer program product for separating acoustic signals
KR20160024858A (en) Voice data recognition method, device and server for distinguishing regional accent
JP2007233239A (en) Method, system, and program for utterance event separation
JP7370014B2 (en) Sound collection device, sound collection method, and program
JP6725186B2 (en) Learning device, voice section detection device, and voice section detection method
US9460714B2 (en) Speech processing apparatus and method
WO2018051945A1 (en) Speech processing device, speech processing method, and recording medium
JP6553015B2 (en) Speaker attribute estimation system, learning device, estimation device, speaker attribute estimation method, and program
CN111863005A (en) Sound signal acquisition method and device, storage medium and electronic equipment
JP2019066339A (en) Diagnostic device, diagnostic method and diagnostic system each using sound
JP2016143042A (en) Noise removal system and noise removal program
Poorjam et al. A parametric approach for classification of distortions in pathological voices
JP2008039694A (en) Signal count estimation system and method
JP6449331B2 (en) Excitation signal formation method of glottal pulse model based on parametric speech synthesis system
US20150208167A1 (en) Sound processing apparatus and sound processing method
JP2013257418A (en) Information processing device, information processing method, and program
WO2019102585A1 (en) Acoustic signal separation device and method for separating acoustic signal
JP6404780B2 (en) Wiener filter design apparatus, sound enhancement apparatus, acoustic feature quantity selection apparatus, method and program thereof
JP5705190B2 (en) Acoustic signal enhancement apparatus, acoustic signal enhancement method, and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 17933114; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 17933114; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: JP)