WO2019102585A1 - Acoustic signal separation device and method for separating acoustic signal - Google Patents

Acoustic signal separation device and method for separating acoustic signal

Info

Publication number
WO2019102585A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
classification
acoustic signal
unit
signal
Prior art date
Application number
PCT/JP2017/042222
Other languages
French (fr)
Japanese (ja)
Inventor
啓吾 川島
石井 純
岡登 洋平
Original Assignee
Mitsubishi Electric Corporation (三菱電機株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mitsubishi Electric Corporation
Priority to PCT/JP2017/042222
Publication of WO2019102585A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • G10L21/0308Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • the present invention relates to an acoustic signal separation device and an acoustic signal separation method for separating an acoustic signal in which one or more components are mixed into an acoustic signal for each component.
  • DNN: deep neural network
  • In the method of Patent Document 1, the DNN separates the acoustic signal without regard to the number of components of the acoustic signal or the type of those components. Consequently, when the number of components was not considered at design time, or when the number of components or the type of components changes dynamically, the acoustic signal cannot be separated with high accuracy.
  • The present invention solves this problem, and aims to provide an acoustic signal separation device and an acoustic signal separation method that can accurately separate an acoustic signal into its components.
  • An acoustic signal separation device includes a feature extraction unit, a data estimation unit, a data classification unit, and a signal regeneration unit.
  • the feature amount extraction unit extracts a feature amount from an input signal including an acoustic signal in which one or more components are mixed.
  • The data estimation unit estimates first data, which associates components of the acoustic signal output from the same sound source with one another, based on the feature amount extracted by the feature amount extraction unit, using a DNN whose network parameters have been trained in advance for this estimation.
  • The data classification unit classifies the first data estimated by the data estimation unit into components, based on second data in which a classification condition including information on at least one of the number of components of the acoustic signal and the type of those components is set.
  • the signal regeneration unit regenerates an acoustic signal for each component based on the first data classified for each component by the data classification unit and the feature amount extracted by the feature amount extraction unit.
  • According to the present invention, the feature extraction unit extracts the feature amount from the input signal, and the data estimation unit estimates, using the DNN, the first data that associates components of the acoustic signal output from the same sound source, based on that feature amount.
  • The data classification unit classifies the first data into components based on the second data, in which a classification condition including information on at least one of the number of components of the acoustic signal and the type of those components is set, and the signal regeneration unit regenerates an acoustic signal for each component based on the classified first data and the feature amount.
  • As a result, even when the number of components of the acoustic signal was not considered at design time, or when the number or type of components changes dynamically, the acoustic signal separation device can separate the acoustic signal into its components with high accuracy.
  • FIG. 1 is a block diagram showing the configuration of an acoustic signal separation device according to Embodiment 1 of the present invention.
  • FIG. 2A is a block diagram showing a hardware configuration that realizes the functions of the acoustic signal separation device according to Embodiment 1. FIG. 2B is a block diagram showing a hardware configuration that executes software realizing those functions.
  • FIG. 3 is a flowchart showing the acoustic signal separation method according to Embodiment 1.
  • FIG. 4A is a diagram showing classification data corresponding to two different components mapped into a two-dimensional space. FIG. 4B is a diagram showing the classification data of FIG. 4A classified by component.
  • FIG. 5A is a diagram showing classification data corresponding to the same component mapped into a two-dimensional space. FIG. 5B is a diagram showing the classification data of FIG. 5A incorrectly classified into two components. FIG. 5C is a diagram showing the classification data of FIG. 5A classified into one component.
  • FIG. 6A is a diagram showing classification data and classification condition data corresponding to two components arranged in time series. FIG. 6B is a diagram showing the classification data of FIG. 6A classified by component, together with smoothed classification condition data.
  • FIG. 7 is a block diagram showing the configuration of an acoustic signal separation device according to Embodiment 2 of the present invention.
  • FIG. 8A is a block diagram showing a hardware configuration that realizes the functions of the acoustic signal separation device according to Embodiment 2. FIG. 8B is a block diagram showing a hardware configuration that executes software realizing those functions.
  • FIG. 9 is a flowchart showing the acoustic signal separation method according to Embodiment 2.
  • FIG. 10 is a block diagram showing the configuration of an acoustic signal separation device according to Embodiment 3 of the present invention.
  • FIG. 11 is a flowchart showing the acoustic signal separation method according to Embodiment 3.
  • FIG. 12 is a block diagram showing the configuration of an acoustic signal separation device according to Embodiment 4 of the present invention.
  • FIG. 13 is a flowchart showing the acoustic signal separation method according to Embodiment 4.
  • FIG. 1 is a block diagram showing a configuration of an acoustic signal separation device 1 according to Embodiment 1 of the present invention.
  • The acoustic signal separation device 1 includes a feature extraction unit 2, a data estimation unit 3, a data acquisition unit 4, a data classification unit 5, and a signal regeneration unit 6; it separates the acoustic signal included in the input signal a into acoustic signals for each component and outputs an output signal g including the acoustic signal for each component.
  • the feature amount extraction unit 2 extracts a feature amount from the input signal a.
  • The input signal a may consist only of an acoustic signal in which one or more components are mixed, or it may include such an acoustic signal together with other signals.
  • the input signal a may be a signal including, in addition to the acoustic signal, an image signal or text data associated with the acoustic signal.
  • the feature quantities extracted from the input signal a by the feature quantity extraction unit 2 are the classification feature quantity b and the signal regeneration feature quantity c.
  • the classification feature amount b is a feature amount used for estimation of the classification data d by the data estimation unit 3.
  • the feature amount extraction unit 2 performs short-time Fourier transformation on the acoustic signal included in the input signal a to obtain the amplitude on the frequency axis, and calculates the feature amount based on the amplitude on the frequency axis.
  • Data in which the feature quantities calculated from the acoustic signal in this manner are arranged in time series may be used as the classification feature quantity b.
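  • As an illustrative sketch only (not taken from the publication), the classification feature amount b described above could be computed along the following lines; the frame length, hop size, and use of a log-magnitude are assumptions of this sketch.

```python
import numpy as np
from scipy.signal import stft

def extract_classification_features(audio, sample_rate, frame_len=512, hop=128):
    """Sketch of the feature amount extraction unit 2: short-time Fourier transform
    of the acoustic signal, amplitude on the frequency axis, arranged in time series."""
    _, _, spectrum = stft(audio, fs=sample_rate, nperseg=frame_len,
                          noverlap=frame_len - hop)
    magnitude = np.abs(spectrum)             # amplitude on the frequency axis
    features_b = np.log(magnitude + 1e-8).T  # classification feature b: (frames, bins)
    features_c = spectrum.T                  # complex spectrum kept for regeneration
    return features_b, features_c
```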
  • the signal regeneration feature quantity c is a feature quantity used for regeneration of the signal for each component by the signal regeneration unit 6.
  • The signal regeneration feature amount c may be spectral coefficients calculated by the feature amount extraction unit 2 performing the short-time Fourier transform on the acoustic signal included in the input signal a, and may also include image information or text data.
  • the data estimation unit 3 estimates classification data d based on the classification feature amount b extracted from the input signal a by the feature amount extraction unit 2 using the DNN 3a.
  • the classification data d is first data for correlating the components of the acoustic signal output from the same sound source.
  • For example, the classification data d may be values assigned to the components of the acoustic signal, transformed so that the distance between time-frequency components of the acoustic signal output from the same sound source becomes small.
  • In the DNN 3a, a network parameter 3b that has been learned in advance so as to estimate the classification data d from the classification feature b is set.
  • the DNN 3a in which the network parameter 3b is set estimates the classification data d by hierarchically calculating the classification feature amount b.
  • As the DNN 3a, for example, a Recurrent Neural Network (RNN) or a Convolutional Neural Network (CNN) may be used.
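  • For illustration, a recurrent network that maps the classification feature b to classification data d (one embedding vector per time-frequency component, so that components from the same sound source lie close together) might look like the following sketch; the layer sizes, the bidirectional LSTM, and the embedding dimension are assumptions, not taken from the publication.

```python
import torch
import torch.nn as nn

class ClassificationDataEstimator(nn.Module):
    """Sketch of the data estimation unit 3 with DNN 3a: estimates classification
    data d (per time-frequency embeddings) from the classification feature b."""
    def __init__(self, n_freq_bins, embed_dim=20, hidden=300):
        super().__init__()
        self.rnn = nn.LSTM(n_freq_bins, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_freq_bins * embed_dim)
        self.embed_dim = embed_dim

    def forward(self, features_b):                 # (batch, frames, bins)
        h, _ = self.rnn(features_b)
        e = self.proj(h)                           # (batch, frames, bins * embed_dim)
        e = e.view(e.shape[0], e.shape[1], -1, self.embed_dim)
        return nn.functional.normalize(e, dim=-1)  # unit-norm classification data d
```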
  • the data acquisition unit 4 receives and acquires the input of the classification condition data e.
  • the classification condition data e acquired by the data acquisition unit 4 is output to the data classification unit 5.
  • Alternatively, the data classification unit 5 may obtain the classification condition data e directly, and the data acquisition unit 4 may be provided in a device separate from the acoustic signal separation device 1. That is, in the acoustic signal separation device 1, the data classification unit 5 may itself have the function of acquiring the classification condition data e, in which case the data acquisition unit 4 need not be provided.
  • the classification condition data e is second data in which the classification condition of the classification data d is set.
  • the classification condition set in the classification condition data e includes information on at least one of the number of components of the acoustic signal and the type of the component of the acoustic signal.
  • the information on the number of components of the sound signal may be data indicating the number of dynamically changing sound sources, and may be, for example, sound source sequence data in which the number of sound sources is arranged in time series.
  • The information on the type of component of the acoustic signal may be any information that can specify the sound source, such as the gender of the speaker, the type of language, or the type of the output sound. For example, if the types of the components of the acoustic signal are a siren and an animal call, the acoustic signal is a mixture of the siren component output from an alarm and the component of the call uttered by the animal.
  • the data classification unit 5 classifies the classification data d estimated by the data estimation unit 3 for each component based on the classification condition data e acquired by the data acquisition unit 4.
  • Classification methods such as k-means clustering or GMM (Gaussian Mixture Models) may be used to classify the classification data d.
  • Classification result information f that is classification data d classified by the data classification unit 5 is output to the signal regeneration unit 6.
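  • A minimal sketch of this classification step, assuming the classification condition data e simply carries the number of sound sources and using k-means (one of the methods the description names):

```python
from sklearn.cluster import KMeans

def classify_classification_data(classification_data_d, num_sources):
    """Sketch of the data classification unit 5: cluster the classification data d
    into as many groups as the number of sound sources in the classification
    condition data e, yielding a component label per time-frequency bin."""
    frames, bins, embed_dim = classification_data_d.shape
    flat = classification_data_d.reshape(-1, embed_dim)
    labels = KMeans(n_clusters=num_sources, n_init=10).fit_predict(flat)
    return labels.reshape(frames, bins)  # classification result information f
```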
  • The signal regeneration unit 6 receives the classification result information f from the data classification unit 5 and, based on the classification data d for each component in the classification result information f, regenerates an acoustic signal for each component from the signal regeneration feature amount c.
  • the signal regeneration unit 6 outputs an output signal g which is an acoustic signal for each regenerated component.
  • the output signal g may include an image signal and text information corresponding to the acoustic signal for each regenerated component.
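  • One hedged way the signal regeneration unit 6 could turn the per-component classification result into component signals is binary time-frequency masking of the signal regeneration feature amount c followed by an inverse short-time Fourier transform; the publication does not prescribe this exact procedure.

```python
from scipy.signal import istft

def regenerate_components(features_c, labels, num_sources, sample_rate,
                          frame_len=512, hop=128):
    """Sketch of the signal regeneration unit 6: mask the complex spectrum
    (signal regeneration feature c) with the per-component labels and invert it."""
    outputs = []
    for component in range(num_sources):
        mask = (labels == component).astype(float)   # binary mask, (frames, bins)
        masked = (features_c * mask).T               # back to (bins, frames) for istft
        _, audio = istft(masked, fs=sample_rate, nperseg=frame_len,
                         noverlap=frame_len - hop)
        outputs.append(audio)                        # acoustic signal of one component
    return outputs                                   # output signal g, per component
```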
  • FIG. 2A is a block diagram showing a hardware configuration for realizing the function of the acoustic signal separation device 1.
  • FIG. 2B is a block diagram showing a hardware configuration for executing software for realizing the function of the acoustic signal separation device 1.
  • the acoustic interface 100 is an interface that receives an acoustic signal included in an input signal a and outputs an acoustic signal included in an output signal g.
  • the acoustic interface 100 is connected to a microphone that collects an acoustic signal, and is connected to a speaker that outputs the acoustic signal.
  • the image interface 101 is an interface that receives an image signal included in an input signal a and outputs an image signal included in an output signal g.
  • the image interface 101 is connected to a camera for capturing an image signal and connected to a display for displaying the image signal.
  • the text input interface 102 is an interface for inputting text information included in the input signal a.
  • the text input interface 102 is connected to a keyboard or mouse for inputting text information.
  • The memory (not shown) included in the processing circuit 103 shown in FIG. 2A, or the memory 105 shown in FIG. 2B, temporarily stores the input signal a, the classification feature b, the signal regeneration feature c, the classification data d, the classification condition data e, the classification result information f, and the output signal g.
  • the processing circuit 103 or the processor 104 appropriately reads these data to separate the acoustic signal.
  • The acoustic signal separation device 1 includes a processing circuit for executing the processing from step ST1 to step ST4 described later with reference to FIG. 3.
  • the processing circuit may be dedicated hardware or may be a CPU (Central Processing Unit) that executes a program stored in a memory.
  • The processing circuit 103 may be, for example, a single circuit, a composite circuit, a programmed processor, a parallel-programmed processor, an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or a combination thereof.
  • The respective functions of the feature amount extraction unit 2, the data estimation unit 3, the data acquisition unit 4, the data classification unit 5, and the signal regeneration unit 6 may be realized by separate processing circuits, or may be realized collectively by a single processing circuit.
  • When the processing circuit is the processor 104 shown in FIG. 2B, the functions of the feature amount extraction unit 2, the data estimation unit 3, the data acquisition unit 4, the data classification unit 5, and the signal regeneration unit 6 are realized by software, firmware, or a combination of software and firmware.
  • the software or firmware is written as a program and stored in the memory 105.
  • the processor 104 reads out and executes the program stored in the memory 105 to thereby perform the respective functions of the feature amount extraction unit 2, the data estimation unit 3, the data acquisition unit 4, the data classification unit 5 and the signal regeneration unit 6.
  • The acoustic signal separation device 1 includes the memory 105 for storing programs which, when executed by the processor 104, result in the processing from step ST1 to step ST4 shown in FIG. 3 being performed.
  • These programs cause a computer to execute the procedure or method of the feature extraction unit 2, the data estimation unit 3, the data acquisition unit 4, the data classification unit 5, and the signal regeneration unit 6.
  • The memory 105 may also be regarded as a computer-readable storage medium storing programs for causing a computer to function as the feature extraction unit 2, the data estimation unit 3, the data acquisition unit 4, the data classification unit 5, and the signal regeneration unit 6.
  • The memory 105 may be, for example, a nonvolatile or volatile semiconductor memory such as a random access memory (RAM), a read only memory (ROM), a flash memory, an erasable programmable ROM (EPROM), or an electrically erasable programmable ROM (EEPROM).
  • The memory 105 may also be a magnetic disk, a flexible disk, an optical disc, a compact disc, a mini disc, a DVD, or the like.
  • the memory 105 may be an external memory such as a USB (Universal Serial Bus) memory.
  • The functions of the feature quantity extraction unit 2, the data estimation unit 3, the data acquisition unit 4, the data classification unit 5, and the signal regeneration unit 6 may be realized partly by dedicated hardware and partly by software or firmware. For example, the functions of the feature amount extraction unit 2, the data estimation unit 3, and the data acquisition unit 4 may be realized by a processing circuit that is dedicated hardware, while the functions of the data classification unit 5 and the signal regeneration unit 6 are realized by the processor 104 reading and executing a program stored in the memory 105. Thus, the processing circuit can realize each of the above functions by hardware, software, firmware, or a combination thereof.
  • FIG. 3 is a flowchart showing an acoustic signal separation method according to the first embodiment.
  • the feature quantity extraction unit 2 extracts the classification feature quantity b and the signal regeneration feature quantity c from the input signal a (step ST1).
  • the classification feature amount b is output from the feature extraction unit 2 to the data estimation unit 3
  • the signal regeneration feature amount c is output from the feature extraction unit 2 to the signal regeneration unit 6.
  • The input signal a may include, in addition to the acoustic signal input through the acoustic interface 100, an image signal input through the image interface 101 or text information input through the text input interface 102. The feature quantity extraction unit 2 may also extract the feature quantities by reading the input signal a from a memory (not shown) included in the processing circuit 103 or from the memory 105. Furthermore, the input signal a may be stream data.
  • the data estimation unit 3 estimates classification data d based on the classification feature b using DNN 3a (step ST2).
  • the classification data d is output from the data estimation unit 3 to the data classification unit 5.
  • the data classification unit 5 classifies the classification data d estimated by the data estimation unit 3 for each component based on the classification condition data e (step ST3).
  • the data classification unit 5 outputs, to the signal regeneration unit 6, classification result information f which is classification data d classified for each component.
  • FIG. 4A is a view showing classification data d1 and d2 corresponding to components of two different acoustic signals mapped in a two-dimensional space.
  • Here, there are two sound sources, sound source A and sound source B, and it is assumed that the input signal a contains a mixture of the component of the acoustic signal output from sound source A and the component of the acoustic signal output from sound source B.
  • Classification data d1, indicated by circle symbols, is data that associates the components of the acoustic signal output from sound source A, and classification data d2, indicated by triangle symbols, is data that associates the components of the acoustic signal output from sound source B.
  • When the acoustic signal changes, the classification feature b changes accordingly. Therefore, when the data estimation unit 3 estimates the classification data d from the classification feature b using the DNN 3a, the value of the classification data d may vary with changes in the classification feature b, even for classification data d corresponding to components of the acoustic signal output from the same sound source.
  • The classification data d, dispersed over a plurality of values, is input to the data classification unit 5 without any indication of whether it is classification data d1 belonging to sound source A or classification data d2 belonging to sound source B.
  • FIG. 4B is a view showing classification data d1 and d2 classified by component.
  • Here, the classification condition that the number of sound sources is two is set in the classification condition data e. Since the number of sound sources is two, the data classification unit 5 classifies the classification data d1 into a first group A1 corresponding to sound source A, and the classification data d2 into a second group A2 corresponding to sound source B.
  • FIG. 5A is a diagram showing classification data d1 corresponding to the same component mapped in a two-dimensional space.
  • FIG. 5B is a diagram showing classification data d1 incorrectly classified into two components.
  • FIG. 5C is a diagram showing classification data d1 classified into one component.
  • Here, the only sound source is sound source A, and the acoustic signal contained in the input signal a consists solely of the component output from sound source A.
  • The classification data d, dispersed over a plurality of values, is input to the data classification unit 5 without any indication of which sound source it belongs to.
  • If the classification data d were simply divided into a fixed number of groups, it could be incorrectly classified into two components as shown in FIG. 5B. In contrast, the data classification unit 5 classifies the classification data d based on, for example, the number of sound sources set in the classification condition data e.
  • the data classification unit 5 correctly classifies the plurality of classification data d into the group C corresponding to the sound source A, as shown in FIG. 5C.
  • the acoustic signal separation device 1 can prevent separation errors of the acoustic signal due to the mismatch between the number of separations and the number of sound sources.
  • FIG. 6A is a view showing classification data d1 and d2 and classification condition data e corresponding to two components arranged in time series. For example, when a plurality of speakers are sound sources and a plurality of speakers speak, the number of sound sources changes dynamically. The example shown in FIG. 6A shows the case where the sound signal is also output from the sound source B after the sound signal is output from the sound source A.
  • the classification data d is classified with high accuracy by using the classification condition including the information on at least one of the number of acoustic signal components and the type of acoustic signal components.
  • The data classification unit 5 uses the classification condition data e, which is a sound source number sequence indicating the number of sound sources over time, to classify the classification data d1 into a first group D1 corresponding to sound source A and the classification data d2 into a second group D2 corresponding to sound source B.
  • the data classification unit 5 can classify the classification data d with high accuracy even if the number of sound sources dynamically changes.
  • FIG. 6B is a view showing classification data d1 and d2 classified according to components and smoothed classification condition data.
  • the number of sound sources at time e1 is “1” although the number of sound sources at times before and after time e1 is “2”.
  • The data classification unit 5 smoothes the time-series changes of the sound source number sequence. For example, the data classification unit 5 replaces the number of sound sources at time e1 with the average of the values "2" at the times before and after time e1. As a result, even if the number of sound sources in the sound source number sequence changes abruptly, the data classification unit 5 can classify the classification data d with high accuracy.
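  • As a minimal sketch of this smoothing, assuming the sound source number sequence is a per-frame integer array, a short median filter removes an isolated jump such as the one at time e1 (the window length is an assumption of the sketch):

```python
import numpy as np
from scipy.ndimage import median_filter

def smooth_source_count(source_count_sequence, window=3):
    """Sketch: smooth sudden changes in the sound source number sequence so that
    an isolated count of 1 surrounded by counts of 2 is replaced by 2."""
    counts = np.asarray(source_count_sequence)
    return median_filter(counts, size=window, mode='nearest')

# e.g. smooth_source_count([2, 2, 1, 2, 2]) -> array([2, 2, 2, 2, 2])
```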
  • The classification condition may also be time-series data on the type of component of the acoustic signal. For example, while only sound source A is outputting, information on the type of the component of the acoustic signal output from sound source A is set as the classification condition of the classification condition data e, and while both sound sources A and B are outputting, information on the types of the components of the acoustic signals output from each of sound sources A and B is set.
  • the data classification unit 5 can classify the classification data d with high accuracy even if the type of the component of the acoustic signal changes dynamically by referring to the time-series data of the type of the component of the acoustic signal.
  • The signal regeneration unit 6 receives the signal regeneration feature amount c from the feature amount extraction unit 2 and the classification result information f from the data classification unit 5, and regenerates an acoustic signal for each component based on the classification data d for each component in the classification result information f and the signal regeneration feature amount c (step ST4).
  • For example, the signal regeneration unit 6 identifies the signal regeneration feature amount c corresponding to sound source A using the classification data d1 classified into the first group A1 shown in FIG. 4B, and regenerates the acoustic signal of that component based on the identified signal regeneration feature amount c and the classification data d1. Similarly, the signal regeneration unit 6 identifies the signal regeneration feature amount c corresponding to sound source B using the classification data d2 classified into the second group A2 shown in FIG. 4B, and regenerates the component signal output from sound source B based on the identified signal regeneration feature amount c and the classification data d2. In this way, the acoustic signal included in the input signal a is separated into the acoustic signal of the component output from sound source A and the acoustic signal of the component output from sound source B.
  • The signal regeneration unit 6 may output the classification condition data e in association with the acoustic signal for each regenerated component. For example, when information indicating that the sound sources are sound source A and sound source B is set in the classification condition data e as information on the types of the components of the acoustic signal, the signal regeneration unit 6 outputs the regenerated signal corresponding to sound source A in association with information indicating sound source A, and the regenerated signal corresponding to sound source B in association with information indicating sound source B. In this way, the signal regeneration unit 6 can provide an output signal g from which the sound source that output the acoustic signal of each component can be identified.
  • The signal regeneration unit 6 may also determine the type of component based on the distance between a teacher signal related to the type of the component of the acoustic signal and the acoustic signal of the regenerated component, and associate that type with the regenerated acoustic signal. Further, the signal regeneration unit 6 may compare the utterance timing of a speaker, analyzed from an image signal extracted from the input signal a, with the output timing of the acoustic signal of the regenerated component, and associate the speaker with the acoustic signal whose timing overlaps. For example, the signal regeneration unit 6 may identify the speaker's utterance timing from the speaker's lip information included in the image signal.
  • the classification condition set in the classification condition data e may be the number of components of the acoustic signal, but may be at least one of the lower limit value, the upper limit value, and the range of the number of components of the acoustic signal.
  • The data classification unit 5 may stop classifying the classification data d when the number of components of the acoustic signal falls below the lower limit, exceeds the upper limit, or is outside the specified range. In these cases, the separation of the acoustic signal is stopped.
  • In this way, a stop criterion for classification of the classification data d can be set as a classification condition, and the data classification unit 5 can classify the classification data d with high accuracy within the range designated by the classification condition.
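  • A hedged sketch of how such lower-limit, upper-limit, and range conditions could gate the classification step (the parameter names are illustrative assumptions):

```python
def classification_allowed(num_components, lower=None, upper=None, valid_range=None):
    """Sketch: decide whether the data classification unit 5 should classify,
    based on the classification condition for the number of acoustic signal
    components; returning False corresponds to stopping the separation."""
    if lower is not None and num_components < lower:
        return False           # below the lower limit
    if upper is not None and num_components > upper:
        return False           # above the upper limit
    if valid_range is not None and num_components not in valid_range:
        return False           # outside the designated range
    return True
```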
  • As the classification condition set in the classification condition data e, feature information indicating a feature of the sound source may also be set.
  • the feature information indicating the feature of the sound source may be an output aspect of an acoustic signal by the sound source, or may be a physical feature of the sound source.
  • the output mode of the sound signal may be an average output time at which the sound signal is output from the sound source.
  • the physical feature of the sound source may be information on the physical constitution of the person.
  • the data classification unit 5 can classify the classification data d with high accuracy according to the feature information indicating the feature of the sound source.
  • the feature quantity extraction unit 2 extracts the feature quantity from the input signal a.
  • the data estimation unit 3 estimates classification data d based on the classification feature b using DNN 3a.
  • The data classification unit 5 classifies the classification data d into components based on the classification condition data e, in which a classification condition including information on at least one of the number of components of the acoustic signal and the type of those components is set.
  • the signal regeneration unit 6 regenerates an acoustic signal for each component based on the classification data d classified for each component and the feature value c for signal regeneration.
  • As a result, even when the number of components of the acoustic signal was not considered at design time, or when the number or type of components changes dynamically, the acoustic signal separation device 1 can accurately separate the acoustic signal into its components.
  • The classification condition set in the classification condition data e may also be at least one of a lower limit value, an upper limit value, and a range of the number of components of the acoustic signal.
  • the data classification unit 5 can classify the classification data d with high accuracy within the range designated by the classification condition.
  • the classification condition set in the classification condition data e is a sound source number sequence indicating a time-series change in the number of sound sources.
  • the data classification unit 5 can classify the classification data d with high accuracy even if the number of sound sources changes dynamically.
  • the data classification unit 5 smoothes the time-series change of the sound source sequence. Thereby, even if the number of sound sources in the sound source sequence changes suddenly, the data classification unit 5 can classify the classification data d with high accuracy.
  • the classification condition set in the classification condition data e is feature information indicating the feature of the sound source.
  • the data classification unit 5 can classify the classification data d with high accuracy according to the feature information indicating the feature of the sound source.
  • the signal regeneration unit 6 outputs the information on the type of the component of the acoustic signal in association with the signal of each component. Thereby, the signal regeneration unit 6 can provide the output signal g capable of specifying the sound source that has output the regenerated acoustic signal.
  • FIG. 7 is a block diagram showing a configuration of an acoustic signal separation device 1A according to Embodiment 2 of the present invention.
  • The acoustic signal separation device 1A includes a feature extraction unit 2, a data estimation unit 3, a data classification unit 5, a signal regeneration unit 6, and a data calculation unit 7; it separates the acoustic signal included in the input signal a into acoustic signals for each component and outputs an output signal g including the acoustic signal for each component.
  • the data calculation unit 7 calculates classification condition data e from the input signal a.
  • the input signal a includes, in addition to the acoustic signal, sensor information for specifying the number of components of the acoustic signal and the type of the component of the acoustic signal.
  • The sensor information includes, for example, biological information of a person who can be a sound source, such as brain waves or heart rate, image information of a person who can be a sound source, and physical information such as vibration or temperature changes caused by the person's speech.
  • The data calculation unit 7 uses the sensor information included in the input signal a to calculate classification condition data e in which a classification condition including information on at least one of the number of components of the acoustic signal and the type of those components is set.
  • FIG. 8A is a block diagram showing a hardware configuration for realizing the function of the acoustic signal separation device 1A.
  • FIG. 8B is a block diagram showing a hardware configuration for executing software for realizing the function of the acoustic signal separation device 1A.
  • the sensor interface 106 is an interface for inputting the sensor information described above.
  • The sensor interface 106 is connected to a biological sensor that detects biological information of a person who can be a sound source, a camera that captures a person who can be a sound source, or a physical sensor that detects vibration or temperature changes caused by a person's speech.
  • The acoustic signal separation device 1A includes a processing circuit for executing the processing from step ST1a to step ST5a described later with reference to FIG. 9.
  • the processing circuit may be the processing circuit 103 that is dedicated hardware, or may be the processor 104 that executes a program stored in the memory 105.
  • FIG. 9 is a flowchart showing an acoustic signal separation method according to the second embodiment.
  • The processes of step ST1a, step ST3a, step ST4a, and step ST5a are the same as those of step ST1, step ST2, step ST3, and step ST4 shown in FIG. 3.
  • In step ST2a, the data calculation unit 7 calculates the classification condition data e based on the input signal a including the sensor information. For example, the data calculation unit 7 identifies the presence or absence of persons and the number of persons based on the sensor information, and calculates classification condition data e in which a classification condition including the identified information is set.
  • The data calculation unit 7 may also detect a person's speech from the person's lip information and thereby specify the number of components of the acoustic signal. Furthermore, the data calculation unit 7 may specify the number of components of the acoustic signal and the type of those components using a DNN machine-learned to output at least one of them from the sensor information included in the input signal a. The data calculation unit 7 may also detect the number of sound sources from the acoustic signal itself; for example, the number of speakers detected using a speaker verification technique for detecting specific speakers may be used as the number of components of the acoustic signal.
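  • As an illustrative sketch only (the sensor layout and the per-person activity detection are assumptions), the data calculation unit 7 could derive a sound source number sequence from per-frame activity flags for each person who can be a sound source, for example obtained from lip information:

```python
import numpy as np

def calculate_classification_condition(person_activity):
    """Sketch of the data calculation unit 7: given activity flags of shape
    (num_persons, frames), e.g. derived from lip movement, produce a sound
    source number sequence to be set in the classification condition data e."""
    activity = np.asarray(person_activity, dtype=bool)
    return activity.sum(axis=0)   # number of active sound sources per frame
```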
  • the data classification unit 5 classifies the classification data d estimated by the data estimation unit 3 for each component based on the classification condition data e calculated by the data calculation unit 7.
  • the data classification unit 5 generates, from the classification data d classified for each component, classification result information f in which the classification data d belonging to the same component are associated with each other and outputs the generated classification result information f to the signal regeneration unit 6.
  • the acoustic signal separation device 1A includes the data calculation unit 7 that calculates the classification condition data e from the input signal a.
  • the acoustic signal separation device 1A can obtain classification condition data e from the input signal a.
  • the acoustic signal separation device 1A can accurately separate the acoustic signal even if the number of components of the acoustic signal or the type of the component of the acoustic signal is unknown.
  • FIG. 10 is a block diagram showing a configuration of an acoustic signal separation device 1B in accordance with Embodiment 3 of the present invention.
  • The acoustic signal separation device 1B includes a feature extraction unit 2, a data estimation unit 3, a data classification unit 5, a signal regeneration unit 6, a data calculation unit 7, and a parameter switching unit 8; it separates the acoustic signal included in the input signal a into acoustic signals for each component and outputs an output signal g including the acoustic signal for each component.
  • the parameter switching unit 8 switches the network parameter 3b of the DNN 3a based on the classification condition data e.
  • The network parameters 3b are learned in advance according to the characteristics of various sound sources. For example, the parameter switching unit 8 refers to the classification condition data e, selects a network parameter 3b corresponding to the characteristics of the sound source to be processed from among the various network parameters 3b learned in advance, and sets this network parameter 3b in the DNN 3a.
  • The acoustic signal separation device 1B includes a processing circuit for executing the processing from step ST1b to step ST6b described later with reference to FIG. 11.
  • the processing circuit may be the processing circuit 103 that is dedicated hardware shown in FIG. 2A, or may be the processor 104 that executes a program stored in the memory 105 shown in FIG. 2B.
  • FIG. 11 is a flowchart showing an acoustic signal separation method according to the third embodiment.
  • the processes of step ST1b, step ST2b, and step ST4b to step ST6b are the same as the processes of step ST1a, step ST2a, and step ST3a to step ST5a shown in FIG.
  • In step ST3b, the parameter switching unit 8 refers to the classification condition data e, selects the network parameter 3b corresponding to the sound source to be processed from among the various network parameters 3b learned in advance, and sets the selected network parameter 3b in the DNN 3a.
  • In the classification condition data e, a classification condition including information on at least one of the number of components of the acoustic signal and the type of those components is set. The information on the type of components is information that can specify the sound source, such as the gender of the speaker, the type of language, or the type of output sound. As the classification condition, in addition to the type of the components of the acoustic signal, information on characteristics of the sound source, such as the number of sound sources, the average output time of the acoustic signal from the sound source, and the physique of the person who is the sound source, may also be set.
  • the parameter switching unit 8 may select the network parameter 3b using a decision tree based on a predetermined selection rule.
  • the selection rule may be, for example, a rule that defines the type of acoustic signal component and the number of acoustic signal components, such as “two males” and “three or more females”.
  • Alternatively, the parameter switching unit 8 may select the network parameter 3b using a neural network or a GMM machine-learned to select, with the classification condition data e as input, the network parameter 3b corresponding to the sound source to be processed.
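  • A minimal rule-based sketch of the parameter switching unit 8; the rule keys and parameter file names are purely illustrative assumptions and are not taken from the publication.

```python
def select_network_parameters(classification_condition, parameter_sets):
    """Sketch of the parameter switching unit 8: pick the pre-learned network
    parameter 3b whose selection rule matches the classification condition data e
    (here reduced to a speaker type and a number of sources)."""
    speaker_type = classification_condition.get("speaker_type")  # e.g. "male"
    num_sources = classification_condition.get("num_sources")    # e.g. 2
    for rule, params in parameter_sets:
        if rule == {"speaker_type": speaker_type, "num_sources": num_sources}:
            return params                 # matching pre-learned network parameter 3b
    return parameter_sets[-1][1]          # assumed default parameter set

# Hypothetical usage:
# parameter_sets = [({"speaker_type": "male", "num_sources": 2}, "dnn_male_two.params"),
#                   ({"speaker_type": "female", "num_sources": 3}, "dnn_female_three.params"),
#                   ({"speaker_type": None, "num_sources": None}, "dnn_default.params")]
```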
  • the acoustic signal separation device 1B includes the parameter switching unit 8 that switches the network parameter 3b set in the DNN 3a based on the classification condition data e.
  • the data estimation unit 3 can estimate the classification data d corresponding to the sound source to be processed using the DNN 3a in which the network parameter 3b is switched. Accordingly, the acoustic signal separation device 1B can separate the acoustic signal with high accuracy.
  • FIG. 12 is a block diagram showing a configuration of an acoustic signal separation device 1C according to Embodiment 4 of the present invention.
  • The acoustic signal separation device 1C includes a feature extraction unit 2A, a data estimation unit 3A, a data classification unit 5A, a signal regeneration unit 6A, a data calculation unit 7A, and a section information acquisition unit 9; it separates the acoustic signal included in the input signal a into acoustic signals for each component and outputs an output signal g including the acoustic signal for each component.
  • the feature quantity extraction unit 2A extracts the feature quantity b for classification and the feature quantity c for signal regeneration from the input signal a for each section based on the section information h.
  • A section is a segment of the input signal a; the input signal a is divided into sections according to changes in the acoustic signal.
  • the section information h is information indicating the position of the section separating the input signal a.
  • the section information h may be information that can specify the position of the section in the input signal a, such as time information of the input signal a or a change value of a feature of the input signal a.
  • the data estimation unit 3A estimates the classification data d based on the classification feature b extracted from the input signal a for each section using the DNN 3a.
  • a network parameter 3b learned in advance so as to estimate classification data d based on the classification feature value b in section units is set.
  • the DNN 3a in which the network parameter 3b is set estimates the classification data d for each section by hierarchically operating the classification feature amount b.
  • As the DNN 3a, for example, an RNN or a CNN may be used.
  • The feature quantity extraction unit 2A performs the short-time Fourier transform on the acoustic signal included in the input signal a for each section to obtain the amplitude on the frequency axis, and calculates the feature quantities based on the amplitude on the frequency axis.
  • Data in which the feature quantities calculated from the acoustic signal in this manner are arranged in time series may be used as the classification feature quantity b.
  • The data classification unit 5A classifies the classification data d for each section into components based on the classification condition data e calculated by the data calculation unit 7A.
  • Classification result information f which is classification data d classified for each section and each component is output to the signal regeneration unit 6A.
  • Classification methods such as k-means or GMM may be used to classify the classification data d.
  • The signal regeneration unit 6A regenerates an acoustic signal for each component based on the classification data d for each section and each component in the classification result information f, the section information h, and the signal regeneration feature amount c of the input signal a for each section.
  • the signal regeneration unit 6A outputs an output signal g which is an acoustic signal for each component.
  • the output signal g may include an image signal and text information corresponding to the acoustic signal for each component.
  • the data calculation unit 7A calculates the classification condition data e for each section based on the section information h input from the section information acquisition unit 9 and the input signal a for each section.
  • the input signal a for each section includes, in addition to the acoustic signal, sensor information used to identify the number of components of the acoustic signal corresponding to the section and the type of the component of the acoustic signal.
  • The sensor information includes, for example, biological information of a person who can be a sound source, image information in which a person who can be a sound source is captured, and physical information such as vibration or temperature changes caused by the person's speech.
  • the section information acquisition unit 9 acquires the section information h, and outputs the section information h to each of the feature extraction unit 2A, the signal regeneration unit 6A, and the data calculation unit 7A.
  • The section information acquisition unit 9 may acquire section information h created by an external device, or may acquire section information h input by the user of the acoustic signal separation device 1C.
  • The acoustic signal separation device 1C includes a processing circuit for executing the processing from step ST1c to step ST5c described later with reference to FIG. 13.
  • the processing circuit may be the processing circuit 103 that is dedicated hardware shown in FIG. 2A, or may be the processor 104 that executes a program stored in the memory 105 shown in FIG. 2B.
  • FIG. 13 is a flowchart showing an acoustic signal separation method according to the fourth embodiment.
  • the feature extraction unit 2A extracts the classification feature b and the signal regeneration feature c from the input signal a divided into sections based on the section information h input from the section information acquisition unit 9 (step ST1c).
  • the feature extraction unit 2A outputs the classification feature b extracted from the input signal a for each section to the data estimation unit 3A, and regenerates the signal regeneration feature c extracted from the input signal a for each section. Output to section 6A.
  • the data estimation unit 3A estimates classification data d for each section of the input signal a from the classification feature b extracted from the input signal a for each section using the DNN 3a (step ST2c).
  • the data estimation unit 3A outputs classification data d for each section to the data classification unit 5A.
  • the data classification unit 5A classifies classification data d estimated for each section of the input signal a for each component based on the classification condition data e calculated by the data calculation unit 7A (step ST3c). Classification result information f that is classification data d classified for each component is output from the data classification unit 5A to the signal regeneration unit 6A.
  • The signal regeneration unit 6A regenerates an acoustic signal for each component based on the classification data d classified for each component in the classification result information f, the section information h, and the signal regeneration feature amount c of the input signal a for each section (step ST4c).
  • the feature quantity extraction unit 2A confirms whether or not an unprocessed section remains in the input signal a for each section (step ST5c).
  • If an unprocessed section remains (step ST5c; YES), the process returns to step ST1c, and the series of processes described above is performed on the input signal a of the remaining section. If no unprocessed section remains (step ST5c; NO), the acoustic signal separation device 1C ends the processing of FIG. 13.
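  • A hedged sketch of the per-section loop of Embodiment 4, assuming the section information h is a list of (start, end) sample indices and abstracting each unit as a callable; it mirrors the flowchart of FIG. 13 rather than reproducing the actual implementation.

```python
def separate_by_section(input_signal_a, section_info_h, extract_fn, estimate_fn,
                        condition_fn, classify_fn, regenerate_fn):
    """Sketch of Embodiment 4: the input signal a is divided into sections by the
    section information h, and each section is separated independently."""
    outputs = []
    for start, end in section_info_h:                 # one section of the input signal a
        section = input_signal_a[start:end]
        feature_b, feature_c = extract_fn(section)    # feature extraction unit 2A (ST1c)
        data_d = estimate_fn(feature_b)               # data estimation unit 3A, DNN 3a (ST2c)
        condition_e = condition_fn(section)           # data calculation unit 7A
        result_f = classify_fn(data_d, condition_e)   # data classification unit 5A (ST3c)
        outputs.append(regenerate_fn(result_f, feature_c))  # signal regeneration 6A (ST4c)
    return outputs                                    # output signal g per section
```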
  • the feature quantity extraction unit 2A extracts the feature quantity from the input signal a for each section based on the section information h.
  • the data estimation unit 3A estimates classification data d for each section based on the classification feature b using DNN 3a.
  • the data classification unit 5A classifies classification data d for each section for each component based on the classification condition data e.
  • The signal regeneration unit 6A regenerates an acoustic signal for each component for each section of the input signal a.
  • The present invention is not limited to the above embodiments; within the scope of the invention, the embodiments may be freely combined, any component of any embodiment may be modified, and any optional component may be omitted from any embodiment.
  • The acoustic signal separation device according to the present invention can separate an acoustic signal into its components with high accuracy, and can therefore be used in, for example, a conference system in which a plurality of sound sources are present.
  • Reference signs: 1, 1A, 1B, 1C: acoustic signal separation device; 2, 2A: feature quantity extraction unit; 3, 3A: data estimation unit; 3a: DNN; 3b: network parameter; 4: data acquisition unit; 5, 5A: data classification unit; 6, 6A: signal regeneration unit; 7, 7A: data calculation unit; 8: parameter switching unit; 9: section information acquisition unit; 100: acoustic interface; 101: image interface; 102: text input interface; 103: processing circuit; 104: processor; 105: memory; 106: sensor interface.

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A feature extraction unit (2) extracts a feature (b) for classification and a feature (c) for signal regeneration from an input signal (a). A data estimation unit (3) estimates data (d) for classification for inter-component mapping of acoustic signals output from the same sound source using DNN (3a) on the basis of the feature (b) for classification. A data classification unit (5) classifies the data (d) for classification for each component on the basis of classification parameter data (e) in which a classification condition is established, the classification parameter including information associated with at least either the number of components in the acoustic signals or the types of components in the acoustic signals. A signal regeneration unit (6) regenerates a signal for each component on the basis of data (f) for classification and the feature (c) for signal regeneration.

Description

Acoustic signal separation apparatus and acoustic signal separation method
The present invention relates to an acoustic signal separation device and an acoustic signal separation method for separating an acoustic signal in which one or more components are mixed into an acoustic signal for each component.
As a conventional technique for separating an acoustic signal in which one or more components are mixed into an acoustic signal for each component, there is, for example, the method described in Patent Document 1. In this method, a deep neural network (hereinafter referred to as a DNN) is used to separate an acoustic signal in which one or more components are mixed into an acoustic signal for each component.
International Publication No. WO 2017/007035
In the method described in Patent Document 1, the DNN separates the acoustic signal without regard to the number of components of the acoustic signal or the type of those components. Consequently, when the number of components was not considered at design time, or when the number of components or the type of components changes dynamically, the acoustic signal cannot be separated with high accuracy.
The present invention solves this problem, and aims to provide an acoustic signal separation device and an acoustic signal separation method that can accurately separate an acoustic signal into its components.
An acoustic signal separation device according to the present invention includes a feature extraction unit, a data estimation unit, a data classification unit, and a signal regeneration unit. The feature extraction unit extracts a feature amount from an input signal including an acoustic signal in which one or more components are mixed. The data estimation unit estimates first data, which associates components of the acoustic signal output from the same sound source with one another, based on the extracted feature amount, using a DNN whose network parameters have been trained in advance for this estimation. The data classification unit classifies the first data into components based on second data in which a classification condition including information on at least one of the number of components of the acoustic signal and the type of those components is set. The signal regeneration unit regenerates an acoustic signal for each component based on the first data classified for each component and the extracted feature amount.
 According to the present invention, the feature extraction unit extracts feature amounts from the input signal, the data estimation unit estimates, using the DNN and on the basis of the feature amount, first data that associates components of acoustic signals output from the same sound source with one another, the data classification unit classifies the first data for each component on the basis of second data in which a classification condition including information on at least one of the number of components of the acoustic signal and the types of the components is set, and the signal regeneration unit regenerates an acoustic signal for each component on the basis of the first data classified for each component and the feature amount. Consequently, even when the number of components of the acoustic signal was not considered at design time, or when the number of components of the acoustic signal or the types of the components change dynamically, the acoustic signal separation device can accurately separate the acoustic signal into its components.
FIG. 1 is a block diagram showing the configuration of an acoustic signal separation device according to Embodiment 1 of the present invention.
FIG. 2A is a block diagram showing a hardware configuration that implements the functions of the acoustic signal separation device according to Embodiment 1. FIG. 2B is a block diagram showing a hardware configuration that executes software implementing the functions of the acoustic signal separation device according to Embodiment 1.
FIG. 3 is a flowchart showing an acoustic signal separation method according to Embodiment 1.
FIG. 4A is a diagram showing classification data corresponding to two different components mapped into a two-dimensional space. FIG. 4B is a diagram showing the classification data of FIG. 4A classified by component.
FIG. 5A is a diagram showing classification data corresponding to the same component mapped into a two-dimensional space. FIG. 5B is a diagram showing the classification data of FIG. 5A incorrectly classified into two components. FIG. 5C is a diagram showing the classification data of FIG. 5A classified into one component.
FIG. 6A is a diagram showing classification data and classification condition data corresponding to two components arranged in time series. FIG. 6B is a diagram showing the classification data of FIG. 6A classified by component and the smoothed classification condition data.
FIG. 7 is a block diagram showing the configuration of an acoustic signal separation device according to Embodiment 2 of the present invention.
FIG. 8A is a block diagram showing a hardware configuration that implements the functions of the acoustic signal separation device according to Embodiment 2. FIG. 8B is a block diagram showing a hardware configuration that executes software implementing the functions of the acoustic signal separation device according to Embodiment 2.
FIG. 9 is a flowchart showing an acoustic signal separation method according to Embodiment 2.
FIG. 10 is a block diagram showing the configuration of an acoustic signal separation device according to Embodiment 3 of the present invention.
FIG. 11 is a flowchart showing an acoustic signal separation method according to Embodiment 3.
FIG. 12 is a block diagram showing the configuration of an acoustic signal separation device according to Embodiment 4 of the present invention.
FIG. 13 is a flowchart showing an acoustic signal separation method according to Embodiment 4.
 Hereinafter, in order to explain the present invention in more detail, embodiments for carrying out the present invention will be described with reference to the accompanying drawings.
Embodiment 1.
 FIG. 1 is a block diagram showing the configuration of an acoustic signal separation device 1 according to Embodiment 1 of the present invention. The acoustic signal separation device 1 includes a feature extraction unit 2, a data estimation unit 3, a data acquisition unit 4, a data classification unit 5, and a signal regeneration unit 6. It separates the acoustic signal included in an input signal a into acoustic signals for each component, and outputs an output signal g including the acoustic signal for each component.
 The feature extraction unit 2 extracts feature amounts from the input signal a. The input signal a may be an acoustic signal in which one or more components are mixed, or may be a signal that includes such an acoustic signal together with other signals. For example, the input signal a may include, in addition to the acoustic signal, an image signal or text data associated with the acoustic signal.
 The feature amounts extracted from the input signal a by the feature extraction unit 2 are a classification feature amount b and a signal regeneration feature amount c. The classification feature amount b is a feature amount used by the data estimation unit 3 to estimate classification data d. For example, the feature extraction unit 2 applies a short-time Fourier transform to the acoustic signal included in the input signal a to obtain amplitudes on the frequency axis, and calculates feature amounts based on those amplitudes. Data in which the feature amounts calculated from the acoustic signal in this way are arranged in time series may be used as the classification feature amount b.
 The signal regeneration feature amount c is a feature amount used by the signal regeneration unit 6 to regenerate the signal for each component. For example, the signal regeneration feature amount c may be spectral coefficients calculated by the feature extraction unit 2 by applying a short-time Fourier transform to the acoustic signal included in the input signal a, or may be image information or text data included in the input signal a.
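 The patent does not prescribe a specific implementation, but as a rough illustration of the feature extraction described above, the following sketch computes a short-time Fourier transform of the acoustic signal and derives log-magnitude frames as a stand-in for the classification feature amount b, keeping the complex coefficients as a stand-in for the signal regeneration feature amount c. The frame length, hop size, window, and log compression are assumptions for illustration only.

```python
import numpy as np

def stft(signal, frame_len=512, hop=128):
    """Short-time Fourier transform: (num_frames, frame_len // 2 + 1) complex matrix."""
    window = np.hanning(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(num_frames)])
    return np.fft.rfft(frames, axis=1)

def extract_features(signal):
    """Return (classification feature b, signal regeneration feature c) stand-ins."""
    coeffs_c = stft(signal)                       # complex spectral coefficients
    features_b = np.log(np.abs(coeffs_c) + 1e-8)  # log-magnitudes arranged in time series
    return features_b, coeffs_c
```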
 The data estimation unit 3 uses a DNN 3a to estimate the classification data d based on the classification feature amount b extracted from the input signal a by the feature extraction unit 2. The classification data d is first data that associates components of acoustic signals output from the same sound source with one another. For example, the classification data d may be costs between components of the acoustic signal that have been transformed so that the distance between time-frequency components of acoustic signals output from the same sound source becomes small.
 In the DNN 3a, network parameters 3b trained in advance to estimate the classification data d based on the classification feature amount b are set. The DNN 3a in which the network parameters 3b are set estimates the classification data d by hierarchically applying operations to the classification feature amount b. For example, an RNN (Recurrent Neural Network) or a CNN (Convolutional Neural Network) may be used as the DNN 3a.
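 As a loose illustration only, the sketch below stands in for the data estimation unit 3: a single dense layer with placeholder weights (playing the role of the trained network parameters 3b) maps each frame of the classification feature amount b to one embedding vector per time-frequency bin, so that bins from the same source can end up close together. An actual implementation would use a trained RNN or CNN as described above; the class name, shapes, and random weights are assumptions.

```python
import numpy as np

class ToyEmbeddingNetwork:
    """Hypothetical stand-in for DNN 3a; a real system would use a trained RNN or CNN."""

    def __init__(self, num_bins, embed_dim=20, seed=0):
        rng = np.random.default_rng(seed)
        self.num_bins = num_bins
        self.embed_dim = embed_dim
        # Placeholder for the pre-trained network parameters 3b.
        self.weights = 0.01 * rng.standard_normal((num_bins, num_bins * embed_dim))

    def estimate_classification_data(self, features_b):
        """Map (frames, bins) features to (frames, bins, embed_dim) classification data d."""
        hidden = np.tanh(features_b @ self.weights)
        data_d = hidden.reshape(features_b.shape[0], self.num_bins, self.embed_dim)
        # Normalise so that distances between bins from the same source stay comparable.
        norms = np.linalg.norm(data_d, axis=-1, keepdims=True) + 1e-8
        return data_d / norms
```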
 The data acquisition unit 4 receives and acquires classification condition data e. The classification condition data e acquired by the data acquisition unit 4 is output to the data classification unit 5.
 In the acoustic signal separation device 1, the data classification unit 5 may acquire the classification condition data e directly, and the data acquisition unit 4 may be provided in a device separate from the acoustic signal separation device 1.
 That is, in the acoustic signal separation device 1, it is sufficient that the data classification unit 5 has a function of acquiring the classification condition data e, and the data acquisition unit 4 need not be provided.
 The classification condition data e is second data in which a classification condition for the classification data d is set. The classification condition set in the classification condition data e includes information on at least one of the number of components of the acoustic signal and the types of the components of the acoustic signal. The information on the number of components of the acoustic signal may be data indicating a dynamically changing number of sound sources, for example, a sound source count sequence in which the number of sound sources is arranged in time series.
 The information on the types of the components of the acoustic signal may be any information that can identify the sound source, such as the gender of a speaker, the type of language, or the type of emitted sound. For example, when the types of the components of the acoustic signal are a siren and an animal cry, the acoustic signal is a mixture of a siren component output from an alarm device and a cry component emitted by an animal.
 The data classification unit 5 classifies the classification data d estimated by the data estimation unit 3 for each component, based on the classification condition data e acquired by the data acquisition unit 4.
 A classification method such as k-means clustering or GMM (Gaussian Mixture Models) may be used to classify the classification data d.
 Classification result information f, which is the classification data d classified by the data classification unit 5, is output to the signal regeneration unit 6.
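 As one hedged sketch of this step, assuming the classification data d is an array of embedding-like vectors per time-frequency bin (as in the earlier stand-in) and that the classification condition simply supplies the number of sound sources, k-means clustering could be applied as follows. scikit-learn's KMeans is one possible implementation choice, not something prescribed by the patent.

```python
import numpy as np
from sklearn.cluster import KMeans

def classify_components(classification_data_d, num_sources):
    """Cluster (frames, bins, dim) classification data into per-bin component labels."""
    frames, bins, dim = classification_data_d.shape
    flat = classification_data_d.reshape(-1, dim)
    labels = KMeans(n_clusters=num_sources, n_init=10).fit_predict(flat)
    return labels.reshape(frames, bins)   # classification result information f
```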
 The signal regeneration unit 6 receives the classification result information f from the data classification unit 5, and regenerates an acoustic signal for each component from the signal regeneration feature amount c based on the classification data d for each component in the classification result information f. The signal regeneration unit 6 outputs an output signal g, which contains the regenerated acoustic signal for each component. The output signal g may also include an image signal and text information corresponding to the regenerated acoustic signal of each component.
 FIG. 2A is a block diagram showing a hardware configuration that implements the functions of the acoustic signal separation device 1. FIG. 2B is a block diagram showing a hardware configuration that executes software implementing the functions of the acoustic signal separation device 1. In FIG. 2A and FIG. 2B, an acoustic interface 100 is an interface that receives the acoustic signal included in the input signal a and outputs the acoustic signal included in the output signal g. For example, the acoustic interface 100 is connected to a microphone that collects the acoustic signal and to a speaker that outputs the acoustic signal.
 An image interface 101 is an interface that receives the image signal included in the input signal a and outputs the image signal included in the output signal g. For example, the image interface 101 is connected to a camera that captures the image signal and to a display that displays the image signal.
 A text input interface 102 is an interface for inputting the text information included in the input signal a. For example, the text input interface 102 is connected to a keyboard or a mouse for inputting text information.
 A memory (not shown) included in the processing circuit 103 shown in FIG. 2A, or the memory 105 shown in FIG. 2B, temporarily stores the input signal a, the classification feature amount b, the signal regeneration feature amount c, the classification data d, the classification condition data e, the classification result information f, and the output signal g. The processing circuit 103 or the processor 104 reads these data as needed and performs the acoustic signal separation processing.
 The functions of the feature extraction unit 2, the data estimation unit 3, the data acquisition unit 4, the data classification unit 5, and the signal regeneration unit 6 in the acoustic signal separation device 1 are implemented by a processing circuit.
 That is, the acoustic signal separation device 1 includes a processing circuit for executing the processing from step ST1 to step ST4 described later with reference to FIG. 3. The processing circuit may be dedicated hardware, or may be a CPU (Central Processing Unit) that executes a program stored in a memory.
 When the processing circuit is the dedicated hardware processing circuit 103 shown in FIG. 2A, the processing circuit 103 corresponds to, for example, a single circuit, a composite circuit, a programmed processor, a parallel-programmed processor, an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or a combination thereof. The functions of the feature extraction unit 2, the data estimation unit 3, the data acquisition unit 4, the data classification unit 5, and the signal regeneration unit 6 may be implemented by separate processing circuits, or these functions may be implemented collectively by a single processing circuit.
 When the processing circuit is the processor 104 shown in FIG. 2B, the functions of the feature extraction unit 2, the data estimation unit 3, the data acquisition unit 4, the data classification unit 5, and the signal regeneration unit 6 are implemented by software, firmware, or a combination of software and firmware. The software or firmware is written as a program and stored in the memory 105.
 The processor 104 reads and executes the program stored in the memory 105, thereby implementing the functions of the feature extraction unit 2, the data estimation unit 3, the data acquisition unit 4, the data classification unit 5, and the signal regeneration unit 6. That is, the acoustic signal separation device 1 includes the memory 105 for storing a program which, when executed by the processor 104, results in the execution of the processing from step ST1 to step ST4 shown in FIG. 3. These programs cause a computer to execute the procedures or methods of the feature extraction unit 2, the data estimation unit 3, the data acquisition unit 4, the data classification unit 5, and the signal regeneration unit 6.
 The memory 105 may be a computer-readable storage medium storing a program for causing a computer to function as the feature extraction unit 2, the data estimation unit 3, the data acquisition unit 4, the data classification unit 5, and the signal regeneration unit 6.
 The memory 105 corresponds to, for example, a nonvolatile or volatile semiconductor memory such as a RAM (Random Access Memory), a ROM (Read Only Memory), a flash memory, an EPROM (Erasable Programmable Read Only Memory), or an EEPROM (Electrically-EPROM), a magnetic disk, a flexible disk, an optical disc, a compact disc, a mini disc, or a DVD. The memory 105 may also be an external memory such as a USB (Universal Serial Bus) memory.
 Some of the functions of the feature extraction unit 2, the data estimation unit 3, the data acquisition unit 4, the data classification unit 5, and the signal regeneration unit 6 may be implemented by dedicated hardware, and some by software or firmware. For example, the functions of the feature extraction unit 2, the data estimation unit 3, and the data acquisition unit 4 may be implemented by a processing circuit that is dedicated hardware, while the functions of the data classification unit 5 and the signal regeneration unit 6 may be implemented by the processor 104 reading and executing a program stored in the memory 105. In this way, the processing circuit can implement each of the above functions by hardware, software, firmware, or a combination thereof.
 Next, the operation will be described.
 FIG. 3 is a flowchart showing the acoustic signal separation method according to Embodiment 1.
 The feature extraction unit 2 extracts the classification feature amount b and the signal regeneration feature amount c from the input signal a (step ST1). The classification feature amount b is output from the feature extraction unit 2 to the data estimation unit 3, and the signal regeneration feature amount c is output from the feature extraction unit 2 to the signal regeneration unit 6.
 The input signal a may include, in addition to the acoustic signal received by the acoustic interface 100, an image signal input through the image interface 101 or text information input through the text input interface 102.
 The feature extraction unit 2 may also read the input signal a from the memory (not shown) included in the processing circuit 103 or from the memory 105 and extract the feature amounts from it.
 Furthermore, the input signal a may be stream data.
 Next, the data estimation unit 3 estimates the classification data d based on the classification feature amount b using the DNN 3a (step ST2). The classification data d is output from the data estimation unit 3 to the data classification unit 5.
 The data classification unit 5 classifies the classification data d estimated by the data estimation unit 3 for each component, based on the classification condition data e (step ST3). The data classification unit 5 outputs to the signal regeneration unit 6 the classification result information f, which is the classification data d classified for each component.
 FIG. 4A is a diagram showing classification data d1 and d2 corresponding to components of two different acoustic signals mapped into a two-dimensional space. In the example shown in FIG. 4A, there are two sound sources, sound source A and sound source B, and the input signal a contains a mixture of the component of the acoustic signal output from sound source A and the component of the acoustic signal output from sound source B.
 The classification data d1 indicated by circles is data that associates the components of the acoustic signal output from sound source A, and the classification data d2 indicated by triangles is data that associates the components of the acoustic signal output from sound source B.
 However, when, for example, the output state of the acoustic signal from a sound source changes, the classification feature amount b also changes accordingly. Therefore, when the data estimation unit 3 estimates the classification data d based on the classification feature amount b using the DNN 3a, the values of the classification data d may vary with the change of the classification feature amount b even for classification data d corresponding to components of the acoustic signal output from the same sound source.
 The classification data d that has spread over a plurality of values is input to the data classification unit 5 without any indication of whether it is classification data d1 belonging to sound source A or classification data d2 belonging to sound source B.
 FIG. 4B is a diagram showing the classification data d1 and d2 classified by component. In FIG. 4A and FIG. 4B, it is assumed that a classification condition stating that the number of sound sources is two is set in the classification condition data e. Since the number of sound sources is two, the data classification unit 5 classifies the classification data d1 into a first group A1 corresponding to sound source A, and classifies the classification data d2 into a second group A2 corresponding to sound source B.
 FIG. 5A is a diagram showing classification data d1 corresponding to the same component mapped into a two-dimensional space. FIG. 5B is a diagram showing the classification data d1 incorrectly classified into two components. FIG. 5C is a diagram showing the classification data d1 classified into one component.
 In the example shown in FIG. 5A, the only sound source is sound source A, and the acoustic signal included in the input signal a contains only the component of the acoustic signal output from sound source A.
 However, in FIG. 5A, the values of the classification data d1 vary, so that classification data d1 with a plurality of values arise. Furthermore, the classification data d that has spread over a plurality of values is input to the data classification unit 5 without any indication of which sound source it belongs to.
 In the method described in Patent Document 1, the acoustic signal is separated into a separation number preset in the DNN (for example, a separation number of 2). Therefore, when such a DNN is applied to the classification of the classification data d, as shown in FIG. 5B, the plurality of classification data d1 belonging to the single sound source A are highly likely to be erroneously separated into a first group B1 and a second group B2 that differ from each other.
 In contrast, in the acoustic signal separation device 1, the data classification unit 5 classifies the classification data d based on, for example, the number of sound sources set in the classification condition data e. In FIG. 5A, the number of sound sources is one (sound source A only), so the data classification unit 5 correctly classifies the plurality of classification data d into a group C corresponding to sound source A, as shown in FIG. 5C. In this way, the acoustic signal separation device 1 can prevent separation errors of the acoustic signal caused by a mismatch between the separation number and the number of sound sources.
 FIG. 6A is a diagram showing classification data d1 and d2 and classification condition data e corresponding to two components arranged in time series. For example, when a plurality of speakers are the sound sources and the speakers speak, the number of sound sources changes dynamically. The example shown in FIG. 6A shows a case in which an acoustic signal is output from sound source A, and an acoustic signal is subsequently also output from sound source B.
 As described above, the method described in Patent Document 1 cannot accurately separate the acoustic signal when the number of sound sources changes dynamically.
 In contrast, in the acoustic signal separation device 1, the classification data d can be classified accurately by using a classification condition that includes information on at least one of the number of components of the acoustic signal and the types of the components. For example, as shown in FIG. 6A, the data classification unit 5 uses the classification condition data e, which is a sound source count sequence indicating the number of sound sources in time series, to classify the classification data d1 into a first group D1 corresponding to sound source A and the classification data d2 into a second group D2 corresponding to sound source B.
 In this way, by referring to the sound source count sequence indicated by the classification condition data e, the data classification unit 5 can classify the classification data d accurately even when the number of sound sources changes dynamically.
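 The following is a hedged sketch, under assumed data shapes, of how such a time-varying sound source count sequence could drive the classification: each frame of classification data is clustered with the number of clusters taken from the corresponding entry of the sequence. Frame-wise clustering is one possible realisation, not the only one consistent with the description.

```python
import numpy as np
from sklearn.cluster import KMeans

def classify_with_source_count_sequence(classification_data_d, source_counts_e):
    """classification_data_d: (frames, bins, dim); source_counts_e: (frames,) integers."""
    labels_per_frame = []
    for frame_vectors, k in zip(classification_data_d, source_counts_e):
        if k <= 1:
            labels = np.zeros(frame_vectors.shape[0], dtype=int)
        else:
            labels = KMeans(n_clusters=int(k), n_init=10).fit_predict(frame_vectors)
        labels_per_frame.append(labels)
    return np.stack(labels_per_frame)   # (frames, bins) component labels
```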
 FIG. 6B is a diagram showing the classification data d1 and d2 classified by component and the smoothed classification condition data. In the classification condition data e shown in FIG. 6A, the number of sound sources at time e1 is "1" even though the number of sound sources in the times before and after time e1 continues to be "2". When the number of sound sources is thus expected to be erroneous in view of the trend in the surrounding times, the data classification unit 5 smooths the time-series change of the sound source count sequence. For example, the data classification unit 5 converts the number of sound sources at time e1 to the average of the value "2" in the times before and after time e1. As a result, even if the number of sound sources in the sound source count sequence changes abruptly, the data classification unit 5 can classify the classification data d accurately.
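 As a small illustration of this smoothing, under the assumption that the sound source count sequence is simply an integer array, a rounded moving average over neighbouring frames removes an isolated outlier such as the value at time e1. The window length and the choice of a moving average are illustrative assumptions.

```python
import numpy as np

def smooth_source_counts(counts, window=3):
    """Smooth a time series of sound source counts and round back to integers."""
    kernel = np.ones(window) / window
    padded = np.pad(counts.astype(float), window // 2, mode="edge")
    averaged = np.convolve(padded, kernel, mode="valid")
    return np.rint(averaged).astype(int)

# An isolated dip from 2 sources to 1 is smoothed back to 2.
print(smooth_source_counts(np.array([2, 2, 1, 2, 2])))   # -> [2 2 2 2 2]
```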
 Although the case where the classification condition is time-series data of the number of sound sources has been described, the classification condition may be time-series data of the types of the components of the acoustic signal. For example, in FIG. 6A, during the period in which sound source A outputs an acoustic signal, information on the type of the component of the acoustic signal output from sound source A is set as the classification condition of the classification condition data e. During the period in which both sound source A and sound source B output acoustic signals, information on the types of the components of the acoustic signals output from sound source A and sound source B is set as the classification condition of the classification condition data e. By referring to the time-series data of the types of the components of the acoustic signal, the data classification unit 5 can classify the classification data d accurately even when the types of the components of the acoustic signal change dynamically.
 Returning to the description of FIG. 3, the signal regeneration unit 6 receives the signal regeneration feature amount c from the feature extraction unit 2 and the classification result information f from the data classification unit 5, and regenerates an acoustic signal for each component based on the classification data d for each component in the classification result information f and the signal regeneration feature amount c (step ST4).
 For example, the signal regeneration unit 6 identifies the signal regeneration feature amount c corresponding to sound source A using the classification data d1 classified into the first group A1 shown in FIG. 4B, and regenerates the acoustic signal of that component based on the identified signal regeneration feature amount c and the classification data d1.
 Similarly, the signal regeneration unit 6 identifies the signal regeneration feature amount c corresponding to sound source B using the classification data d2 classified into the second group A2 shown in FIG. 4B, and regenerates the signal of the component output from sound source B based on the identified signal regeneration feature amount c and the classification data d2. The acoustic signal included in the input signal a is thereby separated into the acoustic signal of the component output from sound source A and the acoustic signal of the component output from sound source B.
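 One common way of realising this regeneration, offered here only as a hedged sketch rather than the patent's prescribed method, is to turn the per-bin component labels into a binary time-frequency mask for each component, apply the mask to the complex spectral coefficients (the signal regeneration feature amount c), and resynthesise each component with an inverse STFT. The use of scipy.signal.istft and the STFT parameters are assumptions.

```python
import numpy as np
from scipy.signal import istft

def regenerate_components(coeffs_c, labels_f, num_sources, fs, nperseg=512, noverlap=384):
    """coeffs_c: (frames, bins) complex STFT; labels_f: (frames, bins) component labels."""
    separated = []
    for source in range(num_sources):
        mask = (labels_f == source).astype(float)     # binary time-frequency mask
        masked = (coeffs_c * mask).T                  # istft expects (bins, frames)
        _, waveform = istft(masked, fs=fs, nperseg=nperseg, noverlap=noverlap)
        separated.append(waveform)
    return separated                                  # one time-domain signal per component
```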
 The signal regeneration unit 6 may also output the regenerated acoustic signal of each component in association with the classification condition data e. For example, when information indicating that the sound sources are sound source A and sound source B is set in the classification condition data e as information on the types of the components of the acoustic signal, the signal regeneration unit 6 associates the regenerated signal corresponding to sound source A with information indicating that its sound source is sound source A, associates the signal corresponding to sound source B with information indicating that its sound source is sound source B, and outputs them. In this way, the signal regeneration unit 6 can provide an output signal g from which the sound source that output each component-wise acoustic signal can be identified.
 As a method of performing the above association, the signal regeneration unit 6 may associate the type of the component of the acoustic signal with the regenerated component acoustic signal based on the distance between a teacher signal related to the type of the component and the regenerated component acoustic signal.
 The signal regeneration unit 6 may also compare the utterance timing of a speaker analyzed from the image signal extracted from the input signal a with the output timing of the regenerated component acoustic signal, and associate a speaker and an acoustic signal whose timings overlap. For example, the signal regeneration unit 6 may identify the utterance timing of the speaker from the lip information of the speaker included in the image signal.
 The classification condition set in the classification condition data e may be the number of components of the acoustic signal, but may also be at least one of a lower limit, an upper limit, and a range of the number of components of the acoustic signal. For example, when separating the speech of a meeting into an acoustic signal for each attendee, the data classification unit 5 may stop classifying the classification data d when the number of components of the acoustic signal falls below the lower limit. The data classification unit 5 may also stop classifying the classification data d when the number of components of the acoustic signal exceeds the upper limit, or when the number of components of the acoustic signal goes outside the range of the classification condition. In these cases, the separation of the acoustic signal is stopped. In this way, a stop criterion for the classification of the classification data d may be set in the classification condition, and the data classification unit 5 can classify the classification data d accurately within the range specified by the classification condition.
 Feature information indicating a feature of the sound source may also be set in the classification condition of the classification condition data e. The feature information indicating a feature of the sound source may be the manner in which the sound source outputs the acoustic signal, or a physical feature of the sound source. For example, the output manner of the acoustic signal may be the average output time over which the acoustic signal is output from the sound source. When the sound source is a person, the physical feature of the sound source may be information on the person's physique. The data classification unit 5 can classify the classification data d accurately in accordance with the feature information indicating the feature of the sound source.
 As described above, in the acoustic signal separation device 1 according to Embodiment 1, the feature extraction unit 2 extracts feature amounts from the input signal a. The data estimation unit 3 estimates the classification data d based on the classification feature amount b using the DNN 3a. The data classification unit 5 classifies the classification data d for each component based on the classification condition data e in which a classification condition including information on at least one of the number of components of the acoustic signal and the types of the components is set. The signal regeneration unit 6 regenerates an acoustic signal for each component based on the classification data d classified for each component and the signal regeneration feature amount c. Consequently, even when the number of components of the acoustic signal was not considered at design time, or when the number of components of the acoustic signal or the types of the components change dynamically, the acoustic signal separation device 1 can accurately separate the acoustic signal into its components.
 In the acoustic signal separation device 1 according to Embodiment 1, the classification condition set in the classification condition data e is at least one of a lower limit of the number of components of the acoustic signal, an upper limit of the number of sound sources, and a range. This enables the data classification unit 5 to classify the classification data d accurately within the range specified by the classification condition.
 In the acoustic signal separation device 1 according to Embodiment 1, the classification condition set in the classification condition data e is a sound source count sequence indicating a time-series change in the number of sound sources. By referring to the sound source count sequence, the data classification unit 5 can classify the classification data d accurately even when the number of sound sources changes dynamically.
 In the acoustic signal separation device 1 according to Embodiment 1, the data classification unit 5 smooths the time-series change of the sound source count sequence. This enables the data classification unit 5 to classify the classification data d accurately even when the number of sound sources in the sound source count sequence changes abruptly.
 In the acoustic signal separation device 1 according to Embodiment 1, the classification condition set in the classification condition data e is feature information indicating a feature of the sound source. This enables the data classification unit 5 to classify the classification data d accurately in accordance with the feature information indicating the feature of the sound source.
 In the acoustic signal separation device 1 according to Embodiment 1, the signal regeneration unit 6 outputs information on the types of the components of the acoustic signal in association with the signal of each component. This enables the signal regeneration unit 6 to provide an output signal g from which the sound source that output the regenerated acoustic signal can be identified.
Embodiment 2.
 FIG. 7 is a block diagram showing the configuration of an acoustic signal separation device 1A according to Embodiment 2 of the present invention. In FIG. 7, the same components as in FIG. 1 are given the same reference numerals and their description is omitted. The acoustic signal separation device 1A includes a feature extraction unit 2, a data estimation unit 3, a data classification unit 5, a signal regeneration unit 6, and a data calculation unit 7. It separates the acoustic signal included in the input signal a into acoustic signals for each component, and outputs an output signal g including the acoustic signal for each component.
 The data calculation unit 7 calculates the classification condition data e from the input signal a. The input signal a includes, in addition to the acoustic signal, sensor information for identifying the number of components of the acoustic signal and the types of the components. The sensor information includes, for example, biological information of a person who can be a sound source, such as brain waves or a heart rate, image information in which a person who can be a sound source is captured, and physical information such as vibration or temperature changes caused by the person's speech. Using the sensor information included in the input signal a, the data calculation unit 7 calculates the classification condition data e in which a classification condition including information on at least one of the number of components of the acoustic signal and the types of the components is set.
 FIG. 8A is a block diagram showing a hardware configuration that implements the functions of the acoustic signal separation device 1A. FIG. 8B is a block diagram showing a hardware configuration that executes software implementing the functions of the acoustic signal separation device 1A. In FIG. 8A and FIG. 8B, the same components as in FIG. 2A and FIG. 2B are given the same reference numerals and their description is omitted. A sensor interface 106 is an interface for inputting the sensor information described above. For example, the sensor interface 106 is connected to a biological sensor that detects biological information of a person who can be a sound source, a camera that captures a person who can be a sound source, or a physical sensor that detects vibration or temperature changes caused by a person's speech.
 The functions of the feature extraction unit 2, the data estimation unit 3, the data classification unit 5, the signal regeneration unit 6, and the data calculation unit 7 in the acoustic signal separation device 1A are implemented by a processing circuit. That is, the acoustic signal separation device 1A includes a processing circuit for executing the processing from step ST1a to step ST5a described later with reference to FIG. 9. The processing circuit may be the processing circuit 103, which is dedicated hardware, or may be the processor 104 that executes a program stored in the memory 105.
 Next, the operation will be described.
 FIG. 9 is a flowchart showing the acoustic signal separation method according to Embodiment 2. In FIG. 9, the processing in steps ST1a, ST3a, ST4a, and ST5a is the same as that in steps ST1, ST2, ST3, and ST4 shown in FIG. 3, and its description is therefore omitted.
 In step ST2a, the data calculation unit 7 calculates the classification condition data e based on the input signal a including the sensor information. For example, the data calculation unit 7 identifies the presence or absence of persons and the number of persons based on the sensor information, and calculates the classification condition data e in which a classification condition including the identified information is set.
 The data calculation unit 7 may also detect a person's speech based on the person's lip information and thereby identify the number of components of the acoustic signal.
 Furthermore, the data calculation unit 7 may identify the number of components of the acoustic signal and the types of the components using a DNN that has been machine-learned to output at least one of the number of components of the acoustic signal and the types of the components from the sensor information included in the input signal a.
 The data calculation unit 7 may also detect the number of acoustic emission sources based on the acoustic signal and identify the number of components of the acoustic signal. For example, the number of speakers detected using a speaker verification technique that detects specific speakers may be used as the number of components of the acoustic signal.
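 As a small, hedged illustration of one such calculation, assume that per-frame speaking-activity flags (for example, derived from lip movement of persons detected in the camera image) are already available; the number of simultaneously active persons then gives the time series of source counts in the classification condition data e. The activity detection itself is assumed to be provided elsewhere.

```python
import numpy as np

def source_counts_from_activity(activity):
    """activity: (frames, persons) boolean array -> (frames,) sound source counts."""
    return activity.astype(int).sum(axis=1)

activity = np.array([[True, False],
                     [True, True],
                     [False, True]])
print(source_counts_from_activity(activity))   # -> [1 2 1]
```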
 The data classification unit 5 classifies the classification data d estimated by the data estimation unit 3 for each component based on the classification condition data e calculated by the data calculation unit 7. From the classification data d classified for each component, the data classification unit 5 generates classification result information f in which pieces of classification data d belonging to the same component are associated with one another, and outputs it to the signal regeneration unit 6.
 As described above, the acoustic signal separation device 1A according to Embodiment 2 includes the data calculation unit 7 that calculates the classification condition data e from the input signal a. With this configuration, the acoustic signal separation device 1A can obtain the classification condition data e from the input signal a. As a result, the acoustic signal separation device 1A can accurately separate the acoustic signal even when the number of components of the acoustic signal or the types of the components are unknown.
Embodiment 3.
 FIG. 10 is a block diagram showing the configuration of an acoustic signal separation device 1B according to Embodiment 3 of the present invention. In FIG. 10, the same components as in FIG. 1 and FIG. 7 are given the same reference numerals and their description is omitted. The acoustic signal separation device 1B includes a feature extraction unit 2, a data estimation unit 3, a data classification unit 5, a signal regeneration unit 6, a data calculation unit 7, and a parameter switching unit 8. It separates the acoustic signal included in the input signal a into acoustic signals for each component, and outputs an output signal g including the acoustic signal for each component.
 The parameter switching unit 8 switches the network parameters 3b of the DNN 3a based on the classification condition data e. The network parameters 3b are learned in advance in accordance with the features of various sound sources. For example, the parameter switching unit 8 refers to the classification condition data e, selects, from among the various network parameters 3b learned in advance, the network parameters 3b corresponding to the features of the sound source to be processed, and sets these network parameters 3b in the DNN 3a.
 The functions of the feature extraction unit 2, the data estimation unit 3, the data classification unit 5, the signal regeneration unit 6, the data calculation unit 7, and the parameter switching unit 8 in the acoustic signal separation device 1B are implemented by a processing circuit. That is, the acoustic signal separation device 1B includes a processing circuit for executing the processing from step ST1b to step ST6b described later with reference to FIG. 11. The processing circuit may be the processing circuit 103, which is the dedicated hardware shown in FIG. 2A, or may be the processor 104 that executes a program stored in the memory 105 shown in FIG. 2B.
 Next, the operation will be described.
 FIG. 11 is a flowchart showing the acoustic signal separation method according to Embodiment 3. In FIG. 11, the processing in steps ST1b, ST2b, and ST4b to ST6b is the same as that in steps ST1a, ST2a, and ST3a to ST5a shown in FIG. 9, and its description is therefore omitted.
 In step ST3b, the parameter switching unit 8 refers to the classification condition data e, selects the network parameters 3b corresponding to the sound source to be processed from among the various network parameters 3b learned in advance, and sets the selected network parameters 3b in the DNN 3a.
 In the classification condition data e, a classification condition including information on at least one of the number of components of the acoustic signal and the types of the components of the acoustic signal is set. The information on the types of the components of the acoustic signal contains information that can identify the sound source, such as the gender of a speaker, the type of language, and the type of sound. In addition to the types of the components of the acoustic signal, information on the features of the sound sources, such as the number of sound sources, the average output time of the acoustic signal from a sound source, and the physique of a person who is a sound source, may be set in the classification condition.
 The parameter switching unit 8 may select the network parameters 3b using a decision tree based on predetermined selection rules. The selection rules may be, for example, rules that specify the types of the components of the acoustic signal and the number of components, such as "two males" or "three or more females".
 The parameter switching unit 8 may also select the network parameters 3b using a neural network or a GMM that has been machine-learned to select the network parameters 3b corresponding to the sound source to be processed, with the classification condition data e as input.
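 A hedged sketch of the rule-based variant is given below: pre-learned parameter sets are kept per condition, and a simple rule over the classification condition data e (here represented as a dictionary with the number of sources and speaker gender) decides which set is loaded into the DNN 3a. The keys, rule, and file names are purely illustrative; the patent equally allows a decision tree or a learned selector.

```python
def select_network_parameters(condition_e, pretrained_parameters):
    """Pick a pre-learned parameter set 3b from a simple rule on the condition data e."""
    if condition_e.get("num_sources", 0) >= 3:
        key = "three_or_more_speakers"
    elif condition_e.get("gender") == "female":
        key = "female_speakers"
    else:
        key = "default"
    return pretrained_parameters[key]

pretrained = {"default": "params_default.npz",
              "female_speakers": "params_female.npz",
              "three_or_more_speakers": "params_many.npz"}
print(select_network_parameters({"num_sources": 2, "gender": "male"}, pretrained))
```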
 As described above, the acoustic signal separation device 1B according to Embodiment 3 includes the parameter switching unit 8 that switches the network parameters 3b set in the DNN 3a based on the classification condition data e. With this configuration, the data estimation unit 3 can estimate the classification data d corresponding to the sound source to be processed using the DNN 3a with the switched network parameters 3b. As a result, the acoustic signal separation device 1B can separate the acoustic signal with high accuracy.
Fourth Embodiment.
 FIG. 12 is a block diagram showing a configuration of an acoustic signal separation device 1C according to a fourth embodiment of the present invention. In FIG. 12, the same components as those in FIG. 1 and FIG. 7 are denoted by the same reference numerals, and their description is omitted. The acoustic signal separation device 1C includes a feature quantity extraction unit 2A, a data estimation unit 3A, a data classification unit 5A, a signal regeneration unit 6A, a data calculation unit 7A, and a section information acquisition unit 9, separates the acoustic signal included in the input signal a into an acoustic signal for each component, and outputs an output signal g containing the acoustic signal for each component.
 The feature quantity extraction unit 2A extracts the classification feature quantity b and the signal regeneration feature quantity c from the input signal a for each section on the basis of section information h. Here, a section is a segment of the input signal a, and the input signal a is divided into sections in accordance with changes in the acoustic signal. The section information h is information indicating the positions of the sections into which the input signal a is divided. For example, the section information h may be any information that can specify the position of a section in the input signal a, such as time information of the input signal a or a change value of a feature quantity of the input signal a.
 The data estimation unit 3A estimates the classification data d on the basis of the classification feature quantity b extracted from the input signal a for each section, by using the DNN 3a. In the DNN 3a, a network parameter 3b learned in advance so as to estimate the classification data d on the basis of the classification feature quantity b in units of sections is set. The DNN 3a in which the network parameter 3b is set estimates the classification data d for each section by hierarchically performing operations on the classification feature quantity b. For example, an RNN or a CNN may be used as the DNN 3a.
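 The embodiment does not fix the internal architecture beyond naming a DNN (for example an RNN or a CNN). The following is a minimal PyTorch sketch, assuming a bidirectional LSTM that maps each time-frequency bin of the classification feature quantity b to an embedding vector, in the spirit of the deep-clustering literature cited in the search report; all layer sizes and the embedding dimension are assumptions.

```python
import torch
import torch.nn as nn

class EmbeddingDNN(nn.Module):
    """Sketch: map classification features b (a spectrogram) to per-bin embeddings d."""
    def __init__(self, n_freq=129, emb_dim=20, hidden=300):
        super().__init__()
        self.blstm = nn.LSTM(n_freq, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_freq * emb_dim)
        self.emb_dim = emb_dim

    def forward(self, features):                       # features: (batch, time, n_freq)
        h, _ = self.blstm(features)
        emb = self.proj(h)                             # (batch, time, n_freq * emb_dim)
        emb = emb.view(features.size(0), features.size(1), -1, self.emb_dim)
        return torch.nn.functional.normalize(emb, dim=-1)  # unit-norm embedding per bin

# Example: 100 frames of a 129-bin spectrogram for one section
d = EmbeddingDNN()(torch.randn(1, 100, 129))
print(d.shape)                                         # torch.Size([1, 100, 129, 20])
```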
 Note that the feature quantity extraction unit 2A may perform a short-time Fourier transform on the acoustic signal included in the input signal a for each section to obtain the amplitude on the frequency axis, and calculate the feature quantity on the basis of the amplitude on the frequency axis. Data in which the feature quantities calculated from the acoustic signal in this way are arranged in time series may be used as the classification feature quantity b.
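 For illustration, a minimal sketch of such a short-time Fourier transform feature extraction for one section might look as follows; the use of log amplitude, the sampling rate, and the FFT length are assumptions.

```python
import numpy as np
from scipy.signal import stft

def extract_classification_features(section_signal, fs=16000, n_fft=256):
    """Sketch: STFT of one section of the input signal a, amplitude on the
    frequency axis, arranged in time series as the classification feature b."""
    _, _, spec = stft(section_signal, fs=fs, nperseg=n_fft)
    amplitude = np.abs(spec)                 # amplitude on the frequency axis
    return np.log(amplitude + 1e-8).T        # shape: (time_frames, freq_bins)

features = extract_classification_features(np.random.randn(16000))
print(features.shape)                        # (number of frames, 129)
```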
 The data classification unit 5A classifies the classification data d of each section for each component on the basis of the classification condition data e calculated by the data calculation unit 7A. The classification result information f, which is the classification data d classified for each section and for each component, is output to the signal regeneration unit 6A. A classification method such as the k-means method or a GMM may be used to classify the classification data d.
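 A minimal sketch of the k-means option is given below, assuming the classification data d is a set of embedding vectors (one per time-frequency bin) and that the number of components is taken from the classification condition data e.

```python
import numpy as np
from sklearn.cluster import KMeans

def classify_embeddings(classification_data, num_sources):
    """Sketch: cluster the per-bin classification data d into num_sources groups."""
    kmeans = KMeans(n_clusters=num_sources, n_init=10, random_state=0)
    return kmeans.fit_predict(classification_data)   # component label for every bin

# 1000 time-frequency bins with 20-dimensional embeddings, two assumed components
labels = classify_embeddings(np.random.randn(1000, 20), num_sources=2)
print(np.bincount(labels))                            # number of bins per component
```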
 The signal regeneration unit 6A regenerates the acoustic signal for each component on the basis of the classification data d for each section and for each component in the classification result information f, the section information h, and the signal regeneration feature quantity c of the input signal a for each section. The signal regeneration unit 6A outputs the output signal g, which is the acoustic signal for each component. Note that the output signal g may include an image signal and text information corresponding to the acoustic signal for each component.
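 One common way to realize such regeneration, sketched below only as an assumption (the embodiment does not prescribe the method), is to build a binary mask per component from the classification result information f and apply an inverse short-time Fourier transform to the masked mixture spectrogram.

```python
import numpy as np
from scipy.signal import stft, istft

def regenerate_components(section_signal, labels, num_sources, fs=16000, n_fft=256):
    """Sketch: binary-mask resynthesis of the per-component acoustic signals."""
    _, _, spec = stft(section_signal, fs=fs, nperseg=n_fft)   # complex mixture spectrogram
    freq_bins, frames = spec.shape
    label_map = labels.reshape(frames, freq_bins).T           # one component label per bin
    outputs = []
    for k in range(num_sources):
        mask = (label_map == k).astype(float)
        _, signal_k = istft(spec * mask, fs=fs, nperseg=n_fft)
        outputs.append(signal_k)                               # acoustic signal of component k
    return outputs

# Example with random labels for two components (shape illustration only)
sig = np.random.randn(16000)
n_bins = stft(sig, fs=16000, nperseg=256)[2].size
parts = regenerate_components(sig, np.random.randint(0, 2, n_bins), num_sources=2)
print(len(parts), parts[0].shape)
```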
 The data calculation unit 7A calculates the classification condition data e for each section on the basis of the section information h input from the section information acquisition unit 9 and the input signal a for each section. In addition to the acoustic signal, the input signal a for each section includes sensor information used to specify the number of components of the acoustic signal and the type of the components of the acoustic signal corresponding to the section. The sensor information includes, for example, biological information of a person who may be a sound source, image information in which a person who may be a sound source is photographed, and physical information such as vibration or a temperature change caused by a person's speech.
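 As one conceivable illustration, the number of components for a section might be derived from per-person vibration sensor channels; the threshold, the channel layout, and the function below are assumptions and are not part of the embodiment.

```python
import numpy as np

def estimate_condition_from_sensors(vibration_channels, threshold=0.1):
    """Sketch: count active persons from per-person vibration channels of one section."""
    energy = np.mean(vibration_channels ** 2, axis=1)    # mean energy per person
    num_active = int(np.sum(energy > threshold))
    return {"num_sources": max(num_active, 1)}           # part of classification condition data e

# Three sensor channels for one section; two carry speech-induced vibration
sensors = np.vstack([np.random.randn(2, 1600), np.zeros((1, 1600))])
print(estimate_condition_from_sensors(sensors))          # {'num_sources': 2}
```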
 The section information acquisition unit 9 acquires the section information h and outputs the section information h to each of the feature quantity extraction unit 2A, the signal regeneration unit 6A, and the data calculation unit 7A. For example, the section information acquisition unit 9 may acquire section information h created by an external device, or may acquire section information h input by a user of the acoustic signal separation device 1C.
 Each of the functions of the feature quantity extraction unit 2A, the data estimation unit 3A, the data classification unit 5A, the signal regeneration unit 6A, the data calculation unit 7A, and the section information acquisition unit 9 in the acoustic signal separation device 1C is realized by a processing circuit. That is, the acoustic signal separation device 1C includes a processing circuit for executing the processing from step ST1c to step ST5c described later with reference to FIG. 13. The processing circuit may be the processing circuit 103 that is the dedicated hardware shown in FIG. 2A, or may be the processor 104 that executes a program stored in the memory 105 shown in FIG. 2B.
 Next, the operation will be described.
 FIG. 13 is a flowchart showing an acoustic signal separation method according to the fourth embodiment.
 First, the feature quantity extraction unit 2A extracts the classification feature quantity b and the signal regeneration feature quantity c from the input signal a divided into sections, on the basis of the section information h input from the section information acquisition unit 9 (step ST1c). The feature quantity extraction unit 2A outputs the classification feature quantity b extracted from the input signal a for each section to the data estimation unit 3A, and outputs the signal regeneration feature quantity c extracted from the input signal a for each section to the signal regeneration unit 6A.
 Subsequently, the data estimation unit 3A estimates, by using the DNN 3a, the classification data d for each section of the input signal a from the classification feature quantity b extracted from the input signal a for each section (step ST2c). The data estimation unit 3A outputs the classification data d for each section to the data classification unit 5A.
 The data classification unit 5A classifies the classification data d estimated for each section of the input signal a for each component, on the basis of the classification condition data e calculated by the data calculation unit 7A (step ST3c). The classification result information f, which is the classification data d classified for each component, is output from the data classification unit 5A to the signal regeneration unit 6A.
 The signal regeneration unit 6A regenerates the acoustic signal for each component on the basis of the classification data d classified for each component in the classification result information f, the section information h, and the signal regeneration feature quantity c of the input signal a for each section (step ST4c).
 Next, the feature quantity extraction unit 2A checks whether an unprocessed section remains in the input signal a for each section (step ST5c). If an unprocessed section remains (step ST5c; YES), the process returns to step ST1c, and the above-described series of processes is performed on the input signal a of the remaining sections.
 If no unprocessed section remains (step ST5c; NO), the acoustic signal separation device 1C ends the processing of FIG. 13.
 As described above, in the acoustic signal separation device 1C according to the fourth embodiment, the feature quantity extraction unit 2A extracts the feature quantities from the input signal a for each section on the basis of the section information h. The data estimation unit 3A estimates the classification data d for each section on the basis of the classification feature quantity b by using the DNN 3a. The data classification unit 5A classifies the classification data d of each section for each component on the basis of the classification condition data e. The signal regeneration unit 6A regenerates the acoustic signal for each component for each section of the input signal a, on the basis of the classification data d classified by the data classification unit 5A, the section information h, and the signal regeneration feature quantity c of the input signal a for each section. As a result, even when the number of components of the acoustic signal or the type of the components of the acoustic signal differs from section to section, the acoustic signal separation device 1C can separate the acoustic signal with high accuracy.
 The present invention is not limited to the above embodiments, and within the scope of the present invention, any free combination of the embodiments, any modification of any component of the embodiments, or omission of any component in each of the embodiments is possible.
 Since the acoustic signal separation device according to the present invention can accurately separate an acoustic signal for each component, it can be used, for example, in a conference system in which a plurality of sound sources exist.
 1, 1A, 1B, 1C: acoustic signal separation device; 2, 2A: feature quantity extraction unit; 3, 3A: data estimation unit; 3a: DNN; 3b: network parameter; 4: data acquisition unit; 5, 5A: data classification unit; 6, 6A: signal regeneration unit; 7, 7A: data calculation unit; 8: parameter switching unit; 9: section information acquisition unit; 100: acoustic interface; 101: image interface; 102: text input interface; 103: processing circuit; 104: processor; 105: memory; 106: sensor interface.

Claims (10)

  1.  An acoustic signal separation device comprising:
     a feature quantity extraction unit that extracts a feature quantity from an input signal including an acoustic signal in which one or more components are mixed;
     a data estimation unit that estimates first data, which associates components of acoustic signals output from the same sound source with one another, on the basis of the feature quantity extracted by the feature quantity extraction unit, by using a deep neural network whose network parameters have been learned so as to estimate the first data;
     a data classification unit that classifies the first data estimated by the data estimation unit for each component on the basis of second data in which a classification condition including information on at least one of a number of components of the acoustic signal and a type of the components of the acoustic signal is set; and
     a signal regeneration unit that regenerates an acoustic signal for each component on the basis of the first data classified for each component by the data classification unit and the feature quantity extracted by the feature quantity extraction unit.
  2.  The acoustic signal separation device according to claim 1, wherein
     the feature quantity extraction unit extracts a feature quantity from the input signal for each section on the basis of section information indicating positions of sections into which the input signal is divided,
     the data estimation unit estimates the first data for each section on the basis of the feature quantity extracted by the feature quantity extraction unit, by using the deep neural network,
     the data classification unit classifies the first data of each section for each component on the basis of the second data, and
     the signal regeneration unit regenerates an acoustic signal for each component on the basis of the first data classified for each component by the data classification unit, the section information, and the feature quantity of the input signal for each section extracted by the feature quantity extraction unit.
  3.  The acoustic signal separation device according to claim 1 or 2, further comprising a data calculation unit that calculates the second data from the input signal.
  4.  The acoustic signal separation device according to any one of claims 1 to 3, further comprising a parameter switching unit that switches the network parameters set in the deep neural network on the basis of the second data.
  5.  The acoustic signal separation device according to any one of claims 1 to 3, wherein the classification condition set in the second data is at least one of a lower limit value, an upper limit value, and a range of the number of components of the acoustic signal.
  6.  The acoustic signal separation device according to any one of claims 1 to 3, wherein the classification condition set in the second data is a sound source number sequence indicating a time-series change in the number of sound sources.
  7.  The acoustic signal separation device according to claim 6, wherein the data classification unit smooths the time-series change of the sound source number sequence.
  8.  The acoustic signal separation device according to any one of claims 1 to 3, wherein the classification condition set in the second data is feature information indicating a feature of a sound source.
  9.  The acoustic signal separation device according to any one of claims 1 to 3, wherein the signal regeneration unit outputs the second data in association with the signal of each component.
  10.  An acoustic signal separation method comprising:
     a step in which a feature quantity extraction unit extracts a feature quantity from an input signal including an acoustic signal in which one or more components are mixed;
     a step in which a data estimation unit estimates first data, which associates components of acoustic signals output from the same sound source with one another, on the basis of the feature quantity extracted by the feature quantity extraction unit, by using a deep neural network whose network parameters have been learned so as to estimate the first data;
     a step in which a data classification unit classifies the first data estimated by the data estimation unit for each component on the basis of second data in which a classification condition including information on at least one of a number of components of the acoustic signal and a type of the components of the acoustic signal is set; and
     a step in which a signal regeneration unit regenerates an acoustic signal for each component on the basis of the first data classified for each component by the data classification unit and the feature quantity extracted by the feature quantity extraction unit.
PCT/JP2017/042222 2017-11-24 2017-11-24 Acoustic signal separation device and method for separating acoustic signal WO2019102585A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2017/042222 WO2019102585A1 (en) 2017-11-24 2017-11-24 Acoustic signal separation device and method for separating acoustic signal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2017/042222 WO2019102585A1 (en) 2017-11-24 2017-11-24 Acoustic signal separation device and method for separating acoustic signal

Publications (1)

Publication Number Publication Date
WO2019102585A1 true WO2019102585A1 (en) 2019-05-31

Family

ID=66631855

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2017/042222 WO2019102585A1 (en) 2017-11-24 2017-11-24 Acoustic signal separation device and method for separating acoustic signal

Country Status (1)

Country Link
WO (1) WO2019102585A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017007035A1 (en) * 2015-07-07 2017-01-12 Mitsubishi Electric Corporation Method for distinguishing one or more components of signal

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017007035A1 (en) * 2015-07-07 2017-01-12 Mitsubishi Electric Corporation Method for distinguishing one or more components of signal

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HERSHEY, JOHN R. ET AL.: "DEEP CLUSTERING: DISCRIMINATIVE EMBEDDINGS FOR SEGMENTATION AND SEPARATION", PROC. ICASSP 2016, 20 March 2016 (2016-03-20), pages 31 - 35, XP032900557, doi:10.1109/ICASSP.2016.7471631 *
HIGUCHI, TAKUYA ET AL.: "Deep clustering-based beamforming for separation with unknown number of sources", PROC. INTERSPEECH 2017, 20 August 2017 (2017-08-20) - 24 August 2017 (2017-08-24), pages 1183 - 1187, XP055618559 *

Similar Documents

Publication Publication Date Title
JP6596376B2 (en) Speaker identification method and speaker identification apparatus
CN107077860B (en) Method for converting a noisy audio signal into an enhanced audio signal
US9666183B2 (en) Deep neural net based filter prediction for audio event classification and extraction
US20110125496A1 (en) Speech recognition device, speech recognition method, and program
US9224392B2 (en) Audio signal processing apparatus and audio signal processing method
JP2021500616A (en) Object identification method and its computer equipment and computer equipment readable storage medium
US9478232B2 (en) Signal processing apparatus, signal processing method and computer program product for separating acoustic signals
KR20160024858A (en) Voice data recognition method, device and server for distinguishing regional accent
JP2007233239A (en) Method, system, and program for utterance event separation
JP7370014B2 (en) Sound collection device, sound collection method, and program
JP6725186B2 (en) Learning device, voice section detection device, and voice section detection method
US9460714B2 (en) Speech processing apparatus and method
WO2018051945A1 (en) Speech processing device, speech processing method, and recording medium
JP6553015B2 (en) Speaker attribute estimation system, learning device, estimation device, speaker attribute estimation method, and program
CN111863005A (en) Sound signal acquisition method and device, storage medium and electronic equipment
JP2019066339A (en) Diagnostic device, diagnostic method and diagnostic system each using sound
JP2016143042A (en) Noise removal system and noise removal program
Poorjam et al. A parametric approach for classification of distortions in pathological voices
JP2008039694A (en) Signal count estimation system and method
JP6449331B2 (en) Excitation signal formation method of glottal pulse model based on parametric speech synthesis system
US20150208167A1 (en) Sound processing apparatus and sound processing method
JP2013257418A (en) Information processing device, information processing method, and program
WO2019102585A1 (en) Acoustic signal separation device and method for separating acoustic signal
JP6404780B2 (en) Wiener filter design apparatus, sound enhancement apparatus, acoustic feature quantity selection apparatus, method and program thereof
JP5705190B2 (en) Acoustic signal enhancement apparatus, acoustic signal enhancement method, and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 17933114; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 17933114; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: JP)