WO2023276235A1 - Program, Information Processing Method, Recording Medium, and Information Processing Apparatus - Google Patents
- Publication number: WO2023276235A1 (PCT/JP2022/005007)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- neural network
- sound source
- encoder
- sub
- unit
Classifications
- G10L25/30 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
- G10L21/0272 — Voice signal separating
- G10L21/0308 — Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
Definitions
- the present disclosure relates to programs, information processing methods, recording media, and information processing apparatuses.
- Patent Literature 1 discloses a sound source separation technique using a DNN (Deep Neural Network).
- One of the purposes of the present disclosure is to provide a program, an information processing method, a recording medium, and an information processing apparatus that minimize the amount of computation while obtaining sound source separation performance above a certain level.
- the neural network unit generates sound source separation information for separating a predetermined sound source signal from a mixed sound signal containing a plurality of sound source signals
- the encoder of the neural network unit converts the feature amount extracted from the mixed sound signal
- Encoder processing results are input to each of a plurality of sub-neural network units possessed by the neural network unit
- the neural network unit generates sound source separation information for separating a predetermined sound source signal from a mixed sound signal containing a plurality of sound source signals
- the encoder of the neural network unit converts the feature amount extracted from the mixed sound signal
- Encoder processing results are input to each of a plurality of sub-neural network units possessed by the neural network unit,
- the neural network unit generates sound source separation information for separating a predetermined sound source signal from a mixed sound signal containing a plurality of sound source signals
- the encoder of the neural network unit converts the feature amount extracted from the mixed sound signal
- Encoder processing results are input to each of a plurality of sub-neural network units possessed by the neural network unit
- a recording medium recording a program for causing a computer to execute an information processing method in which a processing result of an encoder and each processing result of a plurality of sub-neural networks are input to a decoder of a neural network unit.
- a neural network unit that generates sound source separation information for separating a predetermined sound source signal from a mixed sound signal containing a plurality of sound source signals
- the neural network unit comprising: an encoder that transforms the feature quantity extracted from the mixed sound signal; and a plurality of sub-neural network units to which the processing result of the encoder is input;
- An information processing apparatus comprising: a decoder to which a processing result of an encoder and a processing result of each of a plurality of sub-neural networks are input.
- Each of the plurality of neural network units generates sound source separation information for separating different sound source signals from a mixed sound signal containing a plurality of sound source signals
- An encoder included in one neural network unit of the plurality of neural network units converts the feature amount extracted from the mixed sound signal; the program causes a computer to execute an information processing method in which the processing result of the encoder is input to each of the sub-neural network units provided in the plurality of neural network units.
- Each of the plurality of neural network units generates sound source separation information for separating different sound source signals from a mixed sound signal containing a plurality of sound source signals
- An encoder included in one neural network unit of the plurality of neural network units converts the feature amount extracted from the mixed sound signal; this is an information processing method in which the processing result of the encoder is input to each of the sub-neural network units provided in the plurality of neural network units.
- Each of the plurality of neural network units generates sound source separation information for separating different sound source signals from a mixed sound signal containing a plurality of sound source signals
- An encoder included in one neural network unit of the plurality of neural network units converts the feature amount extracted from the mixed sound signal
- a recording medium recording a program for causing a computer to execute an information processing method in which a processing result of an encoder is input to each of sub-neural network units provided in a plurality of neural network units.
- a plurality of neural network units that generate sound source separation information for separating a predetermined sound source signal from a mixed sound signal containing a plurality of sound source signals,
- Each neural network unit comprises: a sub-neural network unit; and a decoder to which the processing result of the sub-neural network unit is input;
- One neural network unit of the plurality of neural network units is equipped with an encoder that converts the features extracted from the mixed sound signal; an information processing apparatus in which the processing result of the encoder is input to each of the sub-neural network units provided in the plurality of neural network units.
- FIG. 1 is a block diagram that is referred to when describing technology related to the present disclosure.
- FIG. 2 is a block diagram that is referred to when describing technology related to the present disclosure.
- FIG. 3 is a block diagram that is referred to when describing technology related to the present disclosure.
- FIG. 4 is a block diagram showing a configuration example of the information processing apparatus according to the first embodiment.
- FIG. 5 is a flow chart showing the flow of processing performed by the information processing apparatus according to the first embodiment.
- FIG. 6 is a diagram that is referred to when describing the effects obtained by the first embodiment.
- FIG. 7 is a block diagram showing a configuration example of an information processing apparatus according to the second embodiment.
- FIG. 8 is a flow chart showing the flow of processing performed by the information processing apparatus according to the second embodiment.
- FIG. 9 is a block diagram showing a configuration example of an information processing apparatus according to the third embodiment.
- FIG. 10 is a diagram for explaining an example of the effect obtained in the embodiments.
- FIG. 1 is a block diagram showing a configuration example of an information processing device (information processing device 1A) according to technology related to the present disclosure.
- the information processing device 1A is a sound source separation device that separates a desired sound source signal from a mixed sound signal containing a plurality of sound source signals (for example, each instrumental sound that constitutes vocals and accompaniment sounds).
- the information processing device 1A is incorporated in a smart phone, a personal computer, or an in-vehicle device.
- the accompaniment sound signal is separated from a mixed sound signal stored on media such as a CD (Compact Disc) or a semiconductor memory, or from a mixed sound signal distributed via a network such as the Internet.
- the separated accompaniment sound signal is reproduced.
- the user sings along with the reproduction of the accompaniment sound signal.
- a text transcription process or the like may be performed using the sound source separation result of the information processing apparatus 1A.
- the sound source separation processing performed by the information processing device 1A may be performed as online (real time) processing or as offline (batch) processing.
- the information processing apparatus 1A generally includes a feature extraction unit 2, a DNN unit 3, a multiplication unit 4 (an example of a calculation unit), and a separated sound source signal generation unit 5.
- a mixed sound signal is input to the feature extraction unit 2.
- the separated sound source signal generator 5 outputs a sound source signal separated from the mixed sound signal (hereinafter also referred to as a separated sound source signal SA).
- the mixed sound signal is a signal obtained by mixing a plurality of sound source signals, and is a signal digitized by PCM (Pulse Code Modulation) or the like.
- the source of the mixed sound signal may be anything, such as a recording medium, a server device on the network, or the like.
- the feature quantity extraction unit 2 performs feature quantity extraction processing for extracting the feature quantity of the mixed sound signal.
- the feature amount extraction unit 2 cuts out data of the mixed sound signal for each fixed section (frame) of a predetermined length, and performs frequency conversion (for example, short-time Fourier transform) on each cut-out frame.
- a time-series signal of a frequency spectrum is obtained by such frequency conversion processing.
- the frame length is 2048 samples.
- the frequency transform length is also 2048, and the signal is transformed into a 1025-bin frequency spectrum covering the components at or below the Nyquist (aliasing) frequency. That is, the processing of the feature quantity extraction unit 2 yields a frequency spectrum, specifically a multidimensional vector (in this example, a 1025-dimensional vector), as an example of the feature quantity.
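The frame cutting and frequency conversion described above can be sketched as follows. The Hann window, hop size, and test signal are illustrative assumptions not specified in the document; only the frame length of 2048 and the resulting 1025-dimensional spectrum come from the text.

```python
import numpy as np

def extract_features(mixed, frame_len=2048, hop=512):
    """Cut the mixed sound signal into fixed-length frames and apply a
    short-time Fourier transform, as the feature quantity extraction
    unit 2 does. Window and hop size are illustrative assumptions."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(mixed) - frame_len) // hop
    spectra = []
    for i in range(n_frames):
        frame = mixed[i * hop : i * hop + frame_len] * window
        # rfft of a 2048-sample frame yields 2048/2 + 1 = 1025 bins,
        # i.e. the components at or below the Nyquist frequency.
        spectra.append(np.fft.rfft(frame))
    return np.array(spectra)  # shape: (n_frames, 1025)

signal = np.random.default_rng(0).standard_normal(16384)  # toy mixed signal
feats = extract_features(signal)
```

Each row of `feats` is one 1025-dimensional feature vector supplied to the DNN unit in the subsequent stage.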
- a processing result of the feature amount extraction unit 2 is supplied to the DNN unit 3 in the subsequent stage.
- the DNN unit 3 generates sound source separation information for separating a predetermined sound source signal from the mixed sound signal.
- The DNN unit 3 is a multi-layered model patterned after the neural networks of the human brain, designed by machine learning to generate sound source separation information.
- the DNN unit 3 includes an encoder 31 that transforms the feature amount extracted from the mixed sound signal by the feature amount extraction unit 2, a sub-neural network unit 32 that receives the processing result of the encoder 31, and a decoder 33 to which the processing result of the encoder 31 and the processing result of the sub-neural network unit 32 are input.
- the encoder 31 includes one or more affine transform units.
- the affine transformation unit performs processing represented by the following equation (1).
- y = f(Wx + b) ... (1)
- x is the input vector
- y is the output vector
- W is the weighting coefficient matrix
- b is the bias coefficient
- f is a nonlinear function
- the values of W and b are numerical values obtained by pre-learning using a large data set.
- As f, for example, a ReLU (Rectified Linear Unit) function or a Sigmoid function can be applied.
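A minimal sketch of the affine transformation of equation (1). The coefficient values here are random placeholders; in the document, W and b are obtained by pre-learning on a large data set.

```python
import numpy as np

def relu(v):
    """ReLU nonlinearity, one choice for f in equation (1)."""
    return np.maximum(0.0, v)

def affine_transform(x, W, b, f=relu):
    """Equation (1): y = f(Wx + b)."""
    return f(W @ x + b)

rng = np.random.default_rng(0)
# Placeholder coefficients; real values come from pre-learning.
W = rng.standard_normal((256, 1025)) * 0.01  # compresses 1025 -> 256 dims
b = np.zeros(256)
x = rng.standard_normal(1025)  # a 1025-dimensional feature vector
y = affine_transform(x, W, b)
```

The output shape (256 dimensions) matches the dimensionality compression the encoder 31 performs on the feature vector.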
- the encoder 31 includes a first affine transformation section 31A and a second affine transformation section 31B.
- the number of affine transform units included in the encoder 31 is appropriately set so as to secure a certain level of sound source separation performance.
- the encoder 31 converts the feature quantity, for example, by compressing the size of the feature quantity. More specifically, the encoder 31 compresses the dimensionality of the multidimensional vector.
- the sub-neural network unit 32 is a neural network that exists within the DNN unit 3.
- For the sub-neural network unit 32, an RNN (Recurrent Neural Network), for example, can be applied.
- In the case of batch processing, future processing results are also available.
- As the recurrent neural network, a neural network using a GRU (Gated Recurrent Unit) or LSTM (Long Short-Term Memory) as its algorithm can be applied.
- the sub-neural network unit 32 includes a first RNN unit 32A, a second RNN unit 32B, and a third RNN unit 32C.
- the number of RNN units included in the sub-neural network unit 32 is appropriately set so as to ensure a certain level of sound source separation performance.
- The parameters used by each RNN unit differ, and the parameters are stored in the ROM (Read Only Memory) and RAM (Random Access Memory) of each RNN unit (not shown). In the following description, when there is no particular need to distinguish between ROM and RAM, they are referred to as memory cells as appropriate. The processing result of the encoder 31 is processed sequentially by the first RNN unit 32A, the second RNN unit 32B, and the third RNN unit 32C.
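As an illustration of the kind of recurrent unit named above, here is one step of a standard GRU. The parameter dictionary stands in for the coefficients each RNN unit would read from its memory cells; the shapes and initial values are illustrative assumptions.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x, h, p):
    """One step of a GRU. `p` holds the per-unit parameters that would be
    read from the unit's memory cells; shapes here are illustrative."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])        # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])        # reset gate
    n = np.tanh(p["Wn"] @ x + p["Un"] @ (r * h) + p["bn"])  # candidate state
    return (1.0 - z) * h + z * n                            # new hidden state

rng = np.random.default_rng(0)
d = 256  # input/output size of each RNN unit in this configuration
p = {k: rng.standard_normal((d, d)) * 0.01
     for k in ("Wz", "Uz", "Wr", "Ur", "Wn", "Un")}
p.update({k: np.zeros(d) for k in ("bz", "br", "bn")})
h = gru_step(rng.standard_normal(d), np.zeros(d), p)
```

Stacking three such units, each with its own parameters, corresponds to the first through third RNN units processing the encoder output in sequence.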
- the decoder 33 generates sound source separation information based on the processing result of the encoder 31 and the processing result of the sub-neural network unit 32.
- the decoder 33 includes, for example, a third affine transform section 33A and a fourth affine transform section 33B.
- the third affine transformation unit 33A concatenates the processing result of the encoder 31, that is, the result that skips over the sub-neural network unit 32, with the output of the sub-neural network unit 32 (a so-called skip connection).
- the fourth affine transformation unit 33B performs the affine transformation indicated by the above-described formula (1) on the processing result of the third affine transformation unit 33A.
- the processing of the third and fourth affine transform units 33A and 33B restores the feature amount compressed by the encoder 31, thereby obtaining a mask, which is an example of sound source separation information.
- Mask information is output from the DNN unit 3 and supplied to the multiplication unit 4.
- the multiplication unit 4 multiplies the feature amount extracted by the feature amount extraction unit 2 by the mask supplied from the DNN unit 3.
- the separated sound source signal generation unit 5 performs processing (for example, short-time inverse Fourier transform) to return the calculation result of the multiplication unit 4 to a signal on the time axis. As a result, a desired sound source signal (a sound source signal to be separated and a signal on the time axis) is generated.
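The multiplication and time-axis reconstruction steps can be sketched as follows. This is a simplified inverse STFT with plain overlap-add (synthesis-window compensation is omitted); the hop size and toy inputs are illustrative assumptions.

```python
import numpy as np

def apply_mask_and_reconstruct(spectra, mask, frame_len=2048, hop=512):
    """Multiply each frame's frequency spectrum by the mask (multiplication
    unit 4), then return the result to a signal on the time axis via an
    inverse short-time Fourier transform with overlap-add (separated sound
    source signal generation unit 5). Hop size is an assumption."""
    masked = spectra * mask                      # element-wise, per frame
    n_frames = masked.shape[0]
    out = np.zeros(hop * (n_frames - 1) + frame_len)
    for i, spec in enumerate(masked):
        out[i * hop : i * hop + frame_len] += np.fft.irfft(spec, n=frame_len)
    return out

spectra = np.ones((4, 1025), dtype=complex)      # toy frequency spectra
mask = np.full(1025, 0.5)                        # toy 1025-dimensional mask
separated = apply_mask_and_reconstruct(spectra, mask)
```

The result is the separated sound source signal (a signal on the time axis), ready for reproduction or further processing.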
- the separated sound source signal SA generated by the separated sound source signal generator 5 is used for purposes corresponding to applications.
- FIG. 2 shows an example of the input/output size of each module that constitutes the DNN unit 3.
- a 1025-dimensional frequency spectrum is input to the first affine transformation unit 31A, which performs an affine transformation to output a 256-dimensional vector.
- the 256-dimensional vector output by the first affine transformation unit 31A is input to the second affine transformation unit 31B, which performs an affine transformation to output a 256-dimensional vector.
- the first affine transformation unit 31A and the second affine transformation unit 31B reduce the size (number of dimensions) of the multidimensional vector input to the sub-neural network unit 32 in this embodiment.
- By reducing the dimensionality in this way, the generalization performance of the DNN unit 3 can be improved.
- the first RNN section 32A, the second RNN section 32B, and the third RNN section 32C input 256-dimensional multidimensional vectors and output them with the same number of dimensions.
- the input of the third affine transformation unit 33A is a 512-dimensional vector obtained by connecting the outputs of the second affine transformation unit 31B and the third RNN unit 32C.
- the performance of the DNN unit 3 can be improved by concatenating in the vector from before the processing of the sub-neural network unit 32.
- the third affine transformation unit 33A receives a 512-dimensional vector as an input and performs affine transformation on the input to output a 256-dimensional vector.
- the fourth affine transformation unit 33B receives a 256-dimensional vector as an input and performs affine transformation on the input to output a 1025-dimensional vector.
- the 1025-dimensional vector corresponds to a mask, by which the multiplication unit 4 multiplies the frequency spectrum supplied from the feature amount extraction unit 2. It should be noted that the number of connected modules constituting the DNN unit 3 and the vector size of each input and output are examples, and the effective configuration differs depending on the data set.
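The dimension flow of FIG. 2 (1025 → 256 → 256 → RNN stack → 512 via skip connection → 256 → 1025) can be traced in a compact forward-pass sketch. The coefficients are random placeholders, and a tanh layer stands in for each GRU/LSTM unit to keep the sketch self-contained; in the described configuration these would be learned recurrent units.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda v: np.maximum(0.0, v)
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
mk = lambda n_in, n_out: (rng.standard_normal((n_out, n_in)) * 0.01,
                          np.zeros(n_out))

# Random placeholder coefficients; the document obtains them by pre-learning.
W1, b1 = mk(1025, 256)   # first affine transformation unit 31A: 1025 -> 256
W2, b2 = mk(256, 256)    # second affine transformation unit 31B: 256 -> 256
rnn_params = [mk(256, 256) for _ in range(3)]  # stand-ins for units 32A-32C
W3, b3 = mk(512, 256)    # third affine transformation unit 33A: 512 -> 256
W4, b4 = mk(256, 1025)   # fourth affine transformation unit 33B: 256 -> 1025

def dnn_forward(spectrum):
    e = relu(W2 @ relu(W1 @ spectrum + b1) + b2)   # encoder output, 256-dim
    h = e
    for W, b in rnn_params:        # a GRU/LSTM would be used in practice;
        h = np.tanh(W @ h + b)     # a tanh layer keeps this sketch short
    cat = np.concatenate([e, h])   # skip connection: 256 + 256 = 512 dims
    return sigmoid(W4 @ relu(W3 @ cat + b3) + b4)  # 1025-dim mask

mask = dnn_forward(rng.standard_normal(1025))
```

The sigmoid at the output is an assumption that keeps the mask values in (0, 1), a common choice for spectral masks.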
- FIG. 3 is a block diagram showing a configuration example of another information processing device (information processing device 1B). While the information processing device 1A is configured to separate one sound source signal from the mixed sound signal, the information processing device 1B separates two sound source signals from the mixed sound signal. For example, the information processing device 1B separates the separated sound source signal SA and the separated sound source signal SB from the mixed sound signal.
- the information processing device 1B includes a DNN unit 6, a multiplication unit 7, and a separated sound source signal generation unit 8 in addition to the configuration of the information processing device 1A.
- the DNN section 6 includes an encoder 61 , a sub-neural network section 62 and a decoder 63 .
- the encoder 61 has a first affine transformation section 61A and a second affine transformation section 61B.
- the sub-neural network unit 62 includes a first RNN unit 62A, a second RNN unit 62B, and a third RNN unit 62C.
- the decoder 63 includes a third affine transform section 63A and a fourth affine transform section 63B.
- the operation flow of the DNN unit 6 is roughly the same as that of the DNN unit 3. That is, the DNN unit 6 performs the same processing as the DNN unit 3 on the feature amount of the mixed sound signal extracted by the feature amount extraction unit 2. Thereby, a mask for obtaining the separated sound source signal SB is generated.
- the multiplication unit 7 multiplies the feature amount of the mixed sound signal by the mask.
- the separated sound source signal generation unit 8 converts the multiplication result into a signal on the time axis, thereby generating the separated sound source signal SB.
- the learning of the DNN unit 3 and the DNN unit 6 is performed separately. That is, even if the arrangement of modules in each DNN unit is the same, the values of the weighting coefficients and bias coefficients in the affine transformation units and the coefficients used in the RNN units differ. Thus, when the number of sound source signals to be separated increases N-fold, the number of product-sum operations required by the DNN units and the amount of memory cells used also increase N-fold. The present disclosure, made in view of such points, will be described in detail with reference to embodiments.
- FIG. 4 is a block diagram showing a configuration example of an information processing apparatus (information processing apparatus 100) according to the first embodiment.
- the configurations similar to those of the information processing apparatus 1A or the information processing apparatus 1B are denoted by the same reference numerals, and overlapping descriptions are omitted as appropriate.
- the matters described for the information processing apparatuses 1A and 1B can be applied to each embodiment.
- the information processing device 100 includes a DNN unit 11 in place of the DNN unit 3.
- the DNN unit 11 generates a mask for separating and outputting a predetermined sound source signal (eg, separated sound source signal SA) from the mixed sound signal.
- the DNN unit 11 includes the encoder 31 and decoder 33 described above.
- the DNN unit 11 also includes a plurality of sub-neural network units, specifically, two sub-neural network units (sub-neural network units 12 and 13) arranged in parallel.
- the sub-neural network unit 12 includes a first RNN unit 12A, a second RNN unit 12B, and a third RNN unit 12C.
- the sub-neural network unit 13 also includes a first RNN unit 13A, a second RNN unit 13B, and a third RNN unit 13C. Each sub-neural network unit performs RNN-based processing on the input to itself.
- the output of the encoder 31 is divided.
- Since a 256-dimensional vector is output from the encoder 31 (see FIG. 2), the vector is divided in two along its dimensions to generate a 128-dimensional first vector and a 128-dimensional second vector.
- Such processing is performed by the encoder 31, for example.
- for example, the first vector is input to the sub-neural network unit 12, and the second vector is input to the sub-neural network unit 13.
- the sub-neural network unit 12 outputs a 128-dimensional vector by performing processing using RNN on the first vector.
- the sub-neural network unit 13 outputs a 128-dimensional vector by performing processing using RNN on the second vector.
- the third affine transformation unit 33A of the decoder 33 concatenates the 128-dimensional vector output from the sub-neural network unit 12, the 128-dimensional vector output from the sub-neural network unit 13, and the 256-dimensional vector output from the encoder 31, and performs an affine transformation on the concatenated 512-dimensional vector. Since the other processing is the same as that of the information processing apparatus 1A, redundant description is omitted.
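The splitting and re-concatenation central to the first embodiment can be sketched as follows; the sub-network bodies are toy placeholders, but the dimension arithmetic (256 split into 128 + 128, then 256 + 128 + 128 = 512 for the decoder) follows the text.

```python
import numpy as np

def grouped_subnetworks(encoder_out, subnet_a, subnet_b):
    """First embodiment: the 256-dimensional encoder output is divided into
    a 128-dimensional first vector and a 128-dimensional second vector; each
    is processed by its own sub-neural network unit, and the results are
    concatenated with the undivided encoder output for the decoder."""
    first, second = np.split(encoder_out, 2)  # two 128-dimensional halves
    return np.concatenate([encoder_out, subnet_a(first), subnet_b(second)])

subnet = lambda v: np.tanh(v)  # placeholder 128 -> 128 sub-network
out = grouped_subnetworks(np.random.default_rng(0).standard_normal(256),
                          subnet, subnet)
```

Because each sub-network operates on a 128-dimensional vector rather than the full 256, its weight matrices (and hence the coefficient count) shrink roughly by a factor of four per unit, which is the source of the savings reported for patterns PC and PD.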
- In step ST1, each module constituting the DNN unit 11 reads coefficients stored in a ROM (not shown) or the like. Then, the process proceeds to step ST2.
- In step ST2, the mixed sound signal is input to the information processing device 100. Then, the process proceeds to step ST3.
- In step ST3, the feature amount extraction unit 2 extracts a feature vector from the mixed sound signal. For example, a 1025-dimensional feature vector is input to the encoder 31 of the DNN unit 11. Then, the process proceeds to step ST4.
- In step ST4, encoding processing is performed by the encoder 31, specifically by the first affine transformation unit 31A and the second affine transformation unit 31B.
- For example, a 256-dimensional vector is output from the second affine transformation unit 31B. Then, the process proceeds to step ST5.
- In step ST5, two 128-dimensional vectors (the first and second vectors) are generated by equally dividing the 256-dimensional vector in two.
- the first vector is input to sub-neural network section 12 and the second vector is input to sub-neural network section 13 .
- the process of step ST5 may be included in the encoding process of step ST4. Then, the process proceeds to steps ST6 and ST7.
- In step ST6, processing is performed by the sub-neural network unit 12 using the first vector. Further, in step ST7, processing is performed by the sub-neural network unit 13 using the second vector. Note that the processes of steps ST6 and ST7 may be performed in parallel or sequentially. Then, the process proceeds to step ST8.
- In step ST8, a process of concatenating vectors is performed. This processing is performed by the decoder 33, for example.
- the third affine transformation unit 33A concatenates the 256-dimensional vector output from the second affine transformation unit 31B, the 128-dimensional vector output from the sub-neural network unit 12, and the 128-dimensional vector output from the sub-neural network unit 13, generating a 512-dimensional vector. Then, the process proceeds to step ST9.
- In step ST9, decoding processing is performed by the third affine transformation unit 33A and the fourth affine transformation unit 33B of the decoder 33.
- a mask represented by a 1025-dimensional vector is output from the fourth affine transformation unit 33B. Note that the process of step ST8 described above may be included in the decoding process of step ST9. Then, the process proceeds to step ST10.
- In step ST10, multiplication processing is performed. Specifically, the multiplication unit 4 multiplies the vector output from the feature quantity extraction unit 2 by the mask obtained by the DNN unit 11. Then, the process proceeds to step ST11.
- In step ST11, separated sound source signal generation processing is performed. Specifically, the separated sound source signal generation unit 5 converts the frequency spectrum obtained by the calculation of the multiplication unit 4 into a signal on the time axis. Then, the process proceeds to step ST12.
- In step ST12, it is determined whether or not the mixed sound signal is still being input. Such determination is made by, for example, a CPU (Central Processing Unit) (not shown) that controls the operation of the information processing apparatus 100. If no mixed sound signal is input (No), the process ends. If the input of the mixed sound signal continues (Yes), the process returns to step ST2, and the above-described processes are repeated.
- FIG. 6 is a graph showing the relationship between the number of coefficients that the DNN section has and the sound source separation performance.
- the horizontal axis of the graph (Number of Weights) is the number of coefficients in the DNN unit (affine transformation units and sub-neural network units), a value roughly proportional to the number of operations and the memory cell capacity required for DNN processing.
- the vertical axis of the graph indicates SDR (Signal to Distortion Ratio) [dB].
- the SDR is an index representing the accuracy with which a target sound source is separated; the larger the value, the higher the separation performance. Therefore, in the graph shown in FIG. 6, the further to the upper left data is plotted, the fewer computing resources are used and the higher the sound source separation performance.
- the pattern PA in FIG. 6 corresponds to the case where the general configuration (configuration shown in FIG. 1) is used and the input/output vector size of the sub-neural network unit is set to 256 dimensions (1 Grouped-GRU [256]).
- Pattern PB in FIG. 6 corresponds to the case where the general configuration (configuration shown in FIG. 1) is used and the input/output vector size of the sub-neural network unit is 84 dimensions (1 Grouped-GRU [84]).
- the pattern PC in FIG. 6 corresponds to the case where two sub-neural network units are used, as in the configuration according to the present embodiment, and the input/output vector size is evenly divided into 128 dimensions each (2 Grouped-GRU [128,128]).
- the pattern PD in FIG. 6 corresponds to the case where four sub-neural network units are used and the input/output vector sizes are unequally divided (4 Grouped-GRU [128,64,32,32]).
- In the configuration corresponding to the pattern PA, the number of coefficients was approximately 2,000,000, and the SDR was approximately 12.4. Although the sound source separation performance is high, the large number of coefficients increases the number of calculations.
- In the configuration corresponding to the pattern PB, the configuration of the DNN unit is the same as in the case of the pattern PA, but the vector size is reduced; the number of coefficients falls below about 500,000, and the number of operations can be reduced accordingly.
- However, the SDR for the pattern PB was approximately 11.9, and the sound source separation performance was lower than that for the pattern PA. Therefore, simply reducing the number of coefficients degrades the sound source separation performance.
- In the configuration corresponding to the pattern PC, the number of coefficients was about 1,500,000, making it possible to reduce the number of coefficients and the number of calculations compared to the pattern PA. Furthermore, the SDR for the pattern PC was about 12.5, and sound source separation performance higher than that of the pattern PA of the general configuration was obtained. Also, in the case of the configuration and vector sizes corresponding to the pattern PD, the number of coefficients could be reduced (to a little less than 1,500,000) compared to the pattern PA, and a result exceeding its SDR was obtained.
- For the pattern PD, the number of coefficients can also be reduced compared to the pattern PC, while the SDR is almost the same as that of the pattern PC.
- In FIG. 6, both the patterns PC and PD are located to the upper left of the line connecting the patterns PA and PB, and it was confirmed that high sound source separation performance is obtained while reducing the number of calculations compared to the conventional method.
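The comparison above is expressed in terms of SDR (signal-to-distortion ratio). As a reference point only, a minimal sketch of how an SDR value such as 12.4 can be computed from a reference source and its estimate; the signals here are hypothetical random data, not the patent's test material:

```python
import numpy as np

def sdr_db(reference: np.ndarray, estimate: np.ndarray) -> float:
    # SDR in dB: power of the reference signal over power of the estimation error.
    error = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(error ** 2))

rng = np.random.default_rng(0)
target = rng.standard_normal(16000)                    # 1 s of a hypothetical source at 16 kHz
estimate = target + 0.1 * rng.standard_normal(16000)   # estimate with a small residual
print(round(sdr_db(target, estimate), 1))              # close to 20 dB for a 10% residual
```

A higher SDR means less residual distortion in the separated signal; FIG. 6 plots this value against the number of coefficients.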
- Thus, it was confirmed that the information processing apparatus according to the present embodiment can reduce the number of operations compared to an information processing apparatus according to the general configuration without degrading, and indeed while improving, the sound source separation performance.
- It was also confirmed that the number of sub-neural network units is not limited to two, and that good results are obtained even when the vector sizes input to the respective sub-neural network units differ (that is, when the vector is divided unevenly).
- FIG. 7 is a block diagram showing a configuration example of an information processing device (information processing device 200) according to the second embodiment. Note that in FIG. 7, the configuration of the DNN unit 3 is simplified as appropriate due to space limitations in the drawing.
- The information processing apparatus 200 shares the encoder in a configuration corresponding to the case where there are a plurality of sound sources to be separated (for example, the configuration of the information processing apparatus 1B shown in FIG. 3).
- In the configuration shown in FIG. 3, the encoders 31 and 61 are separate, but the content of their processing is the same: compressing the vector size (the number of dimensions in this example) of the feature vector extracted from the mixed sound signal. Therefore, as shown in FIG. 7, in the information processing apparatus 200 the encoder is shared among a plurality of DNN units (for example, the DNN units 3 and 6). This makes it possible to reduce the calculation load in the information processing apparatus 200.
- Specifically, the output of the encoder 31 is input to the sub-neural network unit 32 and decoder 33 of the DNN unit 3, and to the sub-neural network unit 62 and decoder 63 of the DNN unit 6. Since the other processes are basically the same as those of the information processing apparatus 1B, redundant description is omitted.
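The shared-encoder fan-out described above can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the layer sizes (1025 to 512 to 256) and random weights are assumptions, with 512 chosen as a hypothetical intermediate width:

```python
import numpy as np

rng = np.random.default_rng(1)

def affine(x: np.ndarray, w: np.ndarray, b: np.ndarray) -> np.ndarray:
    # y = f(Wx + b) with ReLU as the nonlinearity f
    return np.maximum(0.0, w @ x + b)

# Shared encoder: compresses a 1025-dim feature vector to 256 dims
# via two affine transformation sections (dimensions are illustrative).
w1, b1 = rng.standard_normal((512, 1025)) * 0.01, np.zeros(512)
w2, b2 = rng.standard_normal((256, 512)) * 0.01, np.zeros(256)

feature = rng.standard_normal(1025)          # feature vector from the mixed sound signal
code = affine(affine(feature, w1, b1), w2, b2)

# The single encoder output fans out to both DNN sections.
branch_a_in = code   # to sub-neural network unit 32 and decoder 33
branch_b_in = code   # to sub-neural network unit 62 and decoder 63
print(code.shape)    # (256,)
```

Because the encoder is run once and its output is reused by both branches, its coefficients and its forward pass are not duplicated per sound source.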
- In step ST21, each module constituting the DNN unit 3 reads coefficients stored in a ROM (not shown) or the like. Then, the process proceeds to step ST22.
- In step ST22, the mixed sound signal is input to the information processing device 200. Then, the process proceeds to step ST23.
- In step ST23, the feature quantity extraction unit 2 extracts a feature vector from the mixed sound signal. For example, a 1025-dimensional feature vector is input to the encoder 31 of the DNN unit 3. Then, the process proceeds to step ST24.
- In step ST24, encoding processing is performed by the encoder 31, specifically by the first affine transformation section 31A and the second affine transformation section 31B.
- The second affine transformation unit 31B outputs a vector compressed to, for example, 256 dimensions.
- This vector is input to the sub-neural network unit 32 and decoder 33 of the DNN unit 3, and to the sub-neural network unit 62 and decoder 63 of the DNN unit 6. Then, the process proceeds to steps ST25 and ST29.
- The processing related to steps ST25 to ST28 includes the processing performed by the sub-neural network unit 32, the decoding processing performed by the decoder 33, the multiplication processing performed by the multiplication unit 4, and the separated sound source signal generation processing performed by the separated sound source signal generation unit 5.
- A separated sound source signal SA is generated in this separated sound source signal generation processing.
- The processing related to steps ST29 to ST32 includes the processing performed by the sub-neural network unit 62, the decoding processing performed by the decoder 63, the multiplication processing performed by the multiplication unit 7, and the separated sound source signal generation processing performed by the separated sound source signal generation unit 8.
- A separated sound source signal SB is generated in this separated sound source signal generation processing. Since the contents of each process have already been explained, redundant explanations are omitted as appropriate.
- After the processing related to steps ST28 and ST32, the processing related to step ST33 is performed.
- In step ST33, it is determined whether or not the mixed sound signal is still being input. Such a determination is made by, for example, a CPU (not shown) that controls the operation of the information processing apparatus 200. If no mixed sound signal is input (No), the process ends. If the input of the mixed sound signal continues (Yes), the process returns to step ST22, and the above-described processes are repeated.
- Note that, similarly to the encoder, the decoder 33 and the decoder 63 may be shared.
- However, the input to each of the decoders 33 and 63 passes through a sub-neural network unit whose coefficients are optimized for the respective sound source signal to be separated. From the viewpoint of not lowering the sound source separation performance, it is therefore preferable that the coefficients of each decoder are also optimized for the sound source signal to be separated, and thus that the decoder 33 and the decoder 63 are provided separately for each sound source signal to be separated.
- The third embodiment is, roughly speaking, a configuration that combines the first and second embodiments.
- FIG. 9 is a block diagram showing a configuration example of an information processing device (information processing device 300) according to the third embodiment.
- In the information processing apparatus 300, the DNN unit 11 described in the first embodiment is used in place of the DNN unit 3 of the information processing apparatus 200 described above.
- Furthermore, a DNN unit 6A is used in place of the DNN unit 6 of the information processing apparatus 200 described above.
- the DNN unit 6A differs from the DNN unit 6 in the configuration of the sub-neural network unit. In other words, the DNN unit 6A has a plurality of sub-neural network units as in the first embodiment.
- the DNN section 6A includes, for example, a sub-neural network section 65 and a sub-neural network section 66.
- the sub-neural network unit 65 includes a first RNN unit 65A, a second RNN unit 65B, and a third RNN unit 65C.
- the sub-neural network unit 66 also includes a first RNN unit 66A, a second RNN unit 66B, and a third RNN unit 66C.
- Like the DNN unit 6, the DNN unit 6A includes a decoder 63.
- The details of the processing performed by the information processing apparatus 300 have already been described in the first and second embodiments, so redundant description is omitted. Effects similar to those of the first and second embodiments are also obtained in the third embodiment.
- FIG. 10 shows specific numerical examples of the number of coefficients used in the DNN section in the first to third embodiments described above.
- The basic configurations were four: the general configuration (see FIG. 1), a configuration with multiple sub-neural network units (see FIG. 4), a configuration with a common encoder (see FIG. 7), and a configuration with both multiple sub-neural network units and a common encoder (see FIG. 9).
- the number of sound sources to be separated was set to 2 or 10, and the sub-neural network section was provided so as to correspond to the number of sound sources to be separated.
- In the general configuration, when the number of sound sources to be separated is 2, the number of coefficients used in the DNN section is approximately 4,000,000; when the number of sound sources to be separated is 10, it is approximately 20,000,000.
- In FIG. 10, the number of coefficients used in the DNN section in the other configurations is indicated both as a relative value, with the number of coefficients used in the DNN section of the general configuration set to 100%, and as an approximate number of coefficients.
- In each configuration, the GRU algorithm is adopted, and for the configurations having a plurality of sub-neural network units, the values are those obtained when the input/output vector size is divided evenly.
- In the configuration with multiple sub-neural network units, when the number of sound sources to be separated is 2, the number of coefficients used in the DNN unit is approximately 3,100,000 (about 76%), and when the number of sound sources to be separated is 10, it is approximately 15,400,000 (about 76%). That is, the number of coefficients, and therefore the number of operations, could be reduced relative to the general configuration.
- In the configuration with a common encoder, the number of coefficients used in the DNN section could likewise be reduced even as the number of sound sources increased: with two sound sources, approximately 3,600,000 (about 76%), and with ten sound sources, approximately 16,200,000 (about 80%).
- In the configuration with both multiple sub-neural network units and a common encoder, the number of coefficients used in the DNN unit could be reduced further: approximately 2,630,000 (about 65%) with two sound sources, and approximately 11,300,000 (about 56%) with ten sound sources.
- Note that the information processing apparatus 300 may be configured to include a filter unit 9 (post-filter) after the multiplication unit 4 and the multiplication unit 7.
- The filter unit 9 uses the plurality of separated sound source signals (two sound source signals in the example shown in FIG. 11) to separate a desired sound source signal with higher accuracy. For example, suppose that the multiplication unit 4 outputs a separated vocal signal and the multiplication unit 7 outputs a separated piano accompaniment signal.
- In this case, the filter unit 9 removes the residual component (noise component) of the piano accompaniment signal contained in the vocal signal while referring to the piano accompaniment signal, thereby separating the vocal signal (an example of the separated sound source signal SA) with higher accuracy.
- A known filter such as a single-channel Wiener filter can be used as the filter unit 9.
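As an illustration of the post-filter idea, a minimal single-channel Wiener-style gain can be computed per frequency bin from the power spectra of the two separated estimates. The spectra below are hypothetical three-bin values; a practical filter unit 9 would operate on full spectrograms:

```python
import numpy as np

def wiener_gain(target_power: np.ndarray, interferer_power: np.ndarray,
                eps: float = 1e-12) -> np.ndarray:
    # Per-bin Wiener gain: target power over total power.
    return target_power / (target_power + interferer_power + eps)

# Hypothetical power spectra of the two separated estimates (vocal and piano)
vocal_power = np.array([4.0, 1.0, 0.25])
piano_power = np.array([1.0, 1.0, 0.75])

gain = wiener_gain(vocal_power, piano_power)
refined_vocal_power = gain * vocal_power   # suppresses bins dominated by piano residue
print(np.round(gain, 2))                   # gains of 0.8, 0.5 and 0.25 per bin
```

Bins where the piano residue dominates receive a small gain, which suppresses the residual component referred to in the text.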
- the present disclosure can also adopt a configuration of cloud computing in which a single function is shared and jointly processed by multiple devices via a network.
- the feature quantity extraction unit may be provided in the server device, and the feature quantity extraction processing may be performed in the server device.
- The present disclosure can also be realized in any form, such as a device, a method, a program, a recording medium on which the program is recorded, or a system.
- For example, by downloading and installing a program that performs the control described in the embodiments, a device that does not have that control function can perform the control described in the embodiments.
- the present disclosure can also be implemented by a server that distributes such programs.
- the items described in each embodiment and modifications can be combined as appropriate.
- the contents of the present disclosure should not be construed as being limited by the effects exemplified herein.
- the present disclosure can also adopt the following configurations.
- the neural network unit generates sound source separation information for separating a predetermined sound source signal from a mixed sound signal containing a plurality of sound source signals,
- An encoder included in the neural network unit converts the feature amount extracted from the mixed sound signal, a processing result of the encoder is input to each of a plurality of sub-neural network units of the neural network unit;
- the sub-neural network unit is a recurrent neural network that, for the current input, uses processing results obtained at least one of temporally in the past and in the future.
- the recurrent neural network is a neural network that uses GRU (Gated Recurrent Unit) or LSTM (Long Short Term Memory) as an algorithm.
- the encoder performs the conversion by compressing the size of the feature amount.
- the feature amount and the size of the feature amount are defined by a multidimensional vector and the number of dimensions of the vector, and the encoder compresses the number of dimensions of the vector (the program according to (4)).
- the neural network unit generates sound source separation information for separating a predetermined sound source signal from a mixed sound signal containing a plurality of sound source signals, An encoder included in the neural network unit converts the feature amount extracted from the mixed sound signal, a processing result of the encoder is input to each of a plurality of sub-neural network units of the neural network unit; An information processing method, wherein a processing result of the encoder and each processing result of the plurality of sub-neural network units are input to a decoder included in the neural network unit.
- the neural network unit generates sound source separation information for separating a predetermined sound source signal from a mixed sound signal containing a plurality of sound source signals,
- An encoder included in the neural network unit converts the feature amount extracted from the mixed sound signal, a processing result of the encoder is input to each of a plurality of sub-neural network units of the neural network unit;
- a recording medium recording a program for causing a computer to execute an information processing method, in which a processing result of the encoder and a processing result of each of the plurality of sub-neural network units are input to a decoder of the neural network unit.
- a neural network unit that generates sound source separation information for separating a predetermined sound source signal from a mixed sound signal containing a plurality of sound source signals
- the neural network unit is an encoder that transforms the feature quantity extracted from the mixed sound signal; a plurality of sub-neural network units to which processing results of the encoder are input;
- An information processing apparatus comprising: a decoder to which a processing result of the encoder and a processing result of each of the plurality of sub-neural network units are input.
- Each of the plurality of neural network units generates sound source separation information for separating different sound source signals from a mixed sound signal containing a plurality of sound source signals
- An encoder included in one neural network unit among the plurality of neural network units converts the feature amount extracted from the mixed sound signal
- each said neural network unit comprises a plurality of said sub-neural network units, and the processing result of the encoder is input to each of the plurality of sub-neural network units (the program according to (17)).
- a calculation unit provided in each of the neural network units multiplies the feature amount of the mixed sound signal by the sound source separation information output from the decoder, and a filter unit separates the predetermined sound source signal based on the processing results of the plurality of calculation units (the program according to (17) or (18)).
- Each of the plurality of neural network units generates sound source separation information for separating different sound source signals from a mixed sound signal containing a plurality of sound source signals, An encoder included in one neural network unit among the plurality of neural network units converts the feature amount extracted from the mixed sound signal, The information processing method, wherein a processing result of the encoder is input to each of sub-neural network units included in the plurality of neural network units.
- Each of the plurality of neural network units generates sound source separation information for separating different sound source signals from a mixed sound signal containing a plurality of sound source signals
- An encoder included in one neural network unit among the plurality of neural network units converts the feature amount extracted from the mixed sound signal
- a recording medium recording a program for causing a computer to execute an information processing method in which a processing result of the encoder is input to each of sub-neural network units provided in the plurality of neural network units.
- a plurality of neural network units that generate sound source separation information for separating a predetermined sound source signal from a mixed sound signal containing a plurality of sound source signals,
- Each said neural network unit comprises a sub-neural network unit and a decoder to which the processing result of the sub-neural network is input; one neural network unit among the plurality of neural network units comprises an encoder that converts the feature amount extracted from the mixed sound signal; and an information processing apparatus is provided in which the processing result of the encoder is input to each of the sub-neural network units included in the plurality of neural network units.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Signal Processing (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Quality & Reliability (AREA)
- Molecular Biology (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Stereophonic System (AREA)
- Circuit For Audible Band Transducer (AREA)
Abstract
Description
A neural network unit generates sound source separation information for separating a predetermined sound source signal from a mixed sound signal containing a plurality of sound source signals,
an encoder included in the neural network unit converts a feature amount extracted from the mixed sound signal,
a processing result of the encoder is input to each of a plurality of sub-neural network units included in the neural network unit, and
the processing result of the encoder and the processing results of the plurality of sub-neural networks are input to a decoder included in the neural network unit.
This is a program that causes a computer to execute the above information processing method.
A neural network unit generates sound source separation information for separating a predetermined sound source signal from a mixed sound signal containing a plurality of sound source signals,
an encoder included in the neural network unit converts a feature amount extracted from the mixed sound signal,
a processing result of the encoder is input to each of a plurality of sub-neural network units included in the neural network unit, and
the processing result of the encoder and the processing results of the plurality of sub-neural networks are input to a decoder included in the neural network unit.
This is an information processing method.
A neural network unit generates sound source separation information for separating a predetermined sound source signal from a mixed sound signal containing a plurality of sound source signals,
an encoder included in the neural network unit converts a feature amount extracted from the mixed sound signal,
a processing result of the encoder is input to each of a plurality of sub-neural network units included in the neural network unit, and
the processing result of the encoder and the processing results of the plurality of sub-neural networks are input to a decoder included in the neural network unit.
This is a recording medium on which a program causing a computer to execute the above information processing method is recorded.
An information processing apparatus comprising a neural network unit that generates sound source separation information for separating a predetermined sound source signal from a mixed sound signal containing a plurality of sound source signals,
the neural network unit comprising:
an encoder that converts a feature amount extracted from the mixed sound signal,
a plurality of sub-neural network units to which the processing result of the encoder is input, and
a decoder to which the processing result of the encoder and the processing results of the plurality of sub-neural networks are input.
Each of a plurality of neural network units generates sound source separation information for separating a different sound source signal from a mixed sound signal containing a plurality of sound source signals,
an encoder included in one of the plurality of neural network units converts a feature amount extracted from the mixed sound signal, and
the processing result of the encoder is input to each of the sub-neural network units included in the plurality of neural network units.
This is a program that causes a computer to execute the above information processing method.
Each of a plurality of neural network units generates sound source separation information for separating a different sound source signal from a mixed sound signal containing a plurality of sound source signals,
an encoder included in one of the plurality of neural network units converts a feature amount extracted from the mixed sound signal, and
the processing result of the encoder is input to each of the sub-neural network units included in the plurality of neural network units.
This is an information processing method.
Each of a plurality of neural network units generates sound source separation information for separating a different sound source signal from a mixed sound signal containing a plurality of sound source signals,
an encoder included in one of the plurality of neural network units converts a feature amount extracted from the mixed sound signal, and
the processing result of the encoder is input to each of the sub-neural network units included in the plurality of neural network units.
This is a recording medium on which a program causing a computer to execute the above information processing method is recorded.
An information processing apparatus comprising a plurality of neural network units each of which generates sound source separation information for separating a predetermined sound source signal from a mixed sound signal containing a plurality of sound source signals,
each of the neural network units comprising:
a sub-neural network unit, and
a decoder to which the processing result of the sub-neural network is input,
wherein one of the plurality of neural network units comprises an encoder that converts a feature amount extracted from the mixed sound signal, and
the processing result of the encoder is input to each of the sub-neural network units included in the plurality of neural network units.
<Technology related to the present disclosure>
<First embodiment>
<Second embodiment>
<Third embodiment>
<Modifications>
The embodiments and the like described below are preferred specific examples of the present disclosure, and the content of the present disclosure is not limited to these embodiments and the like.
First, to facilitate understanding of the present disclosure, technology related to the present disclosure will be described. FIG. 1 is a block diagram showing a configuration example of an information processing apparatus (information processing apparatus 1A) according to technology related to the present disclosure. The information processing apparatus 1A is a sound source separation apparatus that separates a desired sound source signal from a mixed sound signal containing a plurality of sound source signals (for example, vocals and the individual instrument sounds constituting the accompaniment). Specifically, the information processing apparatus 1A is incorporated into a smartphone, a personal computer, or an in-vehicle device. For example, using the information processing apparatus 1A, an accompaniment signal is separated from a mixed sound signal stored on media such as a CD (Compact Disc) or semiconductor memory, or from a mixed sound signal distributed over a network such as the Internet. The separated accompaniment signal is then played back, and the user sings along with it. This allows the user to enjoy karaoke casually without preparing an accompaniment signal itself. Of course, the use of the information processing apparatus 1A is not limited to karaoke; text transcription processing or the like may be performed using the sound source separation results of the information processing apparatus 1A. Note that the sound source separation processing performed by the information processing apparatus 1A may be performed online (in real time) or offline (as batch processing).
y = f(Wx + b) ... (1)
Here, in equation (1), x is the input vector, y is the output vector, W is the weight coefficient matrix, b is the bias coefficient, and f is a nonlinear function.
The values of W and b are numerical values obtained by training in advance with a large dataset.
As the nonlinear function f, for example, a ReLU (Rectified Linear Unit) function or a sigmoid function can be applied.
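Equation (1) can be sketched minimally as follows, with ReLU as the nonlinearity f; the weight and bias values here are illustrative stand-ins for coefficients that would in practice be obtained by training:

```python
import numpy as np

def relu(v: np.ndarray) -> np.ndarray:
    return np.maximum(0.0, v)

def affine_layer(x: np.ndarray, W: np.ndarray, b: np.ndarray, f=relu) -> np.ndarray:
    # Equation (1): y = f(Wx + b)
    return f(W @ x + b)

W = np.array([[1.0, -1.0],
              [0.5,  2.0]])   # weight coefficients (learned in practice)
b = np.array([0.0, -1.0])     # bias coefficients
x = np.array([2.0, 1.0])      # input vector

print(affine_layer(x, W, b))  # [1. 2.]
```

With these values, Wx + b is [1, 2] and ReLU leaves both components unchanged; a negative pre-activation would instead be clipped to zero.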
[Configuration example of the information processing apparatus]
FIG. 4 is a block diagram showing a configuration example of an information processing apparatus (information processing apparatus 100) according to the first embodiment. Among the components of the information processing apparatus 100, those similar to the information processing apparatus 1A or 1B are given the same reference numerals, and redundant description is omitted as appropriate. Unless otherwise noted, the matters described for the information processing apparatuses 1A and 1B are applicable to each embodiment.
The flow of processing performed by the information processing apparatus 100 will be described with reference to the flowchart shown in FIG. 5.
An example of the effects obtained by the present embodiment described above will now be described.
Since the total size of the divided vectors is 128 + 128 = 256 dimensions, it is apparently the same as before the division. However, the amount of coefficients stored in the DNN 11 and the number of multiply-accumulate operations can be reduced. This is explained below with a specific example.
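The reduction can be checked by counting GRU coefficients. Under a common parameterization (three weight blocks for the reset gate, update gate, and candidate state, each with input weights, recurrent weights, and one bias vector; exact bias conventions vary by implementation), splitting one 256-dimensional GRU into two independent 128-dimensional GRUs roughly halves the coefficient count even though the total vector size stays 256:

```python
def gru_params(d_in: int, d_h: int) -> int:
    # Three weight blocks, each with input weights (d_in x d_h),
    # recurrent weights (d_h x d_h), and a bias vector (d_h).
    return 3 * (d_in * d_h + d_h * d_h + d_h)

full = gru_params(256, 256)                          # one 256-dim GRU
split = gru_params(128, 128) + gru_params(128, 128)  # two independent 128-dim GRUs

print(full, split)   # 393984 197376: splitting roughly halves the coefficients
```

The multiply-accumulate count scales the same way, since the dominant terms are the d_in * d_h and d_h * d_h matrix products; this is the quadratic saving the embodiment exploits.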
Next, the second embodiment will be described. Unless otherwise noted, the matters described in the first embodiment and elsewhere are also applicable to the second embodiment.
The flow of processing performed by the information processing apparatus 200 will be described with reference to the flowchart shown in FIG. 8.
Next, the third embodiment will be described. Unless otherwise noted, the matters described in the first and second embodiments and elsewhere are also applicable to the third embodiment. The third embodiment is, roughly speaking, a configuration that combines the first and second embodiments.
FIG. 10 shows specific numerical examples of the number of coefficients used in the DNN section for the first to third embodiments described above. Four basic configurations were used: the general configuration (see FIG. 1), a configuration with multiple sub-neural network units (see FIG. 4), a configuration with a common encoder (see FIG. 7), and a configuration with both multiple sub-neural network units and a common encoder (see FIG. 9). The number of sound sources to be separated was 2 or 10, and sub-neural network sections were provided so as to correspond to the number of sound sources to be separated.
Although a plurality of embodiments of the present disclosure have been described above, the present disclosure is not limited to the embodiments described above, and various modifications are possible without departing from the spirit of the present disclosure.
(1)
A neural network unit generates sound source separation information for separating a predetermined sound source signal from a mixed sound signal containing a plurality of sound source signals,
an encoder included in the neural network unit converts a feature amount extracted from the mixed sound signal,
a processing result of the encoder is input to each of a plurality of sub-neural network units included in the neural network unit, and
the processing result of the encoder and the processing results of the plurality of sub-neural network units are input to a decoder included in the neural network unit.
A program that causes a computer to execute the above information processing method.
(2)
The sub-neural network unit is a recurrent neural network that, for the current input, uses processing results obtained at least one of temporally in the past and in the future.
The program according to (1).
(3)
The recurrent neural network is a neural network that uses GRU (Gated Recurrent Unit) or LSTM (Long Short Term Memory) as its algorithm.
The program according to (2).
(4)
The encoder performs the conversion by compressing the size of the feature amount.
The program according to any one of (1) to (3).
(5)
The feature amount and the size of the feature amount are defined by a multidimensional vector and the number of dimensions of the vector, and
the encoder compresses the number of dimensions of the vector.
The program according to (4).
(6)
The size of the feature amount is divided evenly so as to correspond to the number of the plurality of sub-neural network units, and
each feature amount of the divided size is input to the corresponding sub-neural network unit.
The program according to (4) or (5).
(7)
The size of the feature amount is divided unevenly, and
each feature amount of the divided size is input to the corresponding sub-neural network unit.
The program according to (4) or (5).
(8)
The encoder is composed of one or more affine transformation units.
The program according to any one of (1) to (7).
(9)
The decoder generates the sound source separation information based on the processing result of the encoder and the processing results of the plurality of sub-neural networks.
The program according to any one of (4) to (7).
(10)
The decoder is composed of one or more affine transformation units.
The program according to any one of (1) to (9).
(11)
A feature amount extraction unit extracts the feature amount from the mixed sound signal.
The program according to any one of (1) to (10).
(12)
A calculation unit multiplies the feature amount of the mixed sound signal by the sound source separation information output from the decoder.
The program according to any one of (1) to (11).
(13)
A separated sound source signal generation unit generates the predetermined sound source signal based on the calculation result of the calculation unit.
The program according to (12).
(14)
A neural network unit generates sound source separation information for separating a predetermined sound source signal from a mixed sound signal containing a plurality of sound source signals,
an encoder included in the neural network unit converts a feature amount extracted from the mixed sound signal,
a processing result of the encoder is input to each of a plurality of sub-neural network units included in the neural network unit, and
the processing result of the encoder and the processing results of the plurality of sub-neural network units are input to a decoder included in the neural network unit.
An information processing method.
(15)
A neural network unit generates sound source separation information for separating a predetermined sound source signal from a mixed sound signal containing a plurality of sound source signals,
an encoder included in the neural network unit converts a feature amount extracted from the mixed sound signal,
a processing result of the encoder is input to each of a plurality of sub-neural network units included in the neural network unit, and
the processing result of the encoder and the processing results of the plurality of sub-neural network units are input to a decoder included in the neural network unit.
A recording medium on which a program causing a computer to execute the above information processing method is recorded.
(16)
An information processing apparatus comprising
a neural network unit that generates sound source separation information for separating a predetermined sound source signal from a mixed sound signal containing a plurality of sound source signals,
the neural network unit comprising:
an encoder that converts a feature amount extracted from the mixed sound signal,
a plurality of sub-neural network units to which the processing result of the encoder is input, and
a decoder to which the processing result of the encoder and the processing results of the plurality of sub-neural network units are input.
(17)
Each of a plurality of neural network units generates sound source separation information for separating a different sound source signal from a mixed sound signal containing a plurality of sound source signals,
an encoder included in one of the plurality of neural network units converts a feature amount extracted from the mixed sound signal, and
the processing result of the encoder is input to each of the sub-neural network units included in the plurality of neural network units.
A program that causes a computer to execute the above information processing method.
(18)
Each of the neural network units comprises a plurality of the sub-neural network units, and
the processing result of the encoder is input to each of the plurality of sub-neural network units.
The program according to (17).
(19)
A calculation unit included in each of the neural network units multiplies the feature amount of the mixed sound signal by the sound source separation information output from the decoder, and
a filter unit separates the predetermined sound source signal based on the processing results of the plurality of calculation units.
The program according to (17) or (18).
(20)
Each of a plurality of neural network units generates sound source separation information for separating a different sound source signal from a mixed sound signal containing a plurality of sound source signals,
an encoder included in one of the plurality of neural network units converts a feature amount extracted from the mixed sound signal, and
the processing result of the encoder is input to each of the sub-neural network units included in the plurality of neural network units.
An information processing method.
(21)
Each of a plurality of neural network units generates sound source separation information for separating a different sound source signal from a mixed sound signal containing a plurality of sound source signals,
an encoder included in one of the plurality of neural network units converts a feature amount extracted from the mixed sound signal, and
the processing result of the encoder is input to each of the sub-neural network units included in the plurality of neural network units.
A recording medium on which a program causing a computer to execute the above information processing method is recorded.
(22)
An information processing apparatus comprising
a plurality of neural network units that generate sound source separation information for separating a predetermined sound source signal from a mixed sound signal containing a plurality of sound source signals,
each of the neural network units comprising:
a sub-neural network unit, and
a decoder to which the processing result of the sub-neural network is input,
wherein one of the plurality of neural network units comprises an encoder that converts a feature amount extracted from the mixed sound signal, and
the processing result of the encoder is input to each of the sub-neural network units included in the plurality of neural network units.
4, 7: Multiplication unit
5, 8: Separated signal generation unit
6, 11: DNN unit
9: Filter unit
12, 13: Sub-neural network unit
31: Encoder
32: Decoder
100, 200, 300: Information processing apparatus
Claims (22)
- A neural network unit generates sound source separation information for separating a predetermined sound source signal from a mixed sound signal containing a plurality of sound source signals,
an encoder included in the neural network unit converts a feature amount extracted from the mixed sound signal,
a processing result of the encoder is input to each of a plurality of sub-neural network units included in the neural network unit, and
the processing result of the encoder and the processing results of the plurality of sub-neural network units are input to a decoder included in the neural network unit.
A program that causes a computer to execute the above information processing method. - The sub-neural network unit is a recurrent neural network that, for the current input, uses processing results obtained at least one of temporally in the past and in the future.
The program according to claim 1. - The recurrent neural network is a neural network that uses GRU (Gated Recurrent Unit) or LSTM (Long Short Term Memory) as its algorithm.
The program according to claim 2. - The encoder performs the conversion by compressing the size of the feature amount.
The program according to claim 1. - The feature amount and the size of the feature amount are defined by a multidimensional vector and the number of dimensions of the vector, and
the encoder compresses the number of dimensions of the vector.
The program according to claim 4. - The size of the feature amount is divided evenly so as to correspond to the number of the plurality of sub-neural network units, and
each feature amount of the divided size is input to the corresponding sub-neural network unit.
The program according to claim 4. - The size of the feature amount is divided unevenly, and
each feature amount of the divided size is input to the corresponding sub-neural network unit.
The program according to claim 4. - The encoder is composed of one or more affine transformation units.
The program according to claim 1. - The decoder generates the sound source separation information based on the processing result of the encoder and the processing results of the plurality of sub-neural networks.
The program according to claim 4. - The decoder is composed of one or more affine transformation units.
The program according to claim 1. - A feature amount extraction unit extracts the feature amount from the mixed sound signal.
The program according to claim 1. - A calculation unit multiplies the feature amount of the mixed sound signal by the sound source separation information output from the decoder.
The program according to claim 1. - A separated sound source signal generation unit generates the predetermined sound source signal based on the calculation result of the calculation unit.
The program according to claim 12. - A neural network unit generates sound source separation information for separating a predetermined sound source signal from a mixed sound signal containing a plurality of sound source signals,
an encoder included in the neural network unit converts a feature amount extracted from the mixed sound signal,
a processing result of the encoder is input to each of a plurality of sub-neural network units included in the neural network unit, and
the processing result of the encoder and the processing results of the plurality of sub-neural network units are input to a decoder included in the neural network unit.
An information processing method. - A neural network unit generates sound source separation information for separating a predetermined sound source signal from a mixed sound signal containing a plurality of sound source signals,
an encoder included in the neural network unit converts a feature amount extracted from the mixed sound signal,
a processing result of the encoder is input to each of a plurality of sub-neural network units included in the neural network unit, and
the processing result of the encoder and the processing results of the plurality of sub-neural network units are input to a decoder included in the neural network unit.
A recording medium on which a program causing a computer to execute the above information processing method is recorded. - An information processing apparatus comprising a neural network unit that generates sound source separation information for separating a predetermined sound source signal from a mixed sound signal containing a plurality of sound source signals,
the neural network unit comprising:
an encoder that converts a feature amount extracted from the mixed sound signal,
a plurality of sub-neural network units to which the processing result of the encoder is input, and
a decoder to which the processing result of the encoder and the processing results of the plurality of sub-neural network units are input.
- Each of a plurality of neural network units generates sound source separation information for separating a different sound source signal from a mixed sound signal containing a plurality of sound source signals,
an encoder included in one of the plurality of neural network units converts a feature amount extracted from the mixed sound signal, and
the processing result of the encoder is input to each of the sub-neural network units included in the plurality of neural network units.
A program that causes a computer to execute the above information processing method. - Each of the neural network units comprises a plurality of the sub-neural network units, and
the processing result of the encoder is input to each of the plurality of sub-neural network units.
The program according to claim 17. - A calculation unit included in each of the neural network units multiplies the feature amount of the mixed sound signal by the sound source separation information output from the decoder, and
a filter unit separates the predetermined sound source signal based on the processing results of the plurality of calculation units.
The program according to claim 18. - Each of a plurality of neural network units generates sound source separation information for separating a different sound source signal from a mixed sound signal containing a plurality of sound source signals,
an encoder included in one of the plurality of neural network units converts a feature amount extracted from the mixed sound signal, and
the processing result of the encoder is input to each of the sub-neural network units included in the plurality of neural network units.
An information processing method. - Each of a plurality of neural network units generates sound source separation information for separating a different sound source signal from a mixed sound signal containing a plurality of sound source signals,
an encoder included in one of the plurality of neural network units converts a feature amount extracted from the mixed sound signal, and
the processing result of the encoder is input to each of the sub-neural network units included in the plurality of neural network units.
A recording medium on which a program causing a computer to execute the above information processing method is recorded. - An information processing apparatus comprising a plurality of neural network units that generate sound source separation information for separating a predetermined sound source signal from a mixed sound signal containing a plurality of sound source signals,
each of the neural network units comprising:
a sub-neural network unit, and
a decoder to which the processing result of the sub-neural network is input,
wherein one of the plurality of neural network units comprises an encoder that converts a feature amount extracted from the mixed sound signal, and
the processing result of the encoder is input to each of the sub-neural network units included in the plurality of neural network units.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202280044986.5A CN117616500A (zh) | 2021-06-29 | 2022-02-09 | 程序、信息处理方法、记录介质和信息处理装置 |
JP2023531372A JPWO2023276235A1 (ja) | 2021-06-29 | 2022-02-09 | |
EP22832403.4A EP4365897A1 (en) | 2021-06-29 | 2022-02-09 | Program, information processing method, recording medium, and information processing device |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2021-108134 | 2021-06-29 | ||
JP2021108134 | 2021-06-29 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023276235A1 true WO2023276235A1 (ja) | 2023-01-05 |
Family
ID=84691122
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2022/005007 WO2023276235A1 (ja) | 2021-06-29 | 2022-02-09 | プログラム、情報処理方法、記録媒体および情報処理装置 |
Country Status (4)
Country | Link |
---|---|
EP (1) | EP4365897A1 (ja) |
JP (1) | JPWO2023276235A1 (ja) |
CN (1) | CN117616500A (ja) |
WO (1) | WO2023276235A1 (ja) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH02287859A (ja) * | 1989-04-28 | 1990-11-27 | Hitachi Ltd | データ処理装置 |
WO2018047643A1 (ja) | 2016-09-09 | 2018-03-15 | ソニー株式会社 | 音源分離装置および方法、並びにプログラム |
US20190066713A1 (en) * | 2016-06-14 | 2019-02-28 | The Trustees Of Columbia University In The City Of New York | Systems and methods for speech separation and neural decoding of attentional selection in multi-speaker environments |
WO2020039571A1 (ja) * | 2018-08-24 | 2020-02-27 | 三菱電機株式会社 | 音声分離装置、音声分離方法、音声分離プログラム、及び音声分離システム |
JP2020071482A (ja) * | 2018-10-30 | 2020-05-07 | 富士通株式会社 | 語音分離方法、語音分離モデル訓練方法及びコンピュータ可読媒体 |
JP2020134657A (ja) * | 2019-02-18 | 2020-08-31 | 日本電信電話株式会社 | 信号処理装置、学習装置、信号処理方法、学習方法及びプログラム |
JP2021076831A (ja) * | 2019-10-21 | 2021-05-20 | ソニーグループ株式会社 | 電子機器、方法およびコンピュータプログラム |
-
2022
- 2022-02-09 JP JP2023531372A patent/JPWO2023276235A1/ja active Pending
- 2022-02-09 WO PCT/JP2022/005007 patent/WO2023276235A1/ja active Application Filing
- 2022-02-09 EP EP22832403.4A patent/EP4365897A1/en active Pending
- 2022-02-09 CN CN202280044986.5A patent/CN117616500A/zh active Pending
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH02287859A (ja) * | 1989-04-28 | 1990-11-27 | Hitachi Ltd | データ処理装置 |
US20190066713A1 (en) * | 2016-06-14 | 2019-02-28 | The Trustees Of Columbia University In The City Of New York | Systems and methods for speech separation and neural decoding of attentional selection in multi-speaker environments |
WO2018047643A1 (ja) | 2016-09-09 | 2018-03-15 | ソニー株式会社 | 音源分離装置および方法、並びにプログラム |
WO2020039571A1 (ja) * | 2018-08-24 | 2020-02-27 | 三菱電機株式会社 | 音声分離装置、音声分離方法、音声分離プログラム、及び音声分離システム |
JP2020071482A (ja) * | 2018-10-30 | 2020-05-07 | 富士通株式会社 | 語音分離方法、語音分離モデル訓練方法及びコンピュータ可読媒体 |
JP2020134657A (ja) * | 2019-02-18 | 2020-08-31 | 日本電信電話株式会社 | 信号処理装置、学習装置、信号処理方法、学習方法及びプログラム |
JP2021076831A (ja) * | 2019-10-21 | 2021-05-20 | ソニーグループ株式会社 | 電子機器、方法およびコンピュータプログラム |
Non-Patent Citations (1)
Title |
---|
WANG ZHONG-QIU; ROUX JONATHAN LE; HERSHEY JOHN R.: "Alternative Objective Functions for Deep Clustering", 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 15 April 2018 (2018-04-15), pages 686 - 690, XP033401818, DOI: 10.1109/ICASSP.2018.8462507 * |
Also Published As
Publication number | Publication date |
---|---|
JPWO2023276235A1 (ja) | 2023-01-05 |
CN117616500A (zh) | 2024-02-27 |
EP4365897A1 (en) | 2024-05-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Lee et al. | Blind source separation of real world signals | |
US4624010A (en) | Speech recognition apparatus | |
US7676374B2 (en) | Low complexity subband-domain filtering in the case of cascaded filter banks | |
AU703046B2 (en) | Speech encoding method | |
CN101057275B (zh) | 矢量变换装置以及矢量变换方法 | |
WO2003069499A1 (en) | Filter set for frequency analysis | |
US7912711B2 (en) | Method and apparatus for speech data | |
US5245662A (en) | Speech coding system | |
Luo et al. | Group communication with context codec for lightweight source separation | |
WO2023276235A1 (ja) | Program, information processing method, recording medium, and information processing device | |
WO2022079263A1 (en) | A generative neural network model for processing audio samples in a filter-bank domain | |
CN115485769A (zh) | Method, apparatus and system for enhancing multi-channel audio in a dynamic-range-reduced domain | |
JPH02287399A (ja) | Vector quantization control system | |
Thakare et al. | Comparative analysis of emotion recognition system | |
US6078881A (en) | Speech encoding and decoding method and speech encoding and decoding apparatus | |
US5777249A (en) | Electronic musical instrument with reduced storage of waveform information | |
JPH09127985A (ja) | Signal encoding method and device | |
CN117546237A (zh) | Decoder | |
US6882976B1 (en) | Efficient finite length POW10 calculation for MPEG audio encoding | |
GB2059726A (en) | Sound synthesizer | |
Raj et al. | Audio signal quality enhancement using multi-layered convolutional neural network based auto encoder–decoder | |
US7283961B2 (en) | High-quality speech synthesis device and method by classification and prediction processing of synthesized sound | |
Lee et al. | Stacked U-Net with high-level feature transfer for parameter efficient speech enhancement | |
JPH09127994A (ja) | Signal encoding method and device | |
Yecchuri et al. | Sub-convolutional U-Net with transformer attention network for end-to-end single-channel speech enhancement |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22832403; Country of ref document: EP; Kind code of ref document: A1 |
| WWE | Wipo information: entry into national phase | Ref document number: 2023531372; Country of ref document: JP |
| WWE | Wipo information: entry into national phase | Ref document number: 18572196; Country of ref document: US |
| WWE | Wipo information: entry into national phase | Ref document number: 202280044986.5; Country of ref document: CN |
| WWE | Wipo information: entry into national phase | Ref document number: 2022832403; Country of ref document: EP |
| NENP | Non-entry into the national phase | Ref country code: DE |
| ENP | Entry into the national phase | Ref document number: 2022832403; Country of ref document: EP; Effective date: 20240129 |