WO2022215199A1 - Information processing device, output method, and output program - Google Patents
Information processing device, output method, and output program
- Publication number: WO2022215199A1 (PCT/JP2021/014790)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sound
- target sound
- target
- sound signal
- signal
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/0308—Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0264—Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
Definitions
- the present disclosure relates to an information processing device, an output method, and an output program.
- In some cases, even with the above technique, it is difficult for the device to output the target sound signal, which is a signal representing the target sound.
- the purpose of the present disclosure is to output a target sound signal.
- the information processing device includes: an acquisition unit that acquires sound source position information, which is position information of a sound source of a target sound, a mixed sound signal, which is a signal representing a mixed sound including the target sound and an interfering sound, and a trained model; a sound feature extraction unit that extracts a plurality of sound feature amounts based on the mixed sound signal; an emphasizing unit that emphasizes the sound feature amount in the target sound direction; an estimating unit that estimates the target sound direction based on the plurality of sound feature amounts and the sound source position information; a mask feature amount extracting unit that extracts a mask feature amount, which is a feature amount in which the feature amount of the target sound direction is masked, based on the estimated target sound direction and the plurality of sound feature amounts; a generation unit that generates a target sound direction emphasized sound signal, which is a sound signal in which the target sound direction is emphasized, based on the emphasized sound feature amount, and generates a target sound direction masking sound signal, which is a sound signal in which the target sound direction is masked, based on the mask feature amount; and a target sound signal output unit that outputs a target sound signal representing the target sound using the target sound direction emphasized sound signal, the target sound direction masking sound signal, and the trained model.
- a target sound signal can be output.
- FIG. 1 is a diagram showing an example of a target sound signal output system according to Embodiment 1;
- FIG. 2 is a diagram showing hardware included in the information processing apparatus according to Embodiment 1;
- FIG. 3 is a block diagram showing functions of the information processing apparatus according to Embodiment 1;
- FIG. 4 is a diagram showing a configuration example of a trained model according to Embodiment 1;
- FIG. 5 is a flowchart showing an example of processing executed by the information processing apparatus according to Embodiment 1;
- FIG. 6 is a block diagram showing functions of the learning device according to Embodiment 1;
- FIG. 7 is a flowchart showing an example of processing executed by the learning device according to Embodiment 1;
- FIG. 8 is a block diagram showing functions of the information processing apparatus according to Embodiment 2;
- FIG. 9 is a flowchart showing an example of processing executed by the information processing apparatus according to Embodiment 2;
- FIG. 10 is a block diagram showing functions of the information processing apparatus according to Embodiment 3;
- FIG. 11 is a flowchart showing an example of processing executed by the information processing apparatus according to Embodiment 3;
- FIG. 12 is a block diagram showing functions of the information processing apparatus according to Embodiment 4;
- FIG. 13 is a flowchart showing an example of processing executed by the information processing apparatus according to Embodiment 4.
- FIG. 1 is a diagram showing an example of a target sound signal output system according to Embodiment 1.
- the target sound signal output system includes an information processing device 100 and a learning device 200 .
- the information processing device 100 is a device that executes an output method.
- the information processing apparatus 100 outputs the target sound signal using the learned model.
- a trained model is generated by the learning device 200 .
- the information processing apparatus 100 will be described in the utilization phase.
- the learning device 200 will be described in the learning phase.
- the utilization phase will be explained.
- FIG. 2 is a diagram showing hardware included in the information processing apparatus according to the first embodiment.
- the information processing device 100 has a processor 101 , a volatile memory device 102 and a nonvolatile memory device 103 .
- the processor 101 controls the information processing apparatus 100 as a whole.
- the processor 101 is a CPU (Central Processing Unit), FPGA (Field Programmable Gate Array), or the like.
- Processor 101 may be a multiprocessor.
- the information processing device 100 may have a processing circuit.
- the processing circuit may be a single circuit or multiple circuits.
- the volatile memory device 102 is the main memory device of the information processing device 100 .
- the volatile memory device 102 is RAM (Random Access Memory).
- the nonvolatile storage device 103 is an auxiliary storage device of the information processing device 100 .
- the nonvolatile memory device 103 is a HDD (Hard Disk Drive) or an SSD (Solid State Drive).
- a storage area secured by the volatile storage device 102 or the nonvolatile storage device 103 is called a storage unit.
- FIG. 3 is a block diagram showing functions of the information processing apparatus according to the first embodiment.
- the information processing apparatus 100 includes an acquisition unit 120 , a sound feature extraction unit 130 , an enhancement unit 140 , an estimation unit 150 , a mask feature extraction unit 160 , a generation unit 170 and a target sound signal output unit 180 .
- Part or all of the acquisition unit 120, the sound feature amount extraction unit 130, the enhancement unit 140, the estimation unit 150, the mask feature extraction unit 160, the generation unit 170, and the target sound signal output unit 180 may be implemented by processing circuits. Also, part or all of these units may be implemented as modules of a program executed by the processor 101. For example, the program executed by the processor 101 is also called an output program. For example, the output program is recorded on a recording medium.
- the storage unit may store the sound source position information 111 and the learned model 112.
- the sound source position information 111 is position information of the sound source of the target sound. For example, when the target sound is a voice uttered by the target sound speaker, the sound source position information 111 is position information of the target sound speaker.
- the acquisition unit 120 acquires the sound source position information 111.
- the acquisition unit 120 acquires the sound source position information 111 from the storage unit.
- the sound source location information 111 may be stored in an external device (for example, a cloud server).
- the obtaining unit 120 obtains the sound source position information 111 from the external device.
- the acquisition unit 120 acquires the learned model 112. For example, the acquisition unit 120 acquires the trained model 112 from the storage unit. Also, for example, the acquisition unit 120 acquires the trained model 112 from the learning device 200 .
- the acquisition unit 120 acquires the mixed sound signal.
- the acquiring unit 120 acquires a mixed sound signal from a microphone array including N (N is an integer equal to or greater than 2) microphones.
- a mixed sound signal is a signal indicating a mixed sound including a target sound and an interfering sound.
- a mixed sound signal may be expressed as N sound signals.
- the target sound is a voice uttered by the target sound speaker, a sound uttered by an animal, or the like.
- An interfering sound is a sound that interferes with a target sound.
- the mixed sound may include noise.
- mixed sound includes target sound, interfering sound, and noise.
- the sound feature amount extraction unit 130 extracts a plurality of sound feature amounts based on the mixed sound signal. For example, the sound feature amount extraction unit 130 extracts, as the plurality of sound feature amounts, the power spectrum time series obtained by performing a short-time Fourier transform (STFT) on the mixed sound signal. Note that the plurality of extracted sound feature amounts may be expressed as N sound feature amounts.
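The STFT-based feature extraction can be sketched in a few lines of numpy. This is an illustrative sketch, not the patented implementation; the frame length, hop size, and Hann window are assumptions.

```python
import numpy as np

def stft(signal, frame_len=256, hop=128):
    """Short-time Fourier transform via framed FFT with a Hann window."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)      # shape: (frames, freq bins)

def extract_features(mixed, frame_len=256, hop=128):
    """Power-spectrum time series for each of the N microphone channels."""
    return np.stack([np.abs(stft(ch, frame_len, hop)) ** 2 for ch in mixed])

# N = 2 microphones, one second of a mixed sound at an assumed 8 kHz rate
rng = np.random.default_rng(0)
mixed = rng.standard_normal((2, 8000))
feats = extract_features(mixed)
print(feats.shape)   # → (2, 61, 129)
```

Each channel yields one power spectrogram; stacking them gives the "N sound feature amounts" of the text.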
- Based on the sound source position information 111, the emphasizing unit 140 emphasizes the sound feature quantity in the direction of the target sound among the plurality of sound feature quantities. For example, the emphasizing unit 140 emphasizes the sound feature quantity in the target sound direction using the plurality of sound feature quantities, the sound source position information 111, and an MVDR (Minimum Variance Distortionless Response) beamformer.
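The MVDR beamformer has the well-known closed form w = R⁻¹a / (aᴴR⁻¹a), where R is a spatial covariance matrix and a the steering vector toward the target direction. A minimal numpy sketch, with a toy identity covariance and a hypothetical steering vector:

```python
import numpy as np

def mvdr_weights(R, a):
    """MVDR weights w = R^{-1} a / (a^H R^{-1} a) for one frequency bin.

    R: (N, N) spatial covariance, a: (N,) steering vector toward the target."""
    Ri_a = np.linalg.solve(R, a)
    return Ri_a / (a.conj() @ Ri_a)

# Toy example: N = 3 microphones, identity covariance (an assumption)
N = 3
a = np.exp(1j * np.linspace(0, 1, N))   # hypothetical steering vector
R = np.eye(N, dtype=complex)
w = mvdr_weights(R, a)
# Distortionless constraint toward the target direction: w^H a == 1
print(np.allclose(w.conj() @ a, 1.0))   # → True
```

Applying `w.conj() @ x` per time-frequency bin emphasizes the target direction while minimizing power from other directions.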
- the estimation unit 150 estimates the target sound direction based on the plurality of sound feature quantities and the sound source position information 111. Specifically, the estimation unit 150 estimates the direction of the target sound using Equation (1), where l denotes the time frame, k denotes the frequency, x_{l,k} denotes the sound feature quantity corresponding to the sound signal obtained from the microphone closest to the sound source position of the target sound specified based on the sound source position information 111 (x_{l,k} may be thought of as the STFT spectrum), a_{θ,k} denotes the steering vector in a certain angular direction θ, and the superscript H denotes the conjugate transpose.
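Equation (1) itself is not reproduced in this text, but a common form of such steering-vector-based direction estimation picks the angle θ maximizing Σ_k |a_{θ,k}ᴴ x_{l,k}|². The sketch below assumes a uniform linear array with a hypothetical 5 cm spacing; it illustrates the idea, not the patent's exact formula.

```python
import numpy as np

def steering_vector(theta_deg, freqs, n_mics, d=0.05, c=343.0):
    """Far-field steering vectors a_{theta,k} for a uniform linear array
    with spacing d meters; returns shape (K, N)."""
    tau = d * np.arange(n_mics) * np.sin(np.deg2rad(theta_deg)) / c
    return np.exp(-2j * np.pi * freqs[:, None] * tau[None, :])

def estimate_direction(x, candidates, freqs, n_mics):
    """Angle maximising sum_k |a_{theta,k}^H x_{l,k}|^2 over one frame."""
    def score(th):
        a = steering_vector(th, freqs, n_mics)
        return np.sum(np.abs(np.einsum('kn,kn->k', a.conj(), x)) ** 2)
    return max(candidates, key=score)

freqs = np.linspace(0.0, 4000.0, 129)     # bins of an assumed 8 kHz STFT
x = steering_vector(30, freqs, 2)         # frame observed from 30 degrees
print(estimate_direction(x, [0, 30, 60], freqs, 2))   # → 30
```

The matched angle scores K·N² (the Cauchy-Schwarz maximum), so the true direction wins.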
- the mask feature amount extraction unit 160 extracts mask feature amounts based on the estimated target sound direction and the plurality of sound feature amounts.
- the masked feature amount is a feature amount in which the feature amount in the direction of the target sound is masked. The extraction processing of the mask feature amount will now be described in detail.
- the mask feature quantity extraction unit 160 creates a direction mask based on the direction of the target sound.
- a direction mask is a mask for extracting a sound in which the target sound direction is emphasized.
- the mask is a matrix of the same size as the sound features.
- the mask feature amount extraction unit 160 extracts the mask feature amounts by taking the element-wise product of the plurality of sound feature amounts and the mask matrix.
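The element-wise masking step can be sketched as follows; the mask values and array shapes are illustrative assumptions (zeros at bins attributed to the target direction, ones elsewhere):

```python
import numpy as np

def masked_features(feats, mask):
    """Element-wise product of each channel's features with the direction
    mask; bins attributed to the target direction are zeroed, so the
    target direction is suppressed in the result."""
    return feats * mask          # broadcasting over the channel axis

rng = np.random.default_rng(2)
feats = rng.random((2, 4, 5))    # (channels, frames, bins), toy values
mask = np.ones((4, 5))
mask[:, 1] = 0.0                 # bin 1 attributed to the target direction
masked = masked_features(feats, mask)
print(masked[:, :, 1].max())     # → 0.0
```

All other bins pass through unchanged, matching "a matrix of the same size as the sound features" applied by element product.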
- the generation unit 170 generates a sound signal in which the target sound direction is emphasized (hereinafter referred to as a target sound direction emphasized sound signal) based on the sound feature quantity emphasized by the emphasis unit 140 .
- the generation unit 170 generates a target sound direction emphasized sound signal using the sound feature quantity emphasized by the emphasis unit 140 and an inverse short-time Fourier transform (ISTFT).
- the generation unit 170 generates a sound signal in which the target sound direction is masked (hereinafter referred to as a target sound direction masking sound signal) based on the mask feature amount. For example, the generation unit 170 generates the target sound direction masking sound signal using the mask feature amount and the inverse short-time Fourier transform.
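Signal generation by inverse STFT amounts to an inverse FFT per frame plus overlap-add. A self-contained sketch (Hann window and half-overlap are assumptions) that round-trips a test tone:

```python
import numpy as np

def stft(signal, frame_len=256, hop=128):
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.fft.rfft(np.stack(
        [signal[i * hop:i * hop + frame_len] * window
         for i in range(n_frames)]), axis=1)

def istft(spec, frame_len=256, hop=128):
    """Inverse STFT by windowed overlap-add, normalised by the summed
    squared window so interior samples reconstruct exactly."""
    n_frames = spec.shape[0]
    out = np.zeros(hop * (n_frames - 1) + frame_len)
    norm = np.zeros_like(out)
    window = np.hanning(frame_len)
    for i, frame in enumerate(np.fft.irfft(spec, n=frame_len, axis=1)):
        out[i * hop:i * hop + frame_len] += frame * window
        norm[i * hop:i * hop + frame_len] += window ** 2
    return out / np.maximum(norm, 1e-8)

t = np.arange(4096)
x = np.sin(2 * np.pi * 440 * t / 8000)   # 440 Hz tone at an assumed 8 kHz
y = istft(stft(x))
print(np.allclose(x[256:-256], y[256:-256], atol=1e-6))  # → True
```

Only the first and last frame edges deviate, where the window tapers to zero.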
- the target sound direction emphasized sound signal and the target sound direction masking sound signal may be input to the learning device 200 as learning signals.
- the target sound signal output unit 180 uses the target sound direction emphasized sound signal, the target sound direction masking sound signal, and the learned model 112 to output the target sound signal.
- a configuration example of the trained model 112 will be described.
- FIG. 4 is a diagram showing a configuration example of a trained model according to Embodiment 1.
- the trained model 112 includes an Encoder 112a, a Separator 112b, and a Decoder 112c.
- the Encoder 112a estimates a target sound direction-enhanced time-frequency representation of "M dimensions x time" based on the target sound direction-enhanced sound signal. Also, the Encoder 112a estimates a target sound direction masking time-frequency representation of "M dimensions x time" based on the target sound direction masking sound signal. For example, the Encoder 112a may use the power spectrum estimated by the STFT as the target sound direction enhancement time-frequency representation and the target sound direction masking time-frequency representation. Also, for example, the Encoder 112a may estimate the target sound direction enhancement time-frequency representation and the target sound direction masking time-frequency representation using a one-dimensional convolution operation.
- the target sound direction-enhanced time-frequency representation and the target sound direction masking time-frequency representation may be projected onto the same time-frequency representation space or onto different time-frequency representation spaces. Note that, for example, the estimation is described in Non-Patent Document 1.
- the Separator 112b estimates an "M dimensions x time" mask matrix based on the target sound direction enhancement time-frequency representation and the target sound direction masking time-frequency representation. Also, when the target sound direction enhancement time-frequency representation and the target sound direction masking time-frequency representation are input to the Separator 112b, they may be concatenated in the frequency axis direction. As a result, they are converted into a representation of "2M dimensions x time".
- the target sound direction enhancement time-frequency representation and the target sound direction masking time-frequency representation may be connected to axes different from the time axis and the frequency axis.
- the expression is converted to "M dimensions x time x 2".
- the target sound direction enhancement time-frequency representation and the target sound direction masking time-frequency representation may be weighted.
- the weighted target sound direction enhancement time-frequency representation and the weighted target sound direction masking time-frequency representation may be added together. Weights may be estimated in the trained model 112 .
- the Separator 112b is a neural network composed of an input layer, an intermediate layer, and an output layer. For example, for propagation between layers, a method combining a method similar to LSTM (Long Short Term Memory) and a one-dimensional convolution operation may be used.
- the decoder 112c multiplies the "M dimensions x time” target sound direction emphasized time-frequency expression by the "M dimensions x time” mask matrix.
- the Decoder 112c uses the information obtained by the multiplication and a method corresponding to the method used in the Encoder 112a to output the target sound signal. For example, if the method used by the encoder 112a is STFT, the decoder 112c uses the information obtained by the multiplication and the ISTFT to output the target sound signal. Also, for example, if the method used by the Encoder 112a is one-dimensional convolution, the Decoder 112c uses information obtained by multiplication and inverse one-dimensional convolution to output the target sound signal.
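The Encoder / Separator / Decoder data flow can be mimicked with random stand-in weights to make the tensor shapes concrete. Everything below (frame size, dimension M, sigmoid mask, tied decoder basis) is an illustrative assumption, not the trained model 112 itself:

```python
import numpy as np

class ToyModel:
    """Minimal numpy sketch of the Encoder / Separator / Decoder layout.
    Weights are random stand-ins; a real model would learn them."""
    def __init__(self, frame, M, seed=0):
        rng = np.random.default_rng(seed)
        self.enc = rng.standard_normal((M, frame)) / np.sqrt(frame)
        self.sep = rng.standard_normal((M, 2 * M)) / np.sqrt(2 * M)
        self.dec = self.enc.T                    # tied decoder basis

    def encode(self, signal, frame):
        frames = signal.reshape(-1, frame)       # non-overlapping frames
        return self.enc @ frames.T               # "M x time"

    def forward(self, enhanced, masked, frame):
        e = self.encode(enhanced, frame)         # target-direction-enhanced
        m = self.encode(masked, frame)           # target-direction-masked
        joined = np.concatenate([e, m], axis=0)  # "2M x time"
        mask = 1 / (1 + np.exp(-(self.sep @ joined)))  # values in (0, 1)
        out_frames = (self.dec @ (mask * e)).T   # mask applied to "e"
        return out_frames.reshape(-1)            # back to a waveform

frame, M = 32, 16
model = ToyModel(frame, M)
sig = np.sin(np.linspace(0, 20, 320))
out = model.forward(sig, 0.1 * np.ones(320), frame)
print(out.shape)    # → (320,)
```

The Separator's mask multiplies the enhanced representation, and the Decoder maps the product back to a signal of the input length, mirroring the description above.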
- the target sound signal output unit 180 may output the target sound signal to a speaker. As a result, the target sound is output from the speaker. Note that the illustration of the speaker is omitted.
- FIG. 5 is a flowchart illustrating an example of processing executed by the information processing apparatus according to the first embodiment;
- Step S11: The acquisition unit 120 acquires a mixed sound signal.
- Step S12: The sound feature amount extraction unit 130 extracts a plurality of sound feature amounts based on the mixed sound signal.
- Step S13: Based on the sound source position information 111, the emphasizing section 140 emphasizes the sound feature quantity in the direction of the target sound.
- Step S14: The estimation unit 150 estimates the target sound direction based on the plurality of sound feature quantities and the sound source position information 111.
- Step S15: The mask feature amount extraction unit 160 extracts mask feature amounts based on the estimated target sound direction and the plurality of sound feature amounts.
- Step S16: The generation unit 170 generates a target sound direction emphasized sound signal based on the sound feature quantity emphasized by the emphasizing unit 140. Further, the generation unit 170 generates a target sound direction masking sound signal based on the mask feature amount.
- Step S17: The target sound signal output unit 180 uses the target sound direction emphasized sound signal, the target sound direction masking sound signal, and the learned model 112 to output the target sound signal.
- steps S14 and S15 may be executed in parallel with step S13. Moreover, steps S14 and S15 may be performed before step S13.
- FIG. 6 is a block diagram showing functions of the learning device according to the first embodiment.
- the learning device 200 has a sound data storage unit 211 , an impulse response storage unit 212 , a noise storage unit 213 , an impulse response application unit 220 , a mixing unit 230 , a processing execution unit 240 and a learning unit 250 .
- the sound data storage unit 211, the impulse response storage unit 212, and the noise storage unit 213 may be realized as storage areas secured by a volatile storage device or a non-volatile storage device of the learning device 200.
- a part or all of the impulse response application unit 220, the mixing unit 230, the processing execution unit 240, and the learning unit 250 may be realized by a processing circuit of the learning device 200. Also, part or all of the impulse response application unit 220, the mixing unit 230, the processing execution unit 240, and the learning unit 250 may be realized as modules of a program executed by the processor of the learning device 200.
- the sound data storage unit 211 stores the target sound signal and the interfering sound signal.
- the interfering sound signal is a signal indicating interfering sound.
- the impulse response storage unit 212 stores impulse response data.
- the noise storage unit 213 stores noise signals. Note that the noise signal is a signal indicating noise.
- the impulse response application section 220 convolves one target sound signal stored in the sound data storage section 211 and an arbitrary number of interfering sound signals stored in the sound data storage section 211 with the impulse response data corresponding to the position of the target sound and the position of the interfering sound, respectively.
- the mixing section 230 generates a mixed sound signal based on the sound signal output by the impulse response applying section 220 and the noise signal stored in the noise storage section 213 . Also, the sound signal output by the impulse response applying section 220 may be treated as a mixed sound signal.
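The impulse-response application and mixing steps can be sketched as convolution followed by summation. The delta impulse response below is a placeholder for measured room responses:

```python
import numpy as np

def make_mixed(target, interferers, target_ir, interferer_irs, noise):
    """Convolve each dry signal with the impulse response for its position,
    then sum everything with the noise, as the mixing section does."""
    n = len(noise)
    mix = np.convolve(target, target_ir)[:n]
    for s, ir in zip(interferers, interferer_irs):
        mix = mix + np.convolve(s, ir)[:n]
    return mix + noise

rng = np.random.default_rng(4)
target = rng.standard_normal(100)
interf = rng.standard_normal(100)
delta = np.zeros(8)
delta[0] = 1.0                    # identity impulse response (placeholder)
noise = 0.01 * rng.standard_normal(100)
mix = make_mixed(target, [interf], delta, [delta], noise)
print(np.allclose(mix, target + interf + noise))   # → True
```

With real impulse responses the convolution adds room reverberation to each source before mixing.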
- the learning device 200 may transmit the mixed sound signal to the information processing device 100 .
- the processing execution unit 240 generates a target sound direction emphasized sound signal and a target sound direction masking sound signal by executing steps S11 to S16. That is, the processing execution unit 240 generates a learning signal.
- the learning unit 250 learns using the learning signal. That is, the learning unit 250 performs learning for outputting the target sound signal using the target sound direction emphasized sound signal and the target sound direction masking sound signal.
- in the learning, input weighting factors, which are parameters of the neural network, are determined.
- the loss function shown in Non-Patent Document 1 may be used.
- the error may be calculated using the sound signal output by the impulse response application unit 220 and the loss function. Then, for example, in learning, an optimization method such as Adam is used, and the input weighting factors for each layer of the neural network are determined based on the backpropagation method.
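As a toy illustration of this optimization loop, the sketch below implements the standard Adam update and fits a small linear model with an MSE loss. The actual loss function is the one in Non-Patent Document 1, which is not reproduced here; the data and model are stand-ins.

```python
import numpy as np

def adam_step(w, grad, state, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (the optimizer named in the text)."""
    state['t'] += 1
    state['m'] = b1 * state['m'] + (1 - b1) * grad
    state['v'] = b2 * state['v'] + (1 - b2) * grad ** 2
    mhat = state['m'] / (1 - b1 ** state['t'])
    vhat = state['v'] / (1 - b2 ** state['t'])
    return w - lr * mhat / (np.sqrt(vhat) + eps)

# Toy "learning": regress training features onto a clean target with MSE.
rng = np.random.default_rng(5)
X = rng.standard_normal((200, 4))          # stand-in learning-signal features
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w                             # stand-in clean target
w = np.zeros(4)
state = {'t': 0, 'm': np.zeros(4), 'v': np.zeros(4)}
for _ in range(2000):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the MSE loss
    w = adam_step(w, grad, state)
print(f"final MSE: {np.mean((X @ w - y) ** 2):.4f}")
```

In the patent's setting the gradient would come from backpropagation through the Encoder / Separator / Decoder rather than this closed-form linear gradient.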
- the learning signal may be a learning signal generated by the processing execution unit 240 or a learning signal generated by the information processing apparatus 100 .
- FIG. 7 is a flowchart illustrating an example of processing executed by the learning device according to Embodiment 1.
- Step S21: The impulse response application unit 220 convolves the impulse response data with the target sound signal and the interfering sound signal.
- Step S22: The mixing section 230 generates a mixed sound signal based on the sound signal output by the impulse response applying section 220 and the noise signal.
- Step S23: The processing execution unit 240 generates a learning signal by executing steps S11 to S16.
- Step S24: The learning unit 250 learns using the learning signal. Then, the learned model 112 is generated by repeating the learning by the learning device 200.
- the information processing apparatus 100 uses the trained model 112 to output the target sound signal.
- the trained model 112 is a trained model generated by learning to output the target sound signal based on the target sound direction emphasized sound signal and the target sound direction masking sound signal. Specifically, by discriminating between the sound components that are emphasized or masked and those that are not, the trained model 112 outputs the target sound signal even when the angle between the target sound direction and the interfering sound direction is small. Therefore, even when the angle between the direction of the target sound and the direction of the interfering sound is small, the information processing device 100 can output the target sound signal by using the trained model 112.
- Embodiment 2 Next, Embodiment 2 will be described. In Embodiment 2, mainly matters different from Embodiment 1 will be described. In the second embodiment, descriptions of items common to the first embodiment are omitted.
- FIG. 8 is a block diagram showing functions of the information processing apparatus according to the second embodiment.
- the information processing device 100 further has a selection unit 190 .
- a part or all of the selection unit 190 may be implemented by a processing circuit. Also, part or all of the selection unit 190 may be implemented as a program module executed by the processor 101 .
- the selection unit 190 selects the sound signal of the channel in the direction of the target sound. In other words, the selection unit 190 selects the sound signal of the channel in the direction of the target sound from among the N sound signals based on the sound source position information 111 .
- the selected sound signal, the target sound direction emphasized sound signal, and the target sound direction masking sound signal may be input to the learning device 200 as learning signals.
- the target sound signal output unit 180 uses the selected sound signal, the target sound direction emphasized sound signal, the target sound direction masking sound signal, and the learned model 112 to output the target sound signal.
- the Encoder 112a estimates a target sound direction-enhanced time-frequency representation of "M dimensions x time" based on the target sound direction-enhanced sound signal. Also, the Encoder 112a estimates a target sound direction masking time-frequency representation of "M dimensions x time" based on the target sound direction masking sound signal. Further, the Encoder 112a estimates a mixed sound time-frequency representation of "M dimensions x time" based on the selected sound signal. For example, the Encoder 112a may use the power spectrum estimated by the STFT as the target sound direction enhancement time-frequency representation, the target sound direction masking time-frequency representation, and the mixed sound time-frequency representation.
- the Encoder 112a may estimate the target sound direction enhancement time-frequency representation, the target sound direction masking time-frequency representation, and the mixed sound time-frequency representation using a one-dimensional convolution operation.
- the target sound direction-enhanced time-frequency representation, the target sound direction masking time-frequency representation, and the mixed sound time-frequency representation may be projected onto the same time-frequency representation space or onto different time-frequency representation spaces. may be projected. Note that, for example, the estimation is described in Non-Patent Document 1.
- the Separator 112b estimates an "M dimensions x time" mask matrix based on the target sound direction enhancement time-frequency representation, the target sound direction masking time-frequency representation, and the mixed sound time-frequency representation. Also, when these three time-frequency representations are input to the Separator 112b, they may be concatenated along the frequency axis. As a result, they are converted into a representation of "3M dimensions x time".
- the target sound direction enhancement time-frequency representation, the target sound direction masking time-frequency representation, and the mixed sound time-frequency representation may be connected to axes different from the time axis and the frequency axis. As a result, the expression is converted to "M dimensions x time x 3".
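The two concatenation options, plus the weighted-sum variant the text also allows, differ only in tensor layout. A quick shape check (M, the time length, and the weights are arbitrary here):

```python
import numpy as np

M, T = 16, 10
enh = np.ones((M, T))            # target sound direction enhancement repr.
msk = np.zeros((M, T))           # target sound direction masking repr.
mix = np.full((M, T), 0.5)       # mixed sound repr.

# Option 1: concatenate along the frequency axis -> "3M x time"
freq_concat = np.concatenate([enh, msk, mix], axis=0)
# Option 2: stack along a new axis -> "M x time x 3"
new_axis = np.stack([enh, msk, mix], axis=-1)
# Weighted-sum variant: weighted representations added back to "M x time"
weights = np.array([0.5, 0.3, 0.2])     # stand-ins; a model could learn these
fused = (new_axis * weights).sum(axis=-1)

print(freq_concat.shape, new_axis.shape)   # → (48, 10) (16, 10, 3)
```

Which layout to use depends on whether the Separator treats the three representations as extra frequency channels, as a separate axis, or as terms of a learned weighted sum.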
- the target sound direction enhancement time-frequency representation, the target sound direction masking time-frequency representation, and the mixed sound time-frequency representation may be weighted.
- the weighted target sound direction enhancement time-frequency representation, the weighted target sound direction masking time-frequency representation, and the weighted mixed sound time-frequency representation may be added together. Weights may be estimated in the trained model 112 .
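The three combination strategies described above (frequency-axis concatenation into "3M dimensions × time", stacking along a new axis into "M dimensions × time × 3", and weighted addition) can be sketched as follows; the weight values are placeholders for quantities the trained model 112 would estimate.

```python
import numpy as np

M, T = 64, 100
emph = np.random.rand(M, T)   # target sound direction emphasized representation
mask = np.random.rand(M, T)   # target sound direction masking representation
mixed = np.random.rand(M, T)  # mixed sound representation

# (a) concatenate along the frequency axis: "3M dimensions x time"
cat = np.concatenate([emph, mask, mixed], axis=0)

# (b) stack along a new third axis: "M dimensions x time x 3"
stacked = np.stack([emph, mask, mixed], axis=-1)

# (c) weighted addition (illustrative weights; the real ones are learned)
w = np.array([0.5, 0.3, 0.2])
weighted = w[0] * emph + w[1] * mask + w[2] * mixed

print(cat.shape, stacked.shape, weighted.shape)
```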
- the processing of the decoder 112c is the same as in the first embodiment.
- the target sound signal output unit 180 outputs the target sound signal using the selected sound signal, the target sound direction emphasized sound signal, the target sound direction masking sound signal, and the trained model 112 .
- FIG. 9 is a flowchart illustrating an example of processing executed by the information processing apparatus according to the second embodiment.
- the process of FIG. 9 differs from the process of FIG. 5 in that steps S11a and S17a are executed. Therefore, steps S11a and S17a are explained with reference to FIG. 9, and the description of the processes other than steps S11a and S17a is omitted.
- Step S11a Using the mixed sound signal and the sound source position information 111, the selection unit 190 selects the sound signal of the channel in the direction of the target sound.
- Step S17a The target sound signal output unit 180 uses the selected sound signal, the target sound direction emphasized sound signal, the target sound direction masking sound signal, and the learned model 112 to output the target sound signal. Note that step S11a may be executed at any timing as long as it is executed before step S17a is executed.
- the learning device 200 learns using a learning signal including the sound signal of the channel in the direction of the target sound (that is, the mixed sound signal in the direction of the target sound).
- the learning signal may be generated by the processing execution unit 240 .
- the learning device 200 learns the difference between the target sound direction emphasized sound signal and the mixed sound signal in the target sound direction. Also, the learning device 200 learns the difference between the target sound direction masking sound signal and the target sound direction mixed sound signal. The learning device 200 learns that a signal having a large difference is the target sound signal. Thus, the learned model 112 is generated by the learning device 200 learning.
- the information processing apparatus 100 can output the target sound signal by using the learned model 112 obtained by learning.
- FIG. 10 is a block diagram showing functions of the information processing apparatus according to the third embodiment.
- the information processing device 100 further has a reliability calculation unit 191 .
- a part or all of the reliability calculation unit 191 may be implemented by a processing circuit. Also, part or all of the reliability calculation unit 191 may be realized as a program module executed by the processor 101 .
- the reliability calculation unit 191 calculates the reliability F i of the mask feature by a preset method.
- the reliability F i of the mask feature may also be called the reliability F i of the direction mask.
- the preset method is represented by formula (3) below. One variable in formula (3) indicates the angular range of the target sound direction; the other indicates the angular range of the directions in which sound occurs.
- the reliability F i is a matrix of the same size as the direction mask. The reliability F i may be input to the learning device 200.
- the target sound signal output unit 180 outputs the target sound signal using the reliability F i , the target sound direction emphasized sound signal, the target sound direction masking sound signal, and the trained model 112 .
- Encoder 112a performs the following processing in addition to the processing of the first embodiment.
- the encoder 112a calculates a time-frequency representation FT by multiplying the number of frequency bins F of the reliability F i by the number of frames T. The number of frequency bins F is the number of elements of the time-frequency representation in the frequency axis direction. The number of frames T is the number obtained by dividing the mixed sound signal into segments of a preset length.
- in subsequent processing, the time-frequency representation FT is treated as the mixed sound time-frequency representation of the second embodiment. If the target sound direction emphasized time-frequency representation and the time-frequency representation FT do not match, the encoder 112a performs conversion processing using a transformation matrix. Specifically, the encoder 112a converts the number of elements of the reliability F i in the frequency axis direction into the number of elements of the target sound direction emphasized time-frequency representation in the frequency axis direction.
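One simple realization of the element-count conversion along the frequency axis is a fixed linear-interpolation transformation matrix; this particular matrix is an illustrative assumption, since the disclosure only states that a transformation matrix is applied.

```python
import numpy as np

def resize_frequency_axis(rep, f_target):
    """Convert the number of frequency-axis elements of an (F, T) matrix to f_target."""
    f_src, _ = rep.shape
    # Build a linear-interpolation matrix W of shape (f_target, f_src)
    src_pos = np.linspace(0, f_src - 1, f_target)
    lo = np.floor(src_pos).astype(int)
    hi = np.minimum(lo + 1, f_src - 1)
    frac = src_pos - lo
    W = np.zeros((f_target, f_src))
    W[np.arange(f_target), lo] += 1 - frac
    W[np.arange(f_target), hi] += frac
    return W @ rep  # shape (f_target, T)

confidence = np.random.rand(129, 50)          # reliability Fi as an F x T matrix
resized = resize_frequency_axis(confidence, 257)  # match emphasized representation
print(resized.shape)
```

Each output row is a convex combination of input rows, so the resized matrix stays within the value range of the original reliability matrix.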
- the Separator 112b performs the same processing as the Separator 112b of the second embodiment when the target sound direction emphasized time-frequency representation and the time-frequency representation FT match. If they do not match, the Separator 112b integrates the reliability F i, whose number of frequency-axis elements has been converted, with the target sound direction emphasized time-frequency representation. For example, the Separator 112b performs the integration using the Attention method described in Non-Patent Document 3. The Separator 112b then estimates an “M dimensions × time” mask matrix based on the target sound direction emphasized time-frequency representation obtained by the integration and the target sound direction masking time-frequency representation.
- the processing of the decoder 112c is the same as in the first embodiment.
- the target sound signal output unit 180 uses the reliability F i , the target sound direction emphasized sound signal, the target sound direction masking sound signal, and the trained model 112 to output the target sound signal.
- FIG. 11 is a flowchart illustrating an example of processing executed by the information processing apparatus according to the third embodiment.
- the process of FIG. 11 differs from the process of FIG. 5 in that steps S15b and S17b are executed. Therefore, steps S15b and S17b are explained with reference to FIG. 11, and the description of the processes other than steps S15b and S17b is omitted.
- the reliability calculation unit 191 calculates the reliability F i of the mask feature quantity.
- the target sound signal output unit 180 uses the reliability F i , the target sound direction emphasized sound signal, the target sound direction masking sound signal, and the trained model 112 to output the target sound signal.
- the learning device 200 learns using the reliability F i when learning.
- the learning device 200 may learn using the reliability F i acquired from the information processing device 100 .
- the learning device 200 may learn using the reliability F i stored in the volatile memory device or non-volatile memory device of the learning device 200 .
- the learning device 200 uses the reliability F i to determine how much weight to give the target sound direction masking sound signal.
- the learned model 112 is generated by the learning device 200 learning for making the determination.
- the trained model 112 receives the target sound direction emphasized sound signal and the target sound direction masking sound signal.
- a target sound direction masking sound signal is generated based on the mask feature amount.
- the trained model 112 determines how much the target sound direction masking sound signal should be considered using the mask feature reliability F i .
- the trained model 112 outputs the target sound signal based on the determination. In this way, the information processing apparatus 100 can output a more appropriate target sound signal by inputting the reliability F i to the trained model 112 .
- FIG. 12 is a block diagram showing functions of the information processing apparatus according to the fourth embodiment.
- the information processing apparatus 100 further has a noise section detection unit 192.
- a part or all of the noise section detection unit 192 may be implemented by a processing circuit. A part or all of the noise section detection unit 192 may also be implemented as a program module executed by the processor 101.
- the noise section detection unit 192 detects a noise section based on the target sound direction emphasized sound signal.
- the noise section detection unit 192 uses the method described in Patent Document 2 when detecting a noise section.
- the noise section detection unit 192 detects a speech section based on the target sound direction emphasized sound signal, and then specifies the speech section by correcting the start time and the end time of the speech section.
- the noise section detection unit 192 detects a noise section by excluding the specified speech section from the section indicating the target sound direction emphasized sound signal.
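A minimal energy-based sketch of the flow above (detect a speech section, correct its start and end times by padding, and take the remaining frames as the noise section); the frame length, threshold, and padding below are illustrative assumptions, and the actual detection method is the one described in Patent Document 2.

```python
import numpy as np

def detect_noise_section(signal, frame_len=400, threshold_ratio=0.5, pad_frames=2):
    # Frame-wise energy
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    # Candidate speech frames: energy above a relative threshold
    speech = energy > threshold_ratio * energy.max()
    idx = np.flatnonzero(speech)
    if idx.size:
        # Correct (pad) the start time and end time of the speech section
        start = max(idx[0] - pad_frames, 0)
        end = min(idx[-1] + pad_frames, n_frames - 1)
        speech[start : end + 1] = True
    # Noise section = frames outside the specified speech section
    return ~speech

rng = np.random.default_rng(0)
sig = np.concatenate([0.01 * rng.standard_normal(4000),
                      rng.standard_normal(4000),   # loud "speech" burst
                      0.01 * rng.standard_normal(4000)])
noise_mask = detect_noise_section(sig)
print(noise_mask.sum())
```

The boolean mask marks the frames that would be treated as the noise section of the target sound direction emphasized sound signal.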
- the detected noise section may be input to the learning device 200 .
- the target sound signal output unit 180 uses the detected noise section, the target sound direction emphasized sound signal, the target sound direction masking sound signal, and the learned model 112 to output the target sound signal.
- the encoder 112a performs the following processing in addition to the processing of the first embodiment.
- the encoder 112a estimates a non-target sound time-frequency representation of “M dimensions × time” based on the signal corresponding to the noise section of the target sound direction emphasized sound signal.
- the Encoder 112a may estimate the power spectrum estimated by the STFT as a non-target sound time-frequency representation.
- the Encoder 112a may use a one-dimensional convolution operation to estimate the non-target sound time-frequency representation.
- the non-target sound time-frequency representation may be projected onto the same time-frequency representation space as the other representations or onto a different one. Such estimation is described, for example, in Non-Patent Document 1.
- the Separator 112b integrates the non-target sound time-frequency representation and the target sound direction-enhanced time-frequency representation. For example, the Separator 112b integrates using the Attention method described in Non-Patent Document 3.
- the Separator 112b estimates an “M dimensions × time” mask matrix based on the target sound direction emphasized time-frequency representation obtained by the integration and the target sound direction masking time-frequency representation.
- the Separator 112b can estimate the tendency of noise based on the non-target sound time-frequency representation.
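As a rough sketch, an Attention-style integration of two "M dimensions × time" representations can be written as scaled dot-product attention over the time axis; this generic formulation is an assumption and not necessarily the exact method of Non-Patent Document 3.

```python
import numpy as np

def attention_integrate(query_rep, key_value_rep):
    """Integrate two (M, T) representations: query_rep attends to key_value_rep
    over the time axis with scaled dot-product attention."""
    M, _ = query_rep.shape
    Q, K, V = query_rep.T, key_value_rep.T, key_value_rep.T  # (T, M)
    scores = Q @ K.T / np.sqrt(M)                # (T, T) similarity
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # rows sum to 1
    return (weights @ V).T                         # back to (M, T)

emph = np.random.rand(64, 100)        # target sound direction emphasized representation
non_target = np.random.rand(64, 100)  # non-target (noise-section) representation
integrated = attention_integrate(emph, non_target)
print(integrated.shape)
```

Because each output frame is a convex combination of frames of the attended representation, the integrated values stay within the attended representation's value range.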
- the processing of the decoder 112c is the same as in the first embodiment.
- FIG. 13 is a flowchart illustrating an example of processing executed by the information processing apparatus according to the fourth embodiment.
- the process of FIG. 13 differs from the process of FIG. 5 in that steps S16c and S17c are executed. Therefore, steps S16c and S17c are explained with reference to FIG. 13, and the description of the processes other than steps S16c and S17c is omitted.
- the noise section detection unit 192 detects a noise section, which is a section indicating noise, based on the target sound direction emphasized sound signal.
- the target sound signal output unit 180 uses the noise section, the target sound direction emphasized sound signal, the target sound direction masking sound signal, and the trained model 112 to output the target sound signal.
- the learning device 200 learns using noise intervals when performing learning.
- the learning device 200 may learn using the noise section acquired from the information processing device 100 .
- the learning device 200 may learn using the noise section detected by the processing execution unit 240 .
- the learning device 200 learns the tendency of noise based on the noise interval.
- the learning device 200 performs learning for outputting the target sound signal based on the target sound direction emphasized sound signal and the target sound direction masking sound signal, taking into account the tendency of the noise.
- the learned model 112 is generated by the learning device 200 learning.
- noise intervals are input to the trained model 112 .
- the trained model 112 estimates the tendency of noise contained in the target sound direction emphasized sound signal and the target sound direction masking sound signal based on the noise section.
- the trained model 112 outputs a target sound signal based on the target sound direction-enhanced sound signal and the target sound direction masking sound signal, taking into account the tendency of noise. Therefore, since the information processing apparatus 100 outputs the target sound signal in consideration of the tendency of noise, it is possible to output a more appropriate target sound signal.
- 100 information processing device, 101 processor, 102 volatile storage device, 103 non-volatile storage device, 111 sound source position information, 112 trained model, 120 acquisition unit, 130 sound feature extraction unit, 140 enhancement unit, 150 estimation unit, 160 mask feature extraction unit, 170 generation unit, 180 target sound signal output unit, 190 selection unit, 191 reliability calculation unit, 192 noise section detection unit, 200 learning device, 211 sound data storage unit, 212 impulse response storage unit, 213 noise storage unit, 220 impulse response application unit, 230 mixing unit, 240 processing execution unit, 250 learning unit.
Abstract
Description
FIG. 1 is a diagram showing an example of the target sound signal output system of the first embodiment. The target sound signal output system includes the information processing apparatus 100 and the learning device 200. The information processing apparatus 100 is a device that executes the output method. The information processing apparatus 100 outputs the target sound signal using a trained model. The trained model is generated by the learning device 200.
<Utilization phase>
A storage area secured in the volatile storage device 102 or the non-volatile storage device 103 is referred to as a storage unit.
FIG. 3 is a block diagram showing functions of the information processing apparatus of the first embodiment. The information processing apparatus 100 has an acquisition unit 120, a sound feature extraction unit 130, an enhancement unit 140, an estimation unit 150, a mask feature extraction unit 160, a generation unit 170, and a target sound signal output unit 180.
l indicates time and k indicates frequency. xlk indicates the sound feature corresponding to the sound signal obtained from the microphone closest to the sound source position of the target sound, which is specified based on the sound source position information 111. xlk may be regarded as an STFT spectrum. aθ,k indicates the steering vector for an angular direction θ. H denotes the conjugate transpose.
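The quantities above suggest forming a beamformed output of the form aθ,k^H xlk. Below is a minimal free-field sketch, assuming a uniform linear microphone array and treating xlk as a multichannel STFT vector; the array geometry, direction, and frequency are illustrative assumptions, not values from this disclosure.

```python
import numpy as np

def steering_vector(theta, freq_hz, mic_positions, c=343.0):
    """Free-field steering vector for direction theta (radians) at freq_hz.
    mic_positions: positions (meters) along a linear array."""
    delays = mic_positions * np.cos(theta) / c
    return np.exp(-2j * np.pi * freq_hz * delays)

mics = np.array([0.0, 0.04, 0.08, 0.12])   # 4-mic linear array, 4 cm spacing (assumed)
a = steering_vector(np.pi / 3, 1000.0, mics)

rng = np.random.default_rng(1)
x = rng.standard_normal(4) + 1j * rng.standard_normal(4)  # multichannel STFT bin x_lk
y = np.conj(a) @ x   # a^H x: output for this time-frequency bin
print(abs(y))
```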
The target sound direction emphasized sound signal and the target sound direction masking sound signal may be input to the learning device 200 as learning signals.
FIG. 5 is a flowchart showing an example of processing executed by the information processing apparatus of the first embodiment.
(Step S11) The acquisition unit 120 acquires the mixed sound signal.
(Step S12) The sound feature extraction unit 130 extracts a plurality of sound features based on the mixed sound signal.
(Step S13) The enhancement unit 140 emphasizes the sound feature in the target sound direction based on the sound source position information 111.
(Step S15) The mask feature extraction unit 160 extracts a mask feature based on the estimated target sound direction and the plurality of sound features.
(Step S16) The generation unit 170 generates the target sound direction emphasized sound signal based on the sound feature emphasized by the enhancement unit 140. The generation unit 170 also generates the target sound direction masking sound signal based on the mask feature.
(Step S17) The target sound signal output unit 180 outputs the target sound signal using the target sound direction emphasized sound signal, the target sound direction masking sound signal, and the trained model 112.
<Learning phase>
The learning phase describes an example of how the trained model 112 is generated.
FIG. 6 is a block diagram showing functions of the learning device of the first embodiment. The learning device 200 has a sound data storage unit 211, an impulse response storage unit 212, a noise storage unit 213, an impulse response application unit 220, a mixing unit 230, a processing execution unit 240, and a learning unit 250.
The learning signal may be one generated by the processing execution unit 240 or one generated by the information processing apparatus 100.
FIG. 7 is a flowchart showing an example of processing executed by the learning device of the first embodiment.
(Step S21) The impulse response application unit 220 convolves impulse response data with the target sound signal and the interfering sound signal.
(Step S22) The mixing unit 230 generates a mixed sound signal based on the sound signal output by the impulse response application unit 220 and a noise signal.
(Step S24) The learning unit 250 performs learning using the learning signal.
The trained model 112 is then generated by the learning device 200 repeating the learning.
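Steps S21 and S22 above (convolving impulse response data into the target and interfering signals, then mixing with noise) can be sketched as follows; the synthetic signals, exponentially decaying impulse responses, and noise level are illustrative assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)
target = rng.standard_normal(16000)        # dry target sound signal (assumed)
interference = rng.standard_normal(16000)  # dry interfering sound signal (assumed)
# Synthetic exponentially decaying impulse responses (assumed)
decay = np.exp(-np.arange(512) / 64.0)
rir_target = rng.standard_normal(512) * decay
rir_interf = rng.standard_normal(512) * decay

# Step S21: convolve impulse response data into each source signal
reverberant_t = fftconvolve(target, rir_target)[: len(target)]
reverberant_i = fftconvolve(interference, rir_interf)[: len(interference)]

# Step S22: generate the mixed sound signal by adding a noise signal
noise = 0.05 * rng.standard_normal(len(target))
mixed = reverberant_t + reverberant_i + noise
print(mixed.shape)
```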
Next, the second embodiment will be described, focusing mainly on matters that differ from the first embodiment; the description of matters common to the first embodiment is omitted.
A part or all of the selection unit 190 may be implemented by a processing circuit. A part or all of the selection unit 190 may also be implemented as a module of a program executed by the processor 101.
Here, the selected sound signal, the target sound direction emphasized sound signal, and the target sound direction masking sound signal may be input to the learning device 200 as learning signals.
In this way, the target sound signal output unit 180 outputs the target sound signal using the selected sound signal, the target sound direction emphasized sound signal, the target sound direction masking sound signal, and the trained model 112.
FIG. 9 is a flowchart showing an example of processing executed by the information processing apparatus of the second embodiment. The processing of FIG. 9 differs from that of FIG. 5 in that steps S11a and S17a are executed. Therefore, steps S11a and S17a are explained with reference to FIG. 9, and the description of the other processing is omitted.
(Step S17a) The target sound signal output unit 180 outputs the target sound signal using the selected sound signal, the target sound direction emphasized sound signal, the target sound direction masking sound signal, and the trained model 112.
Step S11a may be executed at any timing as long as it is executed before step S17a.
Next, the third embodiment will be described, focusing mainly on matters that differ from the first embodiment; the description of matters common to the first embodiment is omitted.
FIG. 10 is a block diagram showing functions of the information processing apparatus of the third embodiment. The information processing apparatus 100 further has a reliability calculation unit 191.
A part or all of the reliability calculation unit 191 may be implemented by a processing circuit. A part or all of the reliability calculation unit 191 may also be implemented as a module of a program executed by the processor 101.
The target sound signal output unit 180 outputs the target sound signal using the reliability Fi, the target sound direction emphasized sound signal, the target sound direction masking sound signal, and the trained model 112.
The Encoder 112a performs the following processing in addition to the processing of the first embodiment. The Encoder 112a calculates a time-frequency representation FT by multiplying the number of frequency bins F of the reliability Fi by the number of frames T. The number of frequency bins F is the number of elements of the time-frequency representation in the frequency axis direction. The number of frames T is the number obtained by dividing the mixed sound signal into segments of a preset length.
When the target sound direction emphasized time-frequency representation and the time-frequency representation FT do not match, the Separator 112b integrates the reliability Fi, whose number of frequency-axis elements has been converted, with the target sound direction emphasized time-frequency representation. For example, the Separator 112b performs the integration using the Attention method described in Non-Patent Document 3. The Separator 112b estimates an “M dimensions × time” mask matrix based on the target sound direction emphasized time-frequency representation obtained by the integration and the target sound direction masking time-frequency representation.
In this way, the target sound signal output unit 180 outputs the target sound signal using the reliability Fi, the target sound direction emphasized sound signal, the target sound direction masking sound signal, and the trained model 112.
FIG. 11 is a flowchart showing an example of processing executed by the information processing apparatus of the third embodiment. The processing of FIG. 11 differs from that of FIG. 5 in that steps S15b and S17b are executed. Therefore, steps S15b and S17b are explained with reference to FIG. 11, and the description of the other processing is omitted.
(Step S17b) The target sound signal output unit 180 outputs the target sound signal using the reliability Fi, the target sound direction emphasized sound signal, the target sound direction masking sound signal, and the trained model 112.
Next, the fourth embodiment will be described, focusing mainly on matters that differ from the first embodiment; the description of matters common to the first embodiment is omitted.
FIG. 12 is a block diagram showing functions of the information processing apparatus of the fourth embodiment. The information processing apparatus 100 further has a noise section detection unit 192.
The processing of the Decoder 112c is the same as in the first embodiment.
FIG. 13 is a flowchart showing an example of processing executed by the information processing apparatus of the fourth embodiment. The processing of FIG. 13 differs from that of FIG. 5 in that steps S16c and S17c are executed. Therefore, steps S16c and S17c are explained with reference to FIG. 13, and the description of the other processing is omitted.
(Step S17c) The target sound signal output unit 180 outputs the target sound signal using the noise section, the target sound direction emphasized sound signal, the target sound direction masking sound signal, and the trained model 112.
Claims (7)
- An information processing apparatus comprising: an acquisition unit that acquires sound source position information, which is position information of a sound source of a target sound, a mixed sound signal, which is a signal indicating a mixed sound including the target sound and an interfering sound, and a trained model;
a sound feature extraction unit that extracts a plurality of sound features based on the mixed sound signal;
an enhancement unit that, based on the sound source position information, emphasizes, among the plurality of sound features, a sound feature in a target sound direction, which is the direction of the target sound;
an estimation unit that estimates the target sound direction based on the plurality of sound features and the sound source position information;
a mask feature extraction unit that extracts, based on the estimated target sound direction and the plurality of sound features, a mask feature, which is a feature in a state in which the feature in the target sound direction is masked;
a generation unit that generates, based on the emphasized sound feature, a target sound direction emphasized sound signal, which is a sound signal in which the target sound direction is emphasized, and generates, based on the mask feature, a target sound direction masking sound signal, which is a sound signal in which the target sound direction is masked; and
a target sound signal output unit that outputs a target sound signal, which is a signal indicating the target sound, using the target sound direction emphasized sound signal, the target sound direction masking sound signal, and the trained model. - The information processing apparatus according to claim 1, further comprising a selection unit that selects a sound signal of a channel in the target sound direction using the mixed sound signal and the sound source position information,
wherein the target sound signal output unit outputs the target sound signal using the selected sound signal, the target sound direction emphasized sound signal, the target sound direction masking sound signal, and the trained model. - The information processing apparatus according to claim 1 or 2, further comprising a reliability calculation unit that calculates a reliability of the mask feature by a preset method,
wherein the target sound signal output unit outputs the target sound signal using the reliability, the target sound direction emphasized sound signal, the target sound direction masking sound signal, and the trained model. - The information processing apparatus according to any one of claims 1 to 3, wherein the mixed sound includes noise. - The information processing apparatus according to claim 4, further comprising a noise section detection unit that detects, based on the target sound direction emphasized sound signal, a noise section, which is a section indicating the noise,
wherein the target sound signal output unit outputs the target sound signal using the noise section, the target sound direction emphasized sound signal, the target sound direction masking sound signal, and the trained model. - An output method in which an information processing apparatus:
acquires sound source position information, which is position information of a sound source of a target sound, a mixed sound signal, which is a signal indicating a mixed sound including the target sound and an interfering sound, and a trained model;
extracts a plurality of sound features based on the mixed sound signal;
emphasizes, based on the sound source position information, a sound feature in a target sound direction, which is the direction of the target sound, among the plurality of sound features;
estimates the target sound direction based on the plurality of sound features and the sound source position information;
extracts, based on the estimated target sound direction and the plurality of sound features, a mask feature, which is a feature in a state in which the feature in the target sound direction is masked;
generates, based on the emphasized sound feature, a target sound direction emphasized sound signal, which is a sound signal in which the target sound direction is emphasized, and generates, based on the mask feature, a target sound direction masking sound signal, which is a sound signal in which the target sound direction is masked; and
outputs a target sound signal, which is a signal indicating the target sound, using the target sound direction emphasized sound signal, the target sound direction masking sound signal, and the trained model. - An output program that causes an information processing apparatus to execute processing of:
acquiring sound source position information, which is position information of a sound source of a target sound, a mixed sound signal, which is a signal indicating a mixed sound including the target sound and an interfering sound, and a trained model;
extracting a plurality of sound features based on the mixed sound signal;
emphasizing, based on the sound source position information, a sound feature in a target sound direction, which is the direction of the target sound, among the plurality of sound features;
estimating the target sound direction based on the plurality of sound features and the sound source position information;
extracting, based on the estimated target sound direction and the plurality of sound features, a mask feature, which is a feature in a state in which the feature in the target sound direction is masked;
generating, based on the emphasized sound feature, a target sound direction emphasized sound signal, which is a sound signal in which the target sound direction is emphasized, and generating, based on the mask feature, a target sound direction masking sound signal, which is a sound signal in which the target sound direction is masked; and
outputting a target sound signal, which is a signal indicating the target sound, using the target sound direction emphasized sound signal, the target sound direction masking sound signal, and the trained model.
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2021/014790 WO2022215199A1 (ja) | 2021-04-07 | 2021-04-07 | 情報処理装置、出力方法、及び出力プログラム |
JP2023512578A JP7270869B2 (ja) | 2021-04-07 | 2021-04-07 | 情報処理装置、出力方法、及び出力プログラム |
CN202180095532.6A CN116997961A (zh) | 2021-04-07 | 2021-04-07 | 信息处理装置、输出方法和输出程序 |
DE112021007013.4T DE112021007013T5 (de) | 2021-04-07 | 2021-04-07 | Informationsverarbeitungseinrichtung, ausgabeverfahren und ausgabeprogramm |
US18/239,289 US20230419980A1 (en) | 2021-04-07 | 2023-08-29 | Information processing device, and output method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2021/014790 WO2022215199A1 (ja) | 2021-04-07 | 2021-04-07 | 情報処理装置、出力方法、及び出力プログラム |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US18/239,289 Continuation US20230419980A1 (en) | 2021-04-07 | 2023-08-29 | Information processing device, and output method |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022215199A1 true WO2022215199A1 (ja) | 2022-10-13 |
Family
ID=83545327
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2021/014790 WO2022215199A1 (ja) | 2021-04-07 | 2021-04-07 | 情報処理装置、出力方法、及び出力プログラム |
Country Status (5)
Country | Link |
---|---|
US (1) | US20230419980A1 (ja) |
JP (1) | JP7270869B2 (ja) |
CN (1) | CN116997961A (ja) |
DE (1) | DE112021007013T5 (ja) |
WO (1) | WO2022215199A1 (ja) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2000047699A (ja) * | 1998-07-31 | 2000-02-18 | Toshiba Corp | 雑音抑圧処理装置および雑音抑圧処理方法 |
JP2003271191A (ja) * | 2002-03-15 | 2003-09-25 | Toshiba Corp | 音声認識用雑音抑圧装置及び方法、音声認識装置及び方法並びにプログラム |
JP2012058360A (ja) * | 2010-09-07 | 2012-03-22 | Sony Corp | 雑音除去装置および雑音除去方法 |
JP2018031910A (ja) * | 2016-08-25 | 2018-03-01 | 日本電信電話株式会社 | 音源強調学習装置、音源強調装置、音源強調学習方法、プログラム、信号処理学習装置 |
JP2019078864A (ja) * | 2017-10-24 | 2019-05-23 | 日本電信電話株式会社 | 楽音強調装置、畳み込みオートエンコーダ学習装置、楽音強調方法、プログラム |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5107956B2 (ja) | 2009-03-31 | 2012-12-26 | Kddi株式会社 | 雑音抑圧方法、装置およびプログラム |
JP6444490B2 (ja) | 2015-03-12 | 2018-12-26 | 三菱電機株式会社 | 音声区間検出装置および音声区間検出方法 |
- 2021
- 2021-04-07 JP JP2023512578A patent/JP7270869B2/ja active Active
- 2021-04-07 DE DE112021007013.4T patent/DE112021007013T5/de active Pending
- 2021-04-07 WO PCT/JP2021/014790 patent/WO2022215199A1/ja active Application Filing
- 2021-04-07 CN CN202180095532.6A patent/CN116997961A/zh active Pending
- 2023
- 2023-08-29 US US18/239,289 patent/US20230419980A1/en active Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2000047699A (ja) * | 1998-07-31 | 2000-02-18 | Toshiba Corp | 雑音抑圧処理装置および雑音抑圧処理方法 |
JP2003271191A (ja) * | 2002-03-15 | 2003-09-25 | Toshiba Corp | 音声認識用雑音抑圧装置及び方法、音声認識装置及び方法並びにプログラム |
JP2012058360A (ja) * | 2010-09-07 | 2012-03-22 | Sony Corp | 雑音除去装置および雑音除去方法 |
JP2018031910A (ja) * | 2016-08-25 | 2018-03-01 | 日本電信電話株式会社 | 音源強調学習装置、音源強調装置、音源強調学習方法、プログラム、信号処理学習装置 |
JP2019078864A (ja) * | 2017-10-24 | 2019-05-23 | 日本電信電話株式会社 | 楽音強調装置、畳み込みオートエンコーダ学習装置、楽音強調方法、プログラム |
Also Published As
Publication number | Publication date |
---|---|
JP7270869B2 (ja) | 2023-05-10 |
US20230419980A1 (en) | 2023-12-28 |
JPWO2022215199A1 (ja) | 2022-10-13 |
CN116997961A (zh) | 2023-11-03 |
DE112021007013T5 (de) | 2023-12-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9741360B1 (en) | Speech enhancement for target speakers | |
US11282505B2 (en) | Acoustic signal processing with neural network using amplitude, phase, and frequency | |
Zhao et al. | A two-stage algorithm for noisy and reverberant speech enhancement | |
Zhang et al. | Multi-channel multi-frame ADL-MVDR for target speech separation | |
Xu et al. | Generalized spatio-temporal RNN beamformer for target speech separation | |
JP6987075B2 (ja) | オーディオ源分離 | |
CN110610718B (zh) | 一种提取期望声源语音信号的方法及装置 | |
CN111866665B (zh) | 麦克风阵列波束形成方法及装置 | |
Nesta et al. | A flexible spatial blind source extraction framework for robust speech recognition in noisy environments | |
JP2024038369A (ja) | 深層フィルタを決定するための方法および装置 | |
Shankar et al. | Efficient two-microphone speech enhancement using basic recurrent neural network cell for hearing and hearing aids | |
Luo et al. | Implicit filter-and-sum network for multi-channel speech separation | |
Oo et al. | Phase and reverberation aware DNN for distant-talking speech enhancement | |
JP6815956B2 (ja) | フィルタ係数算出装置、その方法、及びプログラム | |
Lee et al. | Improved mask-based neural beamforming for multichannel speech enhancement by snapshot matching masking | |
JP7270869B2 (ja) | 情報処理装置、出力方法、及び出力プログラム | |
KR102316627B1 (ko) | 심화신경망 기반의 가상 채널 확장을 이용한 wpe 기반 잔향 제거 장치 | |
Wang et al. | Improving frame-online neural speech enhancement with overlapped-frame prediction | |
Dwivedi et al. | Spherical harmonics domain-based approach for source localization in presence of directional interference | |
CN114242104A (zh) | 语音降噪的方法、装置、设备及存储介质 | |
Kovalyov et al. | Dfsnet: A steerable neural beamformer invariant to microphone array configuration for real-time, low-latency speech enhancement | |
Li et al. | On loss functions for deep-learning based T60 estimation | |
Li et al. | Low complex accurate multi-source RTF estimation | |
Kuang et al. | Three-stage hybrid neural beamformer for multi-channel speech enhancement | |
JP2018191255A (ja) | 収音装置、その方法、及びプログラム |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21936003 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2023512578 Country of ref document: JP Kind code of ref document: A |
|
WWE | Wipo information: entry into national phase |
Ref document number: 202180095532.6 Country of ref document: CN |
|
WWE | Wipo information: entry into national phase |
Ref document number: 112021007013 Country of ref document: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 21936003 Country of ref document: EP Kind code of ref document: A1 |