US9640197B1 - Extraction of target speeches - Google Patents
- Publication number: US9640197B1
- Authority: US (United States)
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/028—Voice signal separating using properties of sound source
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L21/0264—Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
- G10L2021/02166—Microphone arrays; Beamforming
- G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being power information
Definitions
- This invention relates generally to an extraction of target speeches and, more particularly, to an extraction of target speeches from a plurality of speeches coming from different directions of arrival.
- ASR: automatic speech recognition
- Call-center monitoring is a good example.
- The agent's speech and the customer's speech on the telephone line are recorded separately by a logger and are also transcribed separately.
- The agent's speech is usually used for checking the agent's performance, while the customer's speech is mainly used to detect unhappy customers who should be brought to a supervisor's attention.
- The customer's speech may also be further analyzed for the customer's potential needs.
- Face-to-face conversations are often observed in sales situations or in automobiles.
- Sales conversations take place between an agent and a customer over a desk or a counter.
- In automobiles, conversations take place between a driver and a passenger while driving.
- an embodiment of the present invention provides a computer-implemented method for extracting target speeches from a plurality of speeches coming from different directions of arrival.
- The method comprises: obtaining speech signals from each of speech input devices disposed apart at predetermined distances from one another; for each pair of the speech input devices, calculating, based on the speech signals, a direction of arrival of target speeches and directions of arrival of other speeches other than the target speeches; for each pair of the speech input devices, calculating an aliasing metric, based on the direction of arrival of the target speeches and the directions of arrival of the other speeches, where the aliasing metric indicates which frequency band of speeches is susceptible to spatial aliasing; using an adaptive beamformer, enhancing the speech signals arriving from the direction of arrival of the target speeches, based on the speech signals and the direction of arrival of the target speeches, to generate enhanced speech signals; and reading a probability model which is the product of a first normal distribution and a second normal distribution, where the first normal distribution is a model which has learned features of clean speeches.
- a system such as a computer system comprising a computer readable storage medium storing a program of instructions executable by the system to perform one or more methods described herein may be provided.
- A computer program product comprising a computer readable storage medium storing a program of instructions executable by a system to perform one or more methods described herein also may be provided.
- FIG. 1 illustrates an exemplified basic block diagram of a computer hardware used in an embodiment of the present invention.
- FIG. 2 illustrates examples of microphones placed between two speakers, according to an embodiment of the present invention.
- FIGS. 3A to 3D illustrate one embodiment of a flowchart of an overall process for extracting target speeches from a plurality of speeches coming from different directions of arrival, according to an embodiment of the present invention.
- FIG. 4A illustrates one embodiment of a block diagram of the system, according to an embodiment of the present invention.
- FIG. 4B illustrates another embodiment of a block diagram of the system, according to an embodiment of the present invention.
- FIG. 5 illustrates an example of an aliasing metric, according to an embodiment of the present invention.
- FIG. 6 illustrates experimental results according to one embodiment of the present invention.
- FIG. 1 illustrates an exemplified basic block diagram of a computer hardware used in an embodiment of the present invention.
- a computer ( 101 ) may be, for example, but is not limited to, a desktop, a laptop, a notebook, a tablet or a server computer.
- the server computer may be, for example, but is not limited to, a workstation, a rack-mount type server, a blade type server, or a mainframe server and may run, for example, a hypervisor for creating and running one or more virtual machines.
- the computer ( 101 ) may comprise one or more CPUs ( 102 ) and a main memory ( 103 ) connected to a bus ( 104 ).
- the CPU ( 102 ) may be preferably based on a 32-bit or 64-bit architecture.
- the CPU ( 102 ) may be, for example, but is not limited to, the Power® series of International Business Machines Corporation; the Core i™ series, the Core 2™ series, the Atom™ series, the Xeon™ series, the Pentium® series, or the Celeron® series of Intel Corporation; or the Phenom™ series, the Athlon™ series, the Turion™ series, or Sempron™ of Advanced Micro Devices, Inc.
- (Power is a registered trademark of International Business Machines Corporation in the United States, other countries, or both; “Core i”, “Core 2”, “Atom”, and “Xeon” are trademarks, and “Pentium” and “Celeron” are registered trademarks of Intel Corporation in the United States, other countries, or both; “Phenom”, “Athlon”, “Turion”, and “Sempron” are trademarks of Advanced Micro Devices, Inc. in the United States, other countries, or both.)
- a display ( 106 ) such as a liquid crystal display (LCD) may be connected to the bus ( 104 ) via a display controller ( 105 ).
- the display ( 106 ) may be used to display, for management of the computer(s), information on a computer connected to a network via a communication line and information on software running on the computer using an appropriate graphics interface.
- a disk ( 108 ) such as a hard disk or a solid state drive, SSD, and a drive ( 109 ) such as a CD, a DVD, or a BD (Blu-ray disk) drive may be connected to the bus ( 104 ) via an SATA or IDE controller ( 107 ).
- a keyboard ( 111 ) and a mouse ( 112 ) may be connected to the bus ( 104 ) via a keyboard-mouse controller ( 110 ) or USB bus (not shown).
- An operating system, programs providing a Windows®, UNIX®, Mac OS®, Linux®, or Java® processing environment (such as J2EE®), Java® applications, a Java® virtual machine (VM), a Java® just-in-time (JIT) compiler, other programs, and any data may be stored in the disk ( 108 ) so as to be loadable to the main memory.
- Windows is a registered trademark of Microsoft corporation in the United States, other countries, or both;
- UNIX is a registered trademark of the Open Group in the United States, other countries, or both;
- Mac OS is a registered trademark of Apple Inc. in the United States, other countries, or both;
- Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both;
- Java and “J2EE” are registered trademarks of Oracle America, Inc. in the United States, other countries, or both.
- the drive ( 109 ) may be used to install a program, such as the computer program of an embodiment of the present invention, readable from a CD-ROM, a DVD-ROM, or a BD to the disk ( 108 ) or to load any data readable from a CD-ROM, a DVD-ROM, or a BD into the main memory ( 103 ) or the disk ( 108 ), if necessary.
- a communication interface ( 114 ) may be based on, for example, but is not limited to, the Ethernet® protocol.
- the communication interface ( 114 ) may be connected to the bus ( 104 ) via a communication controller ( 113 ), physically connects the computer ( 101 ) to a communication line ( 115 ), and may provide a network interface layer to the TCP/IP communication protocol of a communication function of the operating system of the computer ( 101 ).
- the communication line ( 115 ) may be a wired LAN environment or a wireless LAN environment based on wireless LAN connectivity standards, for example, but is not limited to, IEEE® 802.11a/b/g/n (“IEEE” is a registered trademark of Institute of Electrical and Electronics Engineers, Inc. in the United States, other countries, or both).
- An embodiment of the present invention will be described with reference to the following FIGS. 2 to 7 .
- The idea of an embodiment of the present invention is based on an extension of the post-filtering approach, in a probabilistic framework integrating the aliasing metric and a speech model.
- FIG. 2 illustrates two examples of microphones which were placed between two speakers.
- FIG. 2 illustrates two scenarios: the upper part ( 201 ) shows that two speech input devices are installed, and the lower part ( 231 ) shows that three or more speech input devices are installed.
- the speech input device may be, for example, a microphone.
- In the following, a microphone is used as an example of the speech input device, but this does not mean that the speech input device is limited to a microphone.
- two microphones ( 221 - 1 , 221 - 2 ) are placed between a target speaker ( 211 ) and an interfering speaker ( 212 ).
- the target speaker may be, for example, but not limited to, an agent in a company.
- the interfering speaker ( 212 ) may be, for example, but not limited to a customer of the agent.
- three or more microphones are placed between a target speaker ( 241 ) and an interfering speaker ( 242 ).
- Suitable microphone intervals may be determined in a manner similar to that mentioned above.
- FIGS. 3A to 3D illustrate one embodiment of a flowchart of an overall process for extracting target speeches from a plurality of speeches coming from different directions of arrival.
- FIG. 3A illustrates one embodiment of a flowchart of an overall process.
- FIG. 3B illustrates a detail of the steps 306 to 308 described in FIG. 3A .
- FIG. 3C illustrates a detail of the steps 309 to 310 described in FIG. 3A .
- FIG. 3D illustrates a detail of the step 311 described in FIG. 3A .
- A system such as the computer ( 101 ) performs each of the steps described in FIGS. 3A to 3D .
- the system may be implemented as a single computer or plural computers.
- In step 301 , the system starts the process mentioned above.
- In step 302 , the system obtains speech signals from each of the speech input devices, which are disposed apart at predetermined distances from one another.
- In step 303 , the system performs a discrete Fourier transform (DFT) on the obtained speech signals to obtain a complex spectrum.
- In step 304 , for each of all possible pairs of the speech input devices, the system calculates, based on the obtained speech signals, a direction of arrival (DOA) of the target speeches and directions of arrival of the other speeches other than the target speeches.
- In step 305 , the system may calculate, based on the obtained speech signals, a channel correlation metric which represents a degree of correlation between the speech input devices, a cross-spectrum-based metric between the speech input devices, or a combination of these.
- The channel correlation can be calculated using any method known in the art, for example using the cross-power spectrum phase (CSP) analysis. If there are three or more speech input devices, the correlation metrics for all pairs of the speech input devices are averaged and then used in a post filter in step 307 or input to the probability model described in step 312 .
- The cross-spectrum-based metric can be calculated using any method known in the art.
- For example, the cross-spectrum-based metric is calculated as the transfer function in the following non-patent literature: Zelinski, “A microphone array with adaptive post-filtering for noise reduction in reverberant rooms,” Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2578-2581, 1988.
- In step 306 , the system enhances, based on the speech signals and the direction of arrival of the target speeches, the speech signals arriving from that direction, using an adaptive beamformer, to generate the enhanced speech signals.
- The output of step 306 may be used in step 307 for obtaining a power spectrum from the output, directly used in step 308 (see step 322 ) for applying a filter bank to the output, or directly input to the probability model described in step 312 (see step 321 ).
- In step 307 , the system may obtain a power spectrum from the enhanced speech signals, using post filtering, for example, Zelinski's post-filter.
- The output of step 307 may be used in step 308 for applying a filter bank to the output, or directly input to the probability model described in step 312 (see step 323 ).
- In step 308 , the system may apply a filter bank, for example, the Mel-filter bank, to the power spectrum to obtain a log power spectrum, for example, a log-Mel power spectrum.
- The output of the filter bank may be further logarithmically converted.
- In step 309 , for each pair of the speech input devices, the system calculates an aliasing metric, based on the direction of arrival of the target speeches and the directions of arrival of the other speeches.
- The aliasing metric indicates which frequency band of speeches is susceptible to spatial aliasing. If there are three or more speech input devices, the calculated aliasing metrics are averaged and then processed using a filter bank in step 310 , or directly input to the probability model described in step 312 (see step 331 ).
- The output of step 309 may be used in step 310 , in which a filter bank is applied to obtain a filtered version of the aliasing metric, or directly input to the probability model described in step 312 (see step 331 ).
- In step 310 , the system may apply a filter bank, for example, the Mel-filter bank, to the aliasing metric to obtain a filtered version of the aliasing metric, for example, the Mel-filtered version of the aliasing metric.
- In step 311 , the system may apply a filter bank, for example, the Mel-filter bank, to the cross-spectrum-based metric to obtain a filtered version of the cross-spectrum-based metric, for example, the Mel-filtered version of the cross-spectrum-based metric.
- Step 311 must be performed when Zelinski's post-filter is used as the post filtering in step 307 .
- In step 312 , the system reads, into a memory, a probability model which is the product of a first normal distribution and a second normal distribution.
- the first normal distribution is a model which has learned features of clean speeches.
- The second normal distribution is a model which has its mean in the probability distribution of the enhanced speech signals. The details of the first normal distribution and the second normal distribution will be explained below by referring to FIGS. 4A and 4B .
- In step 313 , the system inputs the enhanced speech signals and the aliasing metric into the probability model to output the target speeches.
- the second normal distribution is made so as to have a variance smaller than that of the first normal distribution in a case where the aliasing metric is close to zero.
- the second normal distribution may be made so as to have a variance smaller than that of the first normal distribution in a case where the channel correlation metric or the cross-spectrum-based metric is close to one.
- the second normal distribution may be made so as to have a variance larger than that of the first normal distribution in a case where the aliasing metric is close to one.
- the second normal distribution may be made so as to have a variance larger than that of the first normal distribution in a case where the channel correlation metric or the cross-spectrum-based metric is close to zero.
- In this way, a probability model having natural continuity in each of the frequency bands of the speech can be realized.
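- The role of the two distributions can be illustrated with the closed-form mean of a product of two one-dimensional Gaussians. The linear variance law and the `var_scale` factor below are assumptions made for illustration; the patent states only that the second variance is smaller than the first when the metric is near zero and larger when it is near one.

```python
def combine_band(mu_prior, var_prior, mu_enh, e_d, var_scale=10.0, eps=1e-6):
    """Illustrative sketch of the factorial idea (steps 312-313) in one band d.

    The first normal distribution N(mu_prior, var_prior) models clean speech;
    the second has its mean at the enhanced signal mu_enh and a variance that
    grows with the filtered aliasing metric e_d (assumed linear law).
    The product of two Gaussians has this closed-form mean."""
    var_enh = var_scale * var_prior * e_d + eps
    return (mu_prior * var_enh + mu_enh * var_prior) / (var_prior + var_enh)

# metric near zero: the output follows the enhanced (beamformed) signal
out_reliable = combine_band(0.0, 1.0, 2.0, e_d=0.0)
# metric near one: the output falls back toward the clean-speech prior
out_aliased = combine_band(0.0, 1.0, 2.0, e_d=1.0)
```

With these toy numbers the output tracks the enhanced signal when the band is reliable and shrinks toward the prior mean when the band is susceptible to aliasing, which is the behavior described above.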
- In step 314 , the system judges whether the currently processed time-frame is the last frame. If the judgment is positive, the system proceeds to the final step 315 . Otherwise, the system proceeds back to step 302 and repeats steps 302 to 314 .
- In step 315 , the system terminates the process mentioned above.
- Steps 306 to 308 and steps 309 and 310 can be performed simultaneously or in parallel.
- Steps 306 to 308, steps 309 and 310, and step 311 can be performed simultaneously or in parallel.
- FIGS. 4A and 4B illustrate embodiments of a block diagram of the system.
- FIG. 4A and FIG. 4B each describes a system according to an embodiment of the present invention.
- Each of the systems ( 401 , 402 ) can be used for extracting target speeches from a plurality of speeches coming from different directions of arrival.
- Each of the systems ( 401 , 402 ) may be the computer ( 101 ) described in FIG. 1 .
- the system ( 401 ) comprises discrete Fourier transform (DFT) sections ( 491 , 492 , . . . , 493 ), a directions of arrival (DOA) & Channel correlation (CC) calculation section ( 411 ), an aliasing metric section ( 412 ), a filter bank section ( 413 ), a minimum variance beamformer (MVBF) section ( 414 ), a post filter section ( 415 ), a filter bank section ( 416 ), a factorial modeling section ( 417 ) and an ASR or logger section ( 418 ).
- the system ( 402 ) comprises the common sections ( 491 , 492 , . . . , 493 and 412 to 418 ) as described in FIG. 4A .
- the system ( 402 ) further comprises a DOA & Transfer function (TF) calculation section ( 421 ) instead of DOA & CC calculation section ( 411 ) and further comprises an additional filter bank section ( 422 ).
- Hereinafter, each of the common sections ( 491 , 492 , . . . , 493 and 412 to 418 ) which are commonly comprised in each of the systems ( 401 , 402 ), the DOA & CC calculation section ( 411 ) which is comprised in the system ( 401 ), and the DOA & TF calculation section ( 421 ) and the additional filter bank section ( 422 ) which are comprised in the system ( 402 ) will be explained.
- Each of the common sections ( 491 , 492 , . . . , 493 and 412 to 418 ), the DOA and CC calculation section ( 411 ), and the DOA & TF calculation section ( 421 ) and the additional filter bank section ( 422 ) may perform the steps described in FIG. 3A , as mentioned below.
- the discrete Fourier transform (DFT) sections ( 491 , 492 , . . . , 493 ) may perform the steps 302 and 303 .
- The DOA & CC calculation section ( 411 ) may perform step 304 and may calculate a channel correlation metric as described in step 305 .
- The DOA & TF calculation section ( 421 ) may calculate a cross-spectrum-based metric as described in step 305 .
- the minimum variance beamformer (MVBF) section ( 414 ) may perform step 306 .
- the post filter section ( 415 ) may perform the step 307 .
- The filter bank section ( 416 ) may perform step 308 .
- the aliasing metric section ( 412 ) may perform the step 309 .
- the filter bank section ( 413 ) may perform step 310 .
- the filter bank section ( 422 ) may perform step 311 .
- the factorial modeling section ( 417 ) may perform the steps 312 and 313 .
- Hereinafter, each section ( 412 to 418 and 491 to 493 ), the DOA & CC calculation section ( 411 ), the DOA & TF calculation section ( 421 ), and the additional filter bank section ( 422 ) will be described.
- plural microphones ( 481 , 482 , . . . , 483 ) are disposed apart in predetermined distances from one another between a target speaker and an interfering speaker.
- Each of the microphones ( 481 , 482 , . . . , 483 ) receives speech signals from the target speaker and the interfering speaker.
- Each of the microphones ( 481 , 482 , . . . , 483 ) transmits the speech signals, s m,T , to the system ( 401 ).
- m denotes the microphone index (the m-th microphone); and
- T denotes time-frame number index.
- the speech signal, s m,T may be a time domain signal in one frame at m-th microphone for all m.
- Each of the DFT sections ( 491 , 492 , . . . , 493 ) may receive speech signals, s m,T , from the corresponding microphones ( 481 , 482 , . . . , 483 ).
- the number of DFT sections ( 491 , 492 , . . . , 493 ) may correspond to those of the microphones ( 481 , 482 , . . . , 483 ).
- Each of the DFT sections ( 491 , 492 , . . . , 493 ) then performs a discrete Fourier transform (DFT) on the speech signals, s m,T , at the m-th microphone to obtain a complex spectrum, S m,T .
- the complex spectrum, S m,T can be expressed as S m,T (n).
- the complex spectrum, S m,T (n) can be observed in the m-th microphone at the time-frame T in n-th DFT bin.
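- The per-frame DFT of steps 302 and 303 can be sketched as follows. The Hamming window, the 400-sample frame, and the 512-point FFT are illustrative choices; the patent does not fix these parameters.

```python
import numpy as np

def complex_spectra(frames, n_fft=512):
    """Steps 302-303: per-frame DFT of the signals from all microphones.

    frames has shape (n_mics, frame_len); row m is s_{m,T}, the time-domain
    signal of the m-th microphone in the current frame T.  Returns the
    complex spectra S_{m,T}(n), one half-spectrum per microphone."""
    window = np.hamming(frames.shape[1])   # illustrative window choice
    return np.fft.rfft(frames * window, n=n_fft, axis=1)

# two microphones, one 400-sample frame each
rng = np.random.default_rng(0)
frames = rng.standard_normal((2, 400))
S = complex_spectra(frames)   # one row of DFT bins per microphone
```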
- Each of the DFT sections ( 491 , 492 , . . . , 493 ) may transmit the complex spectrum, S m,T , to the DOA & CC calculation section ( 411 ) or the DOA & TF calculation section ( 421 ).
- The DOA & CC calculation section ( 411 ) or the DOA & TF calculation section ( 421 ) each may estimate the DOA and calculate a gain for a post filter, using, for example, a CSP analysis.
- The DOA & CC calculation section ( 411 ) or the DOA & TF calculation section ( 421 ) carries out the CSP analysis for each complex spectrum, S m,T .
- The DOA & CC calculation section ( 411 ) or the DOA & TF calculation section ( 421 ) may calculate, for each frame, a CSP coefficient in order to estimate directions of arrival (DOA) and calculate a gain for a post filter.
- The CSP coefficient φ may be calculated for all possible microphone pairs (l, m), according to the following equation (1):
- φ_T,l,m(i) = IDFT[ W_T(n) · S_l,T(n) · S*_m,T(n) / ( |S_l,T(n)| · |S_m,T(n)| ) ]   (1)
- φ_T(i) denotes a CSP coefficient;
- i denotes a time-domain index;
- W_T(n) denotes a weight for each DFT bin;
- n denotes the DFT bin number; and
- * denotes a complex conjugate.
- φ_T(i) = IDFT[ W_T(n) · S_1,T(n) · S*_2,T(n) / ( |S_1,T(n)| · |S_2,T(n)| ) ]   (1a)
- The CSP coefficient is a representation of the cross-power spectrum phase analysis in the time domain and denotes a correlation coefficient corresponding to a delay of i samples.
- The CSP coefficient, φ_T, may be a moving average over a few frames back and forth in order to obtain a stable expression.
- The CSP coefficient, φ_T, may be given as φ_T(î_T), which is the CSP-target, i.e., the CSP coefficient for the direction of the target speaker.
- W_T(n) is normally set to one when the weight is not used, as in a normal CSP analysis.
- A weighted CSP, i.e., an arbitrary weight value, may be used as W_T(n).
- the weighted CSP can be calculated, for example, according to an embodiment of the invention described in the U.S. Pat. No. 8,712,770.
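- Equation (1) can be sketched as below: the IDFT of the magnitude-normalized (whitened) cross-spectrum peaks at the inter-channel delay. The synthetic two-channel example and the small `eps` regularizer are illustrative assumptions.

```python
import numpy as np

def csp_coefficient(S_l, S_m, W=None, eps=1e-12):
    """CSP coefficient phi(i) per equation (1): the IDFT of the whitened
    cross-spectrum of two channels, optionally weighted by W(n).
    phi(i) peaks at the delay, in samples, between the two channels."""
    cross = S_l * np.conj(S_m)
    cross = cross / (np.abs(S_l) * np.abs(S_m) + eps)  # phase-only whitening
    if W is not None:
        cross = cross * W
    return np.fft.irfft(cross)

# synthetic check: channel 1 lags channel 2 by 5 samples (circular shift)
rng = np.random.default_rng(0)
x = rng.standard_normal(512)
S1 = np.fft.rfft(x)
S2 = np.fft.rfft(np.roll(x, -5))
phi = csp_coefficient(S1, S2)   # sharp peak at lag i = 5
```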
- the target speaker direction, î T corresponds to a direction of arrival of target speeches.
- a range where the target speaker may exist is limited to either of a left or right side.
- The interfering speaker direction, ĩ_T, corresponds to the directions of arrival of the other speeches other than the target speeches.
- a range where the interfering speaker may exist is limited to opposite side of the target speaker.
- A DOA index, î_T, of the target speaker can be estimated, according to the following equation (2), as the point which gives a peak on the side of the target speaker.
- The DOA index, î_T, of the target speaker may be calculated for each of all possible pairs of the microphones.
- î_T = argmax_i( φ_T(i) ),  0 ≤ i ≤ i_max   (2)
- A DOA index, ĩ_T, of the interfering speaker can be estimated in a manner similar to that used for estimating the DOA index, î_T, of the target speaker.
- The DOA index, ĩ_T, of the interfering speaker may be calculated for each of all possible pairs of the microphones.
- The DOA index, î_T, can be used as the DOA in the MVBF section ( 414 ) and, therefore, will be passed to the MVBF section ( 414 ).
- The DOA indexes, î_T and ĩ_T, can be used in the aliasing metric section ( 412 ) and, therefore, will be passed to the aliasing metric section ( 412 ).
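- The peak picking of equation (2) and its counterpart for the interfering speaker can be sketched as follows. Assigning the target to non-negative lags and the interferer to negative lags (stored at the end of the IDFT output) is an assumption made for illustration; the patent only requires that the two speakers lie on opposite sides.

```python
import numpy as np

def doa_indices(phi, i_max):
    """Equation (2): the target DOA index is the peak of phi(i) on the
    target side (0 <= i <= i_max); the interfering speaker is searched on
    the opposite side, i.e., the negative lags wrapped to the end of phi."""
    i_hat = int(np.argmax(phi[:i_max + 1]))          # target side
    start = len(phi) - i_max
    i_tilde = int(np.argmax(phi[start:])) + start    # opposite side
    return i_hat, i_tilde

phi = np.zeros(512)
phi[7] = 1.0          # target peak at lag +7
phi[512 - 3] = 0.8    # interferer peak at lag -3 (wrapped index 509)
i_hat, i_tilde = doa_indices(phi, i_max=16)
```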
- The DOA & CC calculation section ( 411 ) may calculate a channel correlation metric, v_T, for all possible microphone pairs (l, m).
- the channel correlation metric represents a degree of correlation between the microphones.
- The channel correlation metric, v, can be calculated according to the following equation (3) when the number of microphones is three or more.
- In that case, the CSP-target is set to an average of the φ_l,m(î_l,m) which are calculated for all possible microphone pairs (l, m):
- v = max( 0, avg over all pairs (l, m) of φ_l,m(î_l,m) )   (3)
- where v denotes the channel correlation metric calculated over all possible microphone pairs (l, m). The suffix of the frame number, T, is omitted in equation (3).
- When the number of microphones is two, the channel correlation metric can be calculated according to the following equation (3a):
- v_T = max( 0, φ_T(î_T) )   (3a)
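- Equations (3) and (3a) reduce to a simple average-then-clip, sketched below. Passing a one-element list gives the two-microphone case (3a); a longer list gives the averaged multi-pair case (3).

```python
import numpy as np

def channel_correlation(csp_targets):
    """Channel correlation metric per equations (3)/(3a): for two
    microphones there is a single CSP-target value phi_T(i_hat_T); with
    three or more microphones, the CSP-target values of all microphone
    pairs are averaged first.  The result is clipped at zero."""
    return max(0.0, float(np.mean(csp_targets)))

v2 = channel_correlation([0.6])              # two microphones (eq. 3a)
v3 = channel_correlation([0.9, 0.3, -0.3])   # three mics -> three pairs (eq. 3)
```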
- the aliasing metric section ( 412 ) may calculate an aliasing metric, E T , based on the direction of arrival of the target speeches and the directions of arrival of the other speeches.
- The aliasing metric can be calculated according to the following equations (4) and (5) when the number of microphones is three or more. In that case, an average of the aliasing metric over all possible microphone pairs (l, m) is used:
- E_l,m(n) = cos( 2πn( î_l,m − ĩ_l,m ) / N )   (4)
- E(n) = avg over all pairs (l, m) of E_l,m(n)   (5)
- where N denotes the total number of DFT bins;
- î_l,m denotes a DOA index for the target speaker when seen from the microphone pair (l, m); and ĩ_l,m denotes a DOA index for the interfering speaker when seen from the microphone pair (l, m).
- The suffix of the frame number, T, is omitted in equations (4) and (5).
- aliasing metric is shown by the dashed line.
- the vertical axis denotes the aliasing metric E(n) and the horizontal axis denotes the DFT bin number. This indicates lower-confidence regions are observed at regular intervals in the frequency depending on the directions of the interfering-speaker and the target-speaker.
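Equation (4) can be computed directly; the averaging form of equation (5) and the toy DOA indexes below are assumptions.

```python
import numpy as np

def aliasing_metric(i_hats, j_hats, n_bins):
    """Aliasing metric E(n) of equations (4)-(5).

    For each microphone pair, E_{l,m}(n) = cos(2*pi*n*(i_hat - j_hat)/N)
    dips at regular frequency intervals determined by the difference of
    the target and interferer DOA indexes; the per-pair curves are then
    averaged over all pairs.
    """
    n = np.arange(n_bins)
    curves = [np.cos(2.0 * np.pi * n * (i - j) / n_bins)
              for i, j in zip(i_hats, j_hats)]
    return np.mean(curves, axis=0)

# one pair with DOA indexes 5 (target) and 2 (interferer), 256 DFT bins
E = aliasing_metric([5], [2], n_bins=256)
```

E(0) is always 1, and the cosine dips toward -1 periodically in frequency, which produces the regularly spaced low-confidence regions described above.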
- the aliasing metric, E(n), will be passed to the filter bank section ( 413 ) in order to carry out filter bank processing; d denotes an index of the filter bank.
- the filter bank section ( 413 ) may calculate a filtering version, e d , of the aliasing metric, E(n), using a filter bank, for example, a Mel filter bank.
- the aliasing metric, E(n), is thereby reduced to a lower-dimensional signal, e d .
- the filtering version, e d , of the aliasing metric can be calculated according to the following equation (6).
- the filtering version, e d , may be a Mel-band-pass filtered version of the aliasing metric.
- e d = Σ n max(0, E(n))·B d,n / Σ n′ B d,n′ (6)
- B d,n is a distribution of the d-th filter in the n-th bin.
- the output, e d , will be passed to the factorial modeling section ( 417 ).
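Equation (6) is a clipped, normalized filter-bank average; the toy rectangular two-band filter bank below is an assumption used for illustration in place of a real Mel bank.

```python
import numpy as np

def filterbank_reduce(E, B):
    """Equation (6): e_d = sum_n max(0, E(n)) B_{d,n} / sum_n' B_{d,n'}.

    B is a (D, N) filter-bank matrix (a Mel filter bank in the text);
    clipping E(n) at zero keeps each band value e_d in [0, 1].
    """
    E_pos = np.maximum(0.0, np.asarray(E, dtype=float))
    B = np.asarray(B, dtype=float)
    return (B @ E_pos) / (B.sum(axis=1) + 1e-12)

# toy 2-band, 8-bin filter bank (not a real Mel bank)
B = np.array([[1., 1., 1., 1., 0., 0., 0., 0.],
              [0., 0., 0., 0., 1., 1., 1., 1.]])
E = np.array([1., 0.5, -1., -1., 1., 1., 1., 1.])
e = filterbank_reduce(E, B)
```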
- the MVBF section ( 414 ) enhances the speech signals arriving from the direction of arrival of the target speech signals, based on the speech signals and the direction of arrival of the target speeches, to generate the enhanced speech signals.
- the MVBF section ( 414 ) receives the DOA index, î T , and then carries out the MVBF in order to obtain an output of the adaptive beamformer, U T .
- the MVBF minimizes ambient noise while maintaining a constant gain in the target direction.
- the output of the adaptive beamformer, U T is a power spectrum.
- the MVBF is described, for example, by the following non-patent literature, F. Asano, H. Asoh, and T. Matsui: “Sound source localization and separation in near field”, IEICE Trans., E83-A, No. 11, pp. 2286-2294, 2000.
- the power spectrum, U T will be passed to the post-filter section ( 415 ).
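A minimal sketch of the minimum variance beamformer weight computation for a single frequency bin. The steering vector and covariance below are toy assumptions; the patent's derivation of the steering vector from the DOA index î T is not reproduced here.

```python
import numpy as np

def mvdr_weights(R, d):
    """Weights of a minimum variance beamformer for one frequency bin.

    Minimizes the output power w^H R w subject to a distortionless
    constraint w^H d = 1 toward the target steering vector d:
        w = R^{-1} d / (d^H R^{-1} d).
    """
    r_inv_d = np.linalg.solve(R, d)
    return r_inv_d / (np.conj(d) @ r_inv_d)

# toy 2-microphone steering vector and spatial covariance (assumed values)
d = np.array([1.0, np.exp(-1j * 0.7)])
R = np.eye(2) + 0.5 * np.outer(d, d.conj())
w = mvdr_weights(R, d)
gain = np.conj(w) @ d   # distortionless constraint: unit gain on target
```

The unit-gain check illustrates "constant gain in the target direction": whatever arrives from the steering direction passes unchanged, while power from other directions is minimized.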
- the post-filter section ( 415 ) carries out post-filtering processing on the power spectrum, U T , in order to obtain an output of the post-filter, Y T .
- the output can be calculated according to the following equation (7), where v T denotes a channel correlation metric.
- Y T (n) = v T ·U T (n) (7)
- in another embodiment, the power spectrum, Y T , can be filtered per spectral bin, as Zelinski's post-filter does.
- this alternative embodiment applies only to the system ( 402 ) described in FIG. 4B .
- Zelinski's post-filter is described, for example, in the following non-patent literature: R. Zelinski, "A microphone array with adaptive post-filtering for noise reduction in reverberant rooms," Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2578-2581, 1988. In this embodiment, a filter value is obtained in each frequency band.
- the transfer function H of Zelinski's post-filter is represented as the following equation (8).
- φ̂ i,j (n) is a smoothed auto- or cross-spectral density between the microphones on channels i and j for DFT bin n; and ℜ{·} is a function for extracting the real part of a complex number.
- the suffix of the frame number, T, is omitted in equation (8).
- φ̂ i,j is calculated as a local average around frame T, according to the following equation (9).
- î T is the DOA index for the target speaker, determined by CSP analysis, and M is the DFT size.
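The steps above can be sketched for two channels. The exact form of equation (8) (real cross-spectrum over the mean auto-spectrum) and the recursive smoothing constant alpha are assumptions standing in for the patent's local average of equation (9); the phase compensation follows equations (10)-(11) and the clipping follows equation (12).

```python
import numpy as np

def zelinski_postfilter(S1, S2, i_hat, M, alpha=0.8):
    """Two-channel Zelinski-style post-filter gain H(n) per frame and bin.

    The cross-spectrum is phase-compensated toward the target DOA, both
    spectral densities are smoothed over frames, and the real part of
    the cross term is divided by the mean auto-spectral density.
    S1, S2: complex STFTs of shape (frames, bins); M: DFT size.
    """
    n = np.arange(S1.shape[1])
    tau = 2.0 * np.pi * i_hat * n / M            # equation (11)
    cross = S1 * np.conj(S2 * np.exp(1j * tau))  # equation (10)
    auto = 0.5 * (np.abs(S1) ** 2 + np.abs(S2) ** 2)
    phi_c = np.zeros_like(cross)
    phi_a = np.zeros_like(auto)
    for t in range(S1.shape[0]):                 # smoothing over frames
        prev_c = phi_c[t - 1] if t > 0 else 0.0
        prev_a = phi_a[t - 1] if t > 0 else 0.0
        phi_c[t] = alpha * prev_c + (1.0 - alpha) * cross[t]
        phi_a[t] = alpha * prev_a + (1.0 - alpha) * auto[t]
    return np.maximum(np.real(phi_c) / (phi_a + 1e-12), 0.0)  # equation (12)

# perfectly coherent, zero-delay channels give unit gain everywhere
S = np.ones((4, 8), dtype=complex)
H = zelinski_postfilter(S, S, i_hat=0, M=16)
```

The filtered power spectrum then follows equation (13): Y T (n) = H T ′(n)·U T (n).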
- the output, Y T , from the post-filter section ( 415 ) will be passed to the filter bank section ( 416 ) in order to carry out a filter bank processing.
- the filter bank section ( 416 ) may calculate a filtering version of the output, Y T , using a filter bank, for example, a Mel filter bank, and the output is logarithmically converted to obtain y t .
- the obtained y t is a log power spectrum, for example, a log Mel-power spectrum.
- the obtained y is pre-processed with gain adaptation so as to maximize the total likelihood of the utterance, because a Gaussian mixture model (GMM) in the log-Mel spectrum domain depends on the input gain.
- the obtained y t from the filter bank section ( 416 ) will be passed to the factorial modeling section ( 417 ).
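A sketch of the log-Mel conversion described above. A scalar input gain g multiplies the power spectrum and therefore adds log g to every log-Mel band, which is why the GMM is gain dependent; the floor value is an assumption, and the likelihood-maximizing gain search itself is not shown.

```python
import numpy as np

def log_mel(power_spectrum, B, floor=1e-10):
    """Log Mel-power spectrum y_t from a linear power spectrum.

    B is a (D, N) Mel filter-bank matrix; the band energies are floored
    before the logarithm so the features stay finite.
    """
    mel = np.asarray(B) @ np.asarray(power_spectrum)
    return np.log(np.maximum(mel, floor))

# identity "filter bank" over 2 bins, just to show the log conversion
y = log_mel(np.array([1.0, np.e]), np.eye(2))
```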
- the DOA & TF calculation section ( 421 ) calculates a cross-spectrum-based metric, H T , for all the possible microphone pairs (l, m).
- the cross-spectrum-based metric, H T can be calculated according to the equation (8) mentioned above.
- H T is the same as H(n); the bin index (n) is omitted here.
- the filter bank ( 420 ) may calculate a filtering version of the output, H T , using a filter bank, for example, a Mel filter bank. Accordingly, the obtained cross-spectrum-based metric, v T , is a Mel-filtered version of H T .
- the output, v T , from the filter bank ( 420 ) can be calculated according to the following equation (14).
- v d = max(0, Σ n H(n)·B d,n / Σ n′ B d,n′ ) (14)
- H(n) is calculated by the equation (8) mentioned above, and B d,n is a distribution of the d-th filter in the n-th bin.
- the factorial modeling section ( 417 ) is one key feature of an embodiment of the present invention.
- a factorial model comprising two factors is introduced.
- the factorial model is a probability model which is the product of a first normal distribution and a second normal distribution.
- the factorial model is represented as the following equation (15).
- p(z|y,e,v) ∝ p(z|y)·p(z|e,v) (15)
- the suffix of the frame number, T, is omitted in equation (15).
- the first normal distribution is represented as a model, p(z|y), which has learned features of clean speeches. The clean speeches may be obtained in a quiet room.
- p(z|y) may be a probabilistic distribution of the estimated clean speech z based on the output y from the filter bank section ( 416 ).
- p(z|y) is trained in advance as a Gaussian mixture model, using clean speech data ( 471 ).
- the second normal distribution is represented as a model, p(z|e,v).
- the second normal distribution model is a model having a mean in the probability distribution of the enhanced speech signals.
- the second normal distribution model may be probabilistic distribution of estimated clean speech z based on the confidence metric calculated with the filtering version, e, of the aliasing metric and the channel correlation metric, v.
- the second normal distribution model is designed as a set of Gaussian distribution each associated with the components of the first normal distribution model.
- the second normal distribution model has a higher probability of z at the current y. Its variance is designed to be small when the confidence metric is high, and large when the confidence metric is low. As a result, the product distribution is shifted more toward the model-based value when the confidence is low, and more toward y (pass-through) when the confidence is high. Further, bands with higher confidence contribute more to the total probability.
- the product model, p(z|e,v,y), can be a Gaussian mixture model (GMM), because the product of two Gaussian distributions, i.e. the first normal distribution and the second normal distribution, is also a Gaussian distribution.
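Since the product of two Gaussians is again Gaussian, the per-component combination of equations (24)-(25) can be illustrated in the scalar (diagonal-covariance) case; this is a sketch with assumed toy values, not the patent's implementation.

```python
def gaussian_product(mu_x, var_x, y, psi):
    """Product of two scalar (or per-band diagonal) Gaussians.

    N(z; mu_x, var_x) * N(z; y, psi) is Gaussian with, as in equations
    (24)-(25):
        var' = (1/var_x + 1/psi)^-1
        mu'  = var' * (mu_x/var_x + y/psi)
    A small psi (high confidence) pulls mu' toward the observation y;
    a large psi pulls it back toward the clean-speech mean mu_x.
    """
    var_p = 1.0 / (1.0 / var_x + 1.0 / psi)
    mu_p = var_p * (mu_x / var_x + y / psi)
    return mu_p, var_p

mu_hi, _ = gaussian_product(0.0, 1.0, 4.0, 1e-6)  # confident: near y = 4
mu_lo, _ = gaussian_product(0.0, 1.0, 4.0, 1e6)   # unconfident: near mu_x = 0
```

This is exactly the pass-through behavior described above: the confidence-controlled variance psi decides whether the product mean follows the observation or the clean-speech model.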
- the first normal distribution model is given as the following equation (16).
- p(z|y) = Σ k γ k (y)·N(z; μ x,k , Σ x,k ) (16)
- k denotes each index in the mixture of normal distributions; N denotes a normal distribution; μ denotes a mean vector; and Σ denotes a variance-covariance matrix, for which a diagonal covariance matrix may be used.
- π k , μ x,k and Σ x,k are given for each k-th Gaussian.
- γ k (y) is a posterior probability that the k-th normal distribution is selected when y is observed.
- the posterior probability, γ k (y), is given as the following equation (17).
- γ k (y) = π k ·N(y; μ x,k , Σ x,k ) / Σ k′ π k′ ·N(y; μ x,k′ , Σ x,k′ ) (17)
- π k is the prior probability of the clean speech.
- ψ is created by scaling each component in the variance-covariance matrix of the clean speech model.
- the scaling factor is set to a smaller value when the aliasing metric, e, has a value closer to zero, or when the channel correlation metric, v, or the cross-spectrum-based metric is close to one.
- the variance, ψ, is designed as a scaled version of Σ x .
- the scaling is performed with the parameters e, v, or a combination of these.
- the variance, ψ, can be calculated according to the following equations (19), (20), (21) and (22), by scaling the k-th Gaussian at the d-th band in the speech model.
- ψ k,d = Σ x,k,d ·β·(e d + (1 − v) + ε) (19)
- ψ k,d = Σ x,k,d ·β·(1 − √(v(1 − e d )) + ε) (20)
- ψ k,d = Σ x,k,d ·β·(e d + ε) (21)
- ψ k,d = Σ x,k,d ·β·(1 − (1 + exp(−α(e d − γ)))^−1) (22)
- in equations (19), (20), (21) and (22), α, β and γ each denote a constant, and ε is a very small value used to avoid zero.
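The variance scaling can be sketched using the variant of equation (19); the values of beta and eps below are placeholder constants, not the patent's tuned parameters.

```python
def confidence_scaled_variance(sigma_x, e_d, v, beta=1.0, eps=1e-4):
    """Variance psi_{k,d} of the second Gaussian, per equation (19):
    psi = Sigma_{x,k,d} * beta * (e_d + (1 - v) + eps).

    psi shrinks when the aliasing metric e_d is near zero and the
    channel correlation metric v is near one (the high-confidence case
    in the text), so the product distribution stays close to the
    observed y in those bands; otherwise psi grows and the model-based
    value dominates.
    """
    return sigma_x * beta * (e_d + (1.0 - v) + eps)
```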
- Z k is a normalization constant for setting the integral of the probability distribution to one.
- ⁇ x,k is the mean of the clean speech model
- ⁇ x,k is the variance of the clean speech model.
- the mean, ⁇ x,k , and the variance, ⁇ x,k are given in advance.
- the posterior probability, γ k (y), of the k-th normal distribution is expanded to γ k ′(y,e,v).
- the expanded posterior probability, γ k ′(y,e,v), is given as the following equation (26).
- γ k ′(y,e,v) = π k ·N(y; μ z,k ′, Σ z,k ′) / Σ k′ π k′ ·N(y; μ z,k′ ′, Σ z,k′ ′) (26)
- π k is the prior probability of the clean speech model.
- the prior probability, π k , is given in advance.
- the variances, ⁇ z,k ′, used for the posterior probability, ⁇ ′ becomes smaller than the original variance, ⁇ x,k , for the d-th band in a case where the aliasing metric, e, for the d-th frequency band has a value closer to zero or the channel correlation metric, v, or a cross-spectrum-based metric is close to one.
- the aliasing metric, e, for the d-th frequency band has a value closer to zero or the channel correlation metric, v, or a cross-spectrum-based metric is close to one
- the variance, ⁇ k,d becomes smaller and the variance, ⁇ ′ z,k,d , becomes smaller than the original variance, ⁇ x,k,d ⁇ 1 .
- e,v,y), is shifted from the model-estimated value toward the y from the filter bank section ( 416 ) in a case where the aliasing metric, e, has a value closer to zero or the channel correlation metric, v, or a cross-spectrum-based metric is close to one. This is because the second normal distribution is a model with a higher probability around z y.
- the aliasing metric, e for the d-th frequency band has a value closer to one or the channel correlation metric, v, or a cross-spectrum-based metric is close to zero.
- the d-th band Gaussian has larger variance, thus contribution of such frequency band becomes low in the estimation of the posterior probability.
- the average vector, ⁇ z,k,d ′ shifts to ⁇ x,k,d , and the distribution of the product probability, p(z
- the final estimated output, ẑ, from the factorial modeling section ( 417 ) can be obtained using minimum mean square error (MMSE) estimation.
- the final estimated output, ẑ, can be calculated according to the following equation (27).
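A sketch of how equations (26)-(27) combine into the MMSE output. Evaluating the mixture weights at y with the product-Gaussian parameters follows the notation of equation (26); the two-component toy model values are assumptions.

```python
import numpy as np

def log_gauss_diag(y, mu, var):
    """Log density of a diagonal-covariance Gaussian, summed over bands."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (y - mu) ** 2 / var,
                         axis=-1)

def mmse_estimate(y, priors, mu_z, var_z):
    """Equations (26)-(27): expanded posteriors gamma'_k and the MMSE
    output z_hat = sum_k gamma'_k * mu'_{z,k}."""
    log_w = np.log(priors) + log_gauss_diag(y, mu_z, var_z)
    w = np.exp(log_w - log_w.max())   # log-sum-exp stabilization
    gamma = w / w.sum()               # gamma'_k(y, e, v), equation (26)
    return gamma @ mu_z               # equation (27)

# symmetric two-component toy model: equal posteriors, estimate lands at 0
y = np.array([0.0])
z_hat = mmse_estimate(y, np.array([0.5, 0.5]),
                      np.array([[-1.0], [1.0]]), np.ones((2, 1)))
```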
- the final estimated output, ẑ, will be passed to the ASR or logger section ( 418 ).
- the ASR section ( 418 ) may output the final estimated output, ẑ, as a recognized result of the speech.
- the logger section ( 418 ) may store the final estimated output, ẑ, into a storage, such as the disk ( 108 ) described in FIG. 1 .
- FIG. 6 illustrates experimental results according to one embodiment of the present invention.
- two omni-directional microphones were placed on the table between two subject speakers, A and B.
- the distance between the microphones was 12 cm.
- the beamformer operated at a 22.05-kHz sampling frequency.
- the two subject speakers alternately read 100 sentences written in Japanese, and the speeches were recorded. Using the recorded speeches, the mixed speech data was generated as the evaluation data.
- the mixed speech data simulates the simultaneous utterance between the two subject speakers.
- part of the speech segments obtained from subject speaker A was extracted, scaled by 50%, and then superimposed continuously onto the speech segments obtained from subject speaker B.
- the obtained hundred utterances were used as targets for the ASR.
- the speeches after the superposition were input to the adaptive beamformer and the post-filter and, after the processing, the utterance splitting was performed.
- Table 1 ( 601 ) shows the Character Error Rate (CER) %. The speech recognition accuracy was evaluated by the CER.
- Case 1 is a baseline of the evaluation, as a reference. Cases 2 to 4 are comparative examples. Case 5 is the Example according to an embodiment of the present invention.
- Case 1 was a baseline using the single microphone nearest to the subject speaker. The result of the CER, 62.1%, is very high.
- Case 2 is the simple MVBF system. It showed much improvement for the mixed speech, but little for the alternating speech. The MVBF achieved some speech separation in the mixed speech segments, but it did not sufficiently suppress the interfering speaker's speech. The effect of the MVBF was observed, but the result of the CER, 39.7%, is still high.
- Case 3 uses Zelinski's post-filter, applied on top of case 2. The effect of the post-filter was observed, but the resulting CER, 20.8%, is still not sufficient.
- Case 4: the output of case 3 was completely replaced with the estimated value of a clean speech model, p(z|y).
- Case 5 was the system performing factorial modeling, according to an embodiment of the present invention.
- the output of case 3 is set to v.
- the output of case 3 was partially replaced, amending the data having a low degree of reliability with the data having high reliability.
- the CER was further reduced compared to case 4.
- the present invention may be a method, a system, and/or a computer program product.
- the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
- the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
- the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
- a non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
- a computer readable storage medium is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
- Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
- the network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers.
- a network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
- Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages.
- the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
- the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
- electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
- These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
- These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
- the computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
- each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
- the functions noted in the block may occur out of the order noted in the figures.
- two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
Description

Equations referenced in the description:

- î T = argmax(φ T (i)), 0 < i < i max (2)
- v T = max(0, φ T (î T )) (3a)
- E l,m (n) = cos(2π·n·(î l,m − ĵ l,m )/N) (4)
- Y T (n) = v T ·U T (n) (7)
- φ (T),1,2 (n) = S (T),1 (n)·{S (T),2 (n)·e^{iτ}}* (10)
- τ = 2π·î T ·n/M (11)
- H T ′(n) = max(H T (n), 0.0) (12)
- Y T (n) = H T ′(n)·U T (n) (13)
- p(z|y,e,v) ∝ p(z|y)·p(z|e,v) (15)
- p(z|e,v) = N(z; y, ψ(e,v)) (18)
- ψ k,d = Σ x,k,d ·β·(e d + (1 − v) + ε) (19)
- ψ k,d = Σ x,k,d ·β·(1 − √(v(1 − e d )) + ε) (20)
- ψ k,d = Σ x,k,d ·β·(e d + ε) (21)
- ψ k,d = Σ x,k,d ·β·(1 − (1 + exp(−α(e d − γ)))^−1) (22)
- μ z,k ′ = Σ z,k ′·(Σ x,k ^−1·μ x,k + ψ k ^−1·y) (24)
- Σ z,k ′ = (Σ x,k ^−1 + ψ k ^−1)^−1 (25)
Claims (20)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/077,523 US9640197B1 (en) | 2016-03-22 | 2016-03-22 | Extraction of target speeches |
US15/440,773 US9818428B2 (en) | 2016-03-22 | 2017-02-23 | Extraction of target speeches |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/077,523 US9640197B1 (en) | 2016-03-22 | 2016-03-22 | Extraction of target speeches |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/440,773 Continuation US9818428B2 (en) | 2016-03-22 | 2017-02-23 | Extraction of target speeches |
Publications (1)
Publication Number | Publication Date |
---|---|
US9640197B1 true US9640197B1 (en) | 2017-05-02 |
Family
ID=58629323
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/077,523 Expired - Fee Related US9640197B1 (en) | 2016-03-22 | 2016-03-22 | Extraction of target speeches |
US15/440,773 Expired - Fee Related US9818428B2 (en) | 2016-03-22 | 2017-02-23 | Extraction of target speeches |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/440,773 Expired - Fee Related US9818428B2 (en) | 2016-03-22 | 2017-02-23 | Extraction of target speeches |
Country Status (1)
Country | Link |
---|---|
US (2) | US9640197B1 (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10999132B1 (en) * | 2017-10-25 | 2021-05-04 | Amazon Technologies, Inc. | Detecting degraded network monitoring agents |
CN108899044B (en) * | 2018-07-27 | 2020-06-26 | 苏州思必驰信息科技有限公司 | Voice signal processing method and device |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6339758B1 (en) * | 1998-07-31 | 2002-01-15 | Kabushiki Kaisha Toshiba | Noise suppress processing apparatus and method |
US20120140948A1 (en) | 2010-07-02 | 2012-06-07 | Panasonic Corporation | Directional microphone device and directivity control method |
US8503697B2 (en) | 2009-03-25 | 2013-08-06 | Kabushiki Kaisha Toshiba | Pickup signal processing apparatus, method, and program product |
US8712770B2 (en) | 2007-04-27 | 2014-04-29 | Nuance Communications, Inc. | Method, preprocessor, speech recognition system, and program product for extracting target speech by removing noise |
US8762137B2 (en) | 2009-11-30 | 2014-06-24 | International Business Machines Corporation | Target voice extraction method, apparatus and program product |
Non-Patent Citations (4)
Title |
---|
Delikaris-Manias, S. et al., "Cross Pattern Coherence Algorithm for Spatial Filtering Applications Utilizing Microphone Arrays" IEEE Transactions on Audio, Speech, and Language Processing vol. 21, No. 11, Nov. 2013. (pp. 2356-2367). |
Dmour, M.A., "Mixture of Beamformers for Speech Separation and Extraction" Engineering thesis and dissertation collection, Oct. 2010. (pp. 1-211). |
Himawan, I. et al., "Microphone Array Beamforming Approach to Blind Speech Separation" Machine Learning for Multimodal Interaction, vol. 4892, Jun. 2007. (pp. 295-305). |
Maganti, H.K. et al., "Speech Enhancement and Recognition in Meetings With an Audio-Visual Sensor Array" IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, No. 8, Nov. 2007. (pp. 2257-2269). |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2566755A (en) * | 2017-09-25 | 2019-03-27 | Cirrus Logic Int Semiconductor Ltd | Talker change detection |
US10580411B2 (en) | 2017-09-25 | 2020-03-03 | Cirrus Logic, Inc. | Talker change detection |
GB2566755B (en) * | 2017-09-25 | 2021-04-14 | Cirrus Logic Int Semiconductor Ltd | Talker change detection |
US11894008B2 (en) * | 2017-12-12 | 2024-02-06 | Sony Corporation | Signal processing apparatus, training apparatus, and method |
US11295740B2 (en) * | 2019-08-22 | 2022-04-05 | Beijing Xiaomi Intelligent Technology Co., Ltd. | Voice signal response method, electronic device, storage medium and system |
CN110931036A (en) * | 2019-12-07 | 2020-03-27 | 杭州国芯科技股份有限公司 | Microphone array beam forming method |
US11443748B2 (en) * | 2020-03-03 | 2022-09-13 | International Business Machines Corporation | Metric learning of speaker diarization |
US11651767B2 (en) | 2020-03-03 | 2023-05-16 | International Business Machines Corporation | Metric learning of speaker diarization |
CN113095258A (en) * | 2021-04-20 | 2021-07-09 | 深圳力维智联技术有限公司 | Directional signal extraction method, system, device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
US20170278524A1 (en) | 2017-09-28 |
US9818428B2 (en) | 2017-11-14 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:FUKUDA, TAKASHI;ICHIKAWA, OSAMU;REEL/FRAME:038071/0987 Effective date: 20160311 |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
FEPP | Fee payment procedure |
Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
LAPS | Lapse for failure to pay maintenance fees |
Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
STCH | Information on status: patent discontinuation |
Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
|
FP | Lapsed due to failure to pay maintenance fee |
Effective date: 20210502 |