US9640197B1 - Extraction of target speeches - Google Patents

Extraction of target speeches

Info

Publication number
US9640197B1
Authority
US
United States
Prior art keywords
speeches
metric
speech signals
speech
arrival
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
US15/077,523
Inventor
Takashi Fukuda
Osamu Ichikawa
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp filed Critical International Business Machines Corp
Priority to US15/077,523 priority Critical patent/US9640197B1/en
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION reassignment INTERNATIONAL BUSINESS MACHINES CORPORATION ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: FUKUDA, TAKASHI, ICHIKAWA, OSAMU
Priority to US15/440,773 priority patent/US9818428B2/en
Application granted granted Critical
Publication of US9640197B1 publication Critical patent/US9640197B1/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L21/028 Voice signal separating using properties of sound source
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information

Definitions

  • This invention relates generally to an extraction of target speeches and, more particularly, to an extraction of target speeches from a plurality of speeches coming from different directions of arrival.
  • ASR Automatic speech recognition
  • Call-center monitoring is a good example.
  • the agent's speech and the customer's speech on the telephone line are recorded separately by a logger and also transcribed separately.
  • the agent's speech is usually used for checking the agent's performance, while the customer's speech is mainly used to detect unhappy customers who should be brought to a supervisor's attention.
  • the customer's speech may also be further analyzed for the customer's potential needs.
  • Face-to-face conversations often occur in sales settings or in automobiles.
  • in sales, conversations take place between an agent and a customer over a desk or a counter.
  • in automobiles, conversations take place between a driver and a passenger while driving.
  • an embodiment of the present invention provides a computer-implemented method for extracting target speeches from a plurality of speeches coming from different directions of arrival.
  • the method comprises obtaining speech signals from each of speech input devices disposed apart in predetermined distances from one another; for each pair of the speech input devices, calculating, based on the speech signals, a direction of arrival of target speeches and directions of arrival of other speeches other than the target speeches; for each pair of the speech input devices, calculating an aliasing metric, based on the direction of arrival of the target speeches and the directions of arrival of the other speeches, where the aliasing metric indicates which frequency band of speeches is susceptible to spatial aliasing; using an adaptive beamformer, enhancing the speech signals arrived from the direction of arrival of the target speech signals, based on the speech signals and the direction of arrival of the target speeches, to generate the enhanced speech signals; reading a probability model which is the product of a first normal distribution and a second normal distribution, where the first normal distribution is a model which has learned features of clean speeches and the second normal distribution is a model which has a mean in the probability distribution of the enhanced speech signals and is made so as to have a variance smaller than that of the first normal distribution in a case where the aliasing metric is close to zero; and inputting the enhanced speech signals and the aliasing metric to the probability model to output target speeches.
  • a system such as a computer system comprising a computer readable storage medium storing a program of instructions executable by the system to perform one or more methods described herein may be provided.
  • a computer program product comprising a computer readable storage medium storing a program of instructions executable by the system to perform one or more methods described herein also may be provided.
  • FIG. 1 illustrates an exemplified basic block diagram of a computer hardware used in an embodiment of the present invention.
  • FIG. 2 illustrates examples of microphones placed between two speakers, according to an embodiment of the present invention.
  • FIGS. 3A to 3D illustrate one embodiment of a flowchart of an overall process for extracting target speeches from a plurality of speeches coming from different directions of arrival, according to an embodiment of the present invention.
  • FIG. 4A illustrates one embodiment of a block diagram of the system, according to an embodiment of the present invention.
  • FIG. 4B illustrates another embodiment of a block diagram of the system, according to an embodiment of the present invention.
  • FIG. 5 illustrates an example of an aliasing metric, according to an embodiment of the present invention.
  • FIG. 6 illustrates experimental results according to one embodiment of the present invention.
  • FIG. 1 illustrates an exemplified basic block diagram of a computer hardware used in an embodiment of the present invention.
  • a computer ( 101 ) may be, for example, but is not limited to, a desktop, a laptop, a notebook, a tablet or a server computer.
  • the server computer may be, for example, but is not limited to, a workstation, a rack-mount type server, a blade type server, or a mainframe server and may run, for example, a hypervisor for creating and running one or more virtual machines.
  • the computer ( 101 ) may comprise one or more CPUs ( 102 ) and a main memory ( 103 ) connected to a bus ( 104 ).
  • the CPU ( 102 ) may be preferably based on a 32-bit or 64-bit architecture.
  • the CPU ( 102 ) may be, for example, but is not limited to, the Power® series of International Business Machines Corporation; the Core i™ series, the Core 2™ series, the Atom™ series, the Xeon™ series, the Pentium® series, or the Celeron® series of Intel Corporation; or the Phenom™ series, the Athlon™ series, the Turion™ series, or Sempron™ of Advanced Micro Devices, Inc.
  • (“Power” is a registered trademark of International Business Machines Corporation in the United States, other countries, or both; “Core i”, “Core 2”, “Atom”, and “Xeon” are trademarks, and “Pentium” and “Celeron” are registered trademarks of Intel Corporation in the United States, other countries, or both; “Phenom”, “Athlon”, “Turion”, and “Sempron” are trademarks of Advanced Micro Devices, Inc. in the United States, other countries, or both).
  • a display ( 106 ) such as a liquid crystal display (LCD) may be connected to the bus ( 104 ) via a display controller ( 105 ).
  • the display ( 106 ) may be used to display, for management of the computer(s), information on a computer connected to a network via a communication line and information on software running on the computer using an appropriate graphics interface.
  • a disk ( 108 ) such as a hard disk or a solid state drive, SSD, and a drive ( 109 ) such as a CD, a DVD, or a BD (Blu-ray disk) drive may be connected to the bus ( 104 ) via an SATA or IDE controller ( 107 ).
  • a keyboard ( 111 ) and a mouse ( 112 ) may be connected to the bus ( 104 ) via a keyboard-mouse controller ( 110 ) or USB bus (not shown).
  • An operating system, programs providing Windows®, UNIX®, Mac OS®, Linux®, or a Java® processing environment, Java® applications, a Java® virtual machine (VM), and a Java® just-in-time (JIT) compiler, such as J2EE®, other programs, and any data may be stored in the disk ( 108 ) to be loadable to the main memory.
  • Windows is a registered trademark of Microsoft Corporation in the United States, other countries, or both;
  • UNIX is a registered trademark of the Open Group in the United States, other countries, or both;
  • Mac OS is a registered trademark of Apple Inc. in the United States, other countries, or both;
  • Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both;
  • Java and “J2EE” are registered trademarks of Oracle America, Inc. in the United States, other countries, or both.
  • the drive ( 109 ) may be used to install a program, such as the computer program of an embodiment of the present invention, readable from a CD-ROM, a DVD-ROM, or a BD to the disk ( 108 ) or to load any data readable from a CD-ROM, a DVD-ROM, or a BD into the main memory ( 103 ) or the disk ( 108 ), if necessary.
  • a program such as the computer program of an embodiment of the present invention
  • a communication interface ( 114 ) may be based on, for example, but is not limited to, the Ethernet® protocol.
  • the communication interface ( 114 ) may be connected to the bus ( 104 ) via a communication controller ( 113 ), physically connects the computer ( 101 ) to a communication line ( 115 ), and may provide a network interface layer to the TCP/IP communication protocol of a communication function of the operating system of the computer ( 101 ).
  • the communication line ( 115 ) may be a wired LAN environment or a wireless LAN environment based on wireless LAN connectivity standards, for example, but is not limited to, IEEE® 802.11a/b/g/n (“IEEE” is a registered trademark of Institute of Electrical and Electronics Engineers, Inc. in the United States, other countries, or both).
  • An embodiment of the present invention will be described with reference to the following FIGS. 2 to 7 .
  • the idea of an embodiment of the present invention is based on an extension of the post-filtering approach in a probabilistic framework integrating the aliasing metric and a speech model.
  • FIG. 2 illustrates two examples of microphones which were placed between two speakers.
  • FIG. 2 illustrates two scenarios: the upper part ( 201 ) shows that two speech input devices are installed, and the lower part ( 231 ) shows that three or more speech input devices are installed.
  • the speech input device may be, for example, a microphone.
  • a microphone is used instead of the speech input device, but this does not mean that the speech input device is limited to a microphone.
  • two microphones ( 221 - 1 , 221 - 2 ) are placed between a target speaker ( 211 ) and an interfering speaker ( 212 ).
  • the target speaker may be, for example, but not limited to, an agent in a company.
  • the interfering speaker ( 212 ) may be, for example, but not limited to a customer of the agent.
  • three or more microphones are placed between a target speaker ( 241 ) and an interfering speaker ( 242 ).
  • suitable distances between the microphones, i.e. the microphone intervals, may be determined in a manner similar to that mentioned above.
  • FIGS. 3A to 3D illustrate one embodiment of a flowchart of an overall process for extracting target speeches from a plurality of speeches coming from different directions of arrival.
  • FIG. 3A illustrates one embodiment of a flowchart of an overall process.
  • FIG. 3B illustrates a detail of the steps 306 to 308 described in FIG. 3A .
  • FIG. 3C illustrates a detail of the steps 309 to 310 described in FIG. 3A .
  • FIG. 3D illustrates a detail of the step 311 described in FIG. 3A .
  • a system such as the computer ( 101 ) performs each of the steps described in FIGS. 3A to 3D .
  • the system may be implemented as a single computer or plural computers.
  • In step 301 , the system starts the process mentioned above.
  • In step 302 , the system obtains speech signals from each of the speech input devices disposed apart in predetermined distances from one another.
  • In step 303 , the system performs a discrete Fourier transform (DFT) on the obtained speech signals to obtain a complex spectrum.
  • DFT discrete Fourier transform
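  • For illustration, a minimal Python/NumPy sketch of steps 302 and 303 is given below; the frame length, hop size, and window are illustrative choices that are not specified in the text above.

```python
import numpy as np

def frame_spectra(signals, frame_len=1024, hop=512):
    """Steps 302-303 (sketch): obtain the complex spectrum S_{m,T}(n) for every
    microphone channel m and time-frame T from synchronously sampled signals.

    signals: array of shape (num_mics, num_samples) with the time-domain signals.
    Returns an array of shape (num_mics, num_frames, frame_len) of complex spectra.
    """
    window = np.hanning(frame_len)
    num_mics, num_samples = signals.shape
    num_frames = 1 + (num_samples - frame_len) // hop
    spectra = np.empty((num_mics, num_frames, frame_len), dtype=complex)
    for m in range(num_mics):
        for t in range(num_frames):
            frame = signals[m, t * hop : t * hop + frame_len] * window
            spectra[m, t] = np.fft.fft(frame)  # discrete Fourier transform (DFT)
    return spectra
```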
  • In step 304 , for each of all the possible pairs of the speech input devices, the system calculates, based on the obtained speech signals, a direction of arrival (DOA) of target speeches and directions of arrival (DOA) of other speeches other than the target speeches.
  • DOA direction of arrival
  • DOA directions of arrival
  • the system may calculate, based on the obtained speech signals, a channel correlation metric which represents a degree of correlation between the speech input devices, a cross-spectrum-based metric between the speech input devices, or combination of these.
  • the channel correlation can be calculated using any method known in the art, for example using the cross-power spectrum phase (CSP) analysis. If the number of the speech input devices is more than or equal to three, the correlation metrics for all pairs of the speech input devices are averaged and then used in a post filter in step 307 or input to the probability model described in step 312 .
  • CSP cross-power spectrum phase
  • cross-spectrum-based metric can be calculated using any method known in the art.
  • the cross-spectrum-based metric is calculated as the transfer function in the following non-patent literature: R. Zelinski, “A microphone array with adaptive post-filtering for noise reduction in reverberant rooms,” Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2578-2581, 1988.
  • In step 306 , the system enhances, based on the speech signals and the direction of arrival of the target speeches, the speech signals arriving from the direction of arrival of the target speech signals, using an adaptive beamformer, to generate the enhanced speech signals.
  • the output of step 306 may be used in step 307 for obtaining a power spectrum from the output, directly used in step 308 ( see step 322 ) for performing a filter bank for the output, or directly input to the probability model described in step 312 ( see step 321 ).
  • the system may obtain a power spectrum from the enhanced speech signal, using the post filtering, for example, Zelinski's post-filter.
  • the output of step 307 may be used in step 308 for performing a filter bank for the output, or directly input to the probability model described in step 312 ( see step 323 ).
  • the system may perform a filter bank, for example, the Mel-filter bank, for the power spectrum to obtain a log power spectrum, for example, a log-Mel power spectrum.
  • the output of the filter bank may be further logarithmic-converted.
  • In step 309 , for each pair of the speech input devices, the system calculates an aliasing metric, based on the direction of arrival of the target speeches and the directions of arrival of the other speeches.
  • the aliasing metric indicates which frequency band of speeches is susceptible to spatial aliasing. If the number of the speech input devices is more than or equal to three, the calculated aliasing metrics are averaged and then processed using a filter bank in step 310 , or directly input to the probability model described in step 312 (see step 331 ).
  • the output of the step 309 may be used in step 310 for performing a filter bank to obtain a filtering version of the aliasing metric, or directly input to the probability model described in step 312 (see step 331 ).
  • the system may perform a filter bank, for example, the Mel-Filter bank, for the aliasing metric to obtain a filtering version of the aliasing metric, for example, the Mel-filtering version of the aliasing metric.
  • a filter bank for example, the Mel-Filter bank
  • the system may perform a filter bank, for example, the Mel-Filter bank, for the cross-spectrum-based metric to obtain a filtering version of the channel correlation metric, for example, the Mel-filtering version of the channel correlation metric.
  • step 311 must be performed when Zelinski's post-filter is used as the post filtering in step 307 .
  • In step 312 , the system reads, into a memory, a probability model which is the product of a first normal distribution and a second normal distribution.
  • the first normal distribution is a model which has learned features of clean speeches.
  • the second normal distribution is a model which has a mean in the probability distribution of the enhanced speech signals. The details of the first normal distribution and the second normal distribution will be explained below by referring the FIGS. 4A and 4B .
  • In step 313 , the system inputs the enhanced speech signals and the aliasing metric to the probability model to output target speeches.
  • the second normal distribution is made so as to have a variance smaller than that of the first normal distribution in a case where the aliasing metric is close to zero.
  • the second normal distribution may be made so as to have a variance smaller than that of the first normal distribution in a case where the channel correlation metric or the cross-spectrum-based metric is close to one.
  • the second normal distribution may be made so as to have a variance larger than that of the first normal distribution in a case where the aliasing metric is close to one.
  • the second normal distribution may be made so as to have a variance larger than that of the first normal distribution in a case where the channel correlation metric or the cross-spectrum-based metric is close to zero.
  • the probability model having natural continuity in each of the frequency bands of the speech can be realized.
  • In step 314 , the system judges whether the time-frame now being processed is the last frame or not. If the judgment is positive, the system proceeds to the final step 315 . Meanwhile, if the judgment is negative, the system proceeds back to step 302 and then repeats steps 302 to 314 .
  • In step 315 , the system terminates the process mentioned above.
  • Steps 306 to 308 and steps 309 and 310 can be performed simultaneously or in parallel.
  • Steps 306 to 308 , steps 309 and 310 , and step 311 can be performed simultaneously or in parallel.
  • FIGS. 4A and 4B illustrate embodiments of a block diagram of the system.
  • FIG. 4A and FIG. 4B each describes a system according to an embodiment of the present invention.
  • Each of the systems ( 401 , 402 ) can be used for extracting target speeches from a plurality of speeches coming from different directions of arrival.
  • Each of the systems ( 401 , 402 ) may be the computer ( 101 ) described in FIG. 1 .
  • the system ( 401 ) comprises discrete Fourier transform (DFT) sections ( 491 , 492 , . . . , 493 ), a directions of arrival (DOA) & Channel correlation (CC) calculation section ( 411 ), an aliasing metric section ( 412 ), a filter bank section ( 413 ), a minimum variance beamformer (MVBF) section ( 414 ), a post filter section ( 415 ), a filter bank section ( 416 ), a factorial modeling section ( 417 ) and an ASR or logger section ( 418 ).
  • DFT discrete Fourier transform
  • DOA directions of arrival
  • CC Channel correlation
  • the system ( 402 ) comprises the common sections ( 491 , 492 , . . . , 493 and 412 to 418 ) as described in FIG. 4A .
  • the system ( 402 ) further comprises a DOA & Transfer function (TF) calculation section ( 421 ) instead of DOA & CC calculation section ( 411 ) and further comprises an additional filter bank section ( 422 ).
  • TF DOA & Transfer function
  • each of the common sections ( 491 , 492 , . . . , 493 and 412 to 418 ) which are commonly comprised in each of the systems ( 401 , 402 ), the DOA and CC calculation section ( 411 ) which are comprised in the system ( 401 ), and the DOA & TF calculation section ( 421 ) and the additional filter bank section ( 422 ) which are comprised in the system ( 402 ) will be explained.
  • Each of the common sections ( 491 , 492 , . . . , 493 and 412 to 418 ), the DOA and CC calculation section ( 411 ), and the DOA & TF calculation section ( 421 ) and the additional filter bank section ( 422 ) may perform the steps described in FIG. 3A , as mentioned below.
  • the discrete Fourier transform (DFT) sections ( 491 , 492 , . . . , 493 ) may perform the steps 302 and 303 .
  • the DOA & CC calculation section ( 411 ) may perform step 304 and calculate a channel correlation metric as described in step 305 .
  • the DOA & TF calculation section ( 421 ) may calculate a cross-spectrum-based metric as described in step 305 .
  • the minimum variance beamformer (MVBF) section ( 414 ) may perform step 306 .
  • the post filter section ( 415 ) may perform step 307 .
  • the filter bank section ( 416 ) may perform step 308 .
  • the aliasing metric section ( 412 ) may perform the step 309 .
  • the filter bank section ( 413 ) may perform step 310 .
  • the filter bank section ( 422 ) may perform step 311 .
  • the factorial modeling section ( 417 ) may perform the steps 312 and 313 .
  • each of the sections ( 412 to 418 and 491 to 493 ), the DOA & CC calculation section ( 411 ), the DOA & TF calculation section ( 421 ) and the additional filter bank section ( 422 ) will be described below.
  • plural microphones ( 481 , 482 , . . . , 483 ) are disposed apart in predetermined distances from one another between a target speaker and an interfering speaker.
  • Each of the microphones ( 481 , 482 , . . . , 483 ) receives speech signals from the target speaker and the interfering speaker.
  • Each of the microphones ( 481 , 482 , . . . , 483 ) transmits the speech signals, s m,T , to the system ( 401 ).
  • m denotes the index of the m-th microphone
  • T denotes time-frame number index.
  • the speech signal, s m,T may be a time domain signal in one frame at m-th microphone for all m.
  • Each of the DFT sections ( 491 , 492 , . . . , 493 ) may receive speech signals, s m,T , from the corresponding microphones ( 481 , 482 , . . . , 483 ).
  • the number of DFT sections ( 491 , 492 , . . . , 493 ) may correspond to those of the microphones ( 481 , 482 , . . . , 483 ).
  • Each of the DFT sections ( 491 , 492 , . . . , 493 ) then perform a discrete Fourier transform (DFT) for the speech signals, s m,T , at the m-th microphone to obtain a complex spectrum, S m,T .
  • the complex spectrum, S m,T can be expressed as S m,T (n).
  • the complex spectrum, S m,T (n) can be observed in the m-th microphone at the time-frame T in n-th DFT bin.
  • Each of the DFT sections ( 491 , 492 , . . . , 493 ) may transmit the complex spectrum, S m,T , to the DOA & CC calculation section ( 411 ) or the DOA & TF calculation section ( 421 ).
  • the DOA & CC calculation section ( 411 ) or the DOA & TF calculation section ( 421 ) each may estimate DOA and calculate a gain for a post filter, using for example, a CSP analysis.
  • the DOA & CC calculation section ( 411 ) or the DOA & TF calculation section ( 421 ) carry out the CSP analysis for each complex spectrum, S m,T .
  • the DOA & CC calculation section ( 411 ) or the DOA & TF calculation section ( 421 ) may calculate, for each frame, a CSP coefficient in order to estimate directions of arrival (DOA) and calculate a gain for a post filter.
  • the CSP coefficient, φ, may be calculated for all the possible microphone pairs ( l, m ), according to the following equation (1).
  • φ T,l,m ( i ) = IDFT[ W T ( n ) · S l,T ( n ) · S m,T ( n )* / ( |S l,T ( n )| · |S m,T ( n )| ) ]  (1)
  • φ T ( i ) denotes a CSP coefficient
  • i denotes a time-domain index
  • W T (n) denotes a weight for each DFT bin
  • n denotes the DFT bin number
  • * denotes a complex conjugate
  • φ T ( i ) = IDFT[ W T ( n ) · S 1,T ( n ) · S 2,T ( n )* / ( |S 1,T ( n )| · |S 2,T ( n )| ) ]  (1a)
  • the CSP coefficient is a representation of the cross-power spectrum phase analysis in a time region and denotes a correlation coefficient corresponding to a delay of i-sample.
  • the CSP coefficient, φ T , may be a moving average over a few frames back and forth in order to obtain a stable expression.
  • the CSP coefficient, φ T , may be given as φ T ( î T ), which is a CSP-target, i.e. a CSP coefficient of the direction of the target speaker.
  • W T ( n ) is normally set to one when the weight is not used, as in a normal CSP analysis.
  • a weighted CSP, in which an arbitrary weight value is used as W T ( n ), may also be employed.
  • the weighted CSP can be calculated, for example, according to an embodiment of the invention described in the U.S. Pat. No. 8,712,770.
  • the target speaker direction, î T corresponds to a direction of arrival of target speeches.
  • a range where the target speaker may exist is limited to either of a left or right side.
  • the interfering speaker direction, ĵ T , corresponds to directions of arrival of other speeches other than the target speeches.
  • a range where the interfering speaker may exist is limited to opposite side of the target speaker.
  • a DOA index, î T , of the target speaker can be estimated, according to the following equation (2), as a point which gives a peak in a side of the target speaker.
  • the DOA index, î T , of the target speaker may be calculated for each of the all possible pairs of the microphones.
  • î T = argmax( φ T ( i ) ), 0 ≦ i < i max  (2)
  • a DOA index, ⁇ T , of the interfering speaker can be estimated as similar that used for estimating the DOA index, î T , of the target speaker.
  • the DOA index, ⁇ T , of the interfering speaker may be calculated for each of the all possible pairs of the microphones.
  • the DOA index, î T can be used as DOA in the MVBF section ( 414 ) and, therefore, will be passed to the MVBF section ( 414 ).
  • the DOA indexes, î T and ⁇ T can be used in the aliasing metric section ( 412 ) and, therefore, will be passed to the aliasing metric section ( 412 ).
  • the DOA & CC calculation section ( 411 ) may calculate a channel correlation metric, V T, for all the possible microphone pairs (l, m).
  • the channel correlation metric represents a degree of correlation between the microphones.
  • the channel correlation metric, v T can be calculated according to the following equation (3), when the number of microphones is three or more.
  • the CSP-target is set to an average of the φ l,m ( î l,m ) values, which are calculated for all the possible microphone pairs ( l, m ).
  • v denotes channel correlation metric which is calculated for all the possible microphone pairs (l, m).
  • The suffix of the frame number, T, is omitted in equation (3).
  • the channel correlation metric, v can be calculated according to the following equation (3a), when the number of microphones is two.
  • v T = max( 0, φ T ( î T ) )  (3a)
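  • As an illustration, the following Python/NumPy sketch implements equations (1a), (2), and (3a) for the two-microphone case; the weight W T (n) is set to one, the search range i max is an illustrative choice, and the convention that the interfering speaker lies at negative delays reflects the statement that the interfering speaker is limited to the side opposite the target speaker.

```python
import numpy as np

def csp_coefficient(S1, S2, eps=1e-12):
    """Equation (1a): phi_T(i) = IDFT[ W_T(n) S1(n) S2(n)* / (|S1(n)| |S2(n)|) ],
    with the weight W_T(n) set to one (unweighted CSP)."""
    cross = S1 * np.conj(S2) / (np.abs(S1) * np.abs(S2) + eps)
    return np.real(np.fft.ifft(cross))  # correlation versus delay i (in samples)

def doa_indexes(phi, i_max):
    """Equation (2): the target DOA index is the peak on the target side
    (0 <= i < i_max); the interfering speaker is searched on the opposite side."""
    i_target = int(np.argmax(phi[:i_max]))              # î_T
    j_interf = -1 - int(np.argmax(phi[-i_max:][::-1]))  # ĵ_T (negative delay)
    return i_target, j_interf

def channel_correlation(phi, i_target):
    """Equation (3a): v_T = max(0, phi_T(î_T)) for the two-microphone case."""
    return max(0.0, float(phi[i_target]))
```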
  • the aliasing metric section ( 412 ) may calculate an aliasing metric, E T , based on the direction of arrival of the target speeches and the directions of arrival of the other speeches.
  • the aliasing metric can be calculated according to the following equations (4) and (5), when the number of microphones is three or more. When the number of microphones is three or more, an average of the aliasing metric for all the possible microphone pairs (l, m) is used.
  • E l,m ( n ) = cos( 2π · n · ( î l,m − ĵ l,m ) / N )  (4)
  • N denotes the total number of the DFT bin
  • î l,m denotes a DOA index for the target speaker when seen from the microphone pair (l, m); and ⁇ l,m denotes a DOA index for the interfering speaker when seen from the microphone pair (l, m).
  • The suffix of the frame number, T, is omitted in equations (4) and (5).
  • In FIG. 5 , the aliasing metric is shown by the dashed line.
  • the vertical axis denotes the aliasing metric E(n) and the horizontal axis denotes the DFT bin number. This indicates that lower-confidence regions are observed at regular intervals in frequency, depending on the directions of the interfering speaker and the target speaker.
  • the aliasing metric, E(n) will be passed to the filter bank section ( 413 ) in order to carry out a filter bank processing, where d denotes an index of the filter bank.
  • the filter bank section ( 413 ) may calculate a filtering version, e d , of the aliasing metric, E(n), using the filter bank, for example, Mel-filtering-bank.
  • the aliasing metric, E(n) is reduced to the lower dimensional signal, e d .
  • the filtering version, e d of the aliasing metric, E d , can be calculated, according to the following equation (6).
  • the filtering version, e d may be a Mel-band-pass filtered version of the aliasing metric, E d .
  • e d = Σ n max( 0, E ( n ) ) · B d,n / Σ n′ B d,n′  (6)
  • B d,n is a distribution of the d-th filter in the n-th bin.
  • the output, e d will be passed to the factorial modeling ( 417 ).
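  • For illustration, a Python/NumPy sketch of equations (4) and (6) is given below; the filter-bank matrix B is assumed to be supplied (for example, a triangular Mel filter bank such as the one sketched later for step 308).

```python
import numpy as np

def aliasing_metric(i_target, j_interf, N):
    """Equation (4): E(n) = cos(2*pi*n*(î - ĵ)/N) for DFT bins n = 0..N-1."""
    n = np.arange(N)
    return np.cos(2.0 * np.pi * n * (i_target - j_interf) / N)

def filtered_aliasing_metric(E, B):
    """Equation (6): e_d = sum_n max(0, E(n)) B_{d,n} / sum_n' B_{d,n'}.

    B: filter-bank weight matrix of shape (num_filters, num_bins), e.g. a Mel bank.
    """
    return (B @ np.maximum(0.0, E)) / B.sum(axis=1)
```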
  • the MVBF section ( 414 ) enhances the speech signals arriving from the direction of arrival of the target speech signals, based on the speech signals and the direction of arrival of the target speeches, to generate the enhanced speech signals.
  • the MVBF section ( 414 ) receives the DOA index, î T , and then carries out the MVBF in order to obtain an output of the adaptive beamformer, U T .
  • the MVBF minimizes ambient noise by maintaining a constant gain in the target direction.
  • the output of the adaptive beamformer, U T is a power spectrum.
  • the MVBF is described, for example, by the following non-patent literature, F. Asano, H. Asoh, and T. Matsui: “Sound source localization and separation in near field”, IEICE Trans., E83-A, No. 11, pp. 2286-2294, 2000.
  • the power spectrum, U T will be passed to the post-filter section ( 415 ).
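  • A minimal Python/NumPy sketch of a minimum variance beamformer for a single DFT bin follows; the steering vector built from the CSP delay, the diagonal loading term, and the sample covariance estimate are illustrative assumptions, and the formulation in the cited Asano et al. paper may differ in detail.

```python
import numpy as np

def steering_vector(n, N, delay_samples, num_mics=2):
    """Steering vector d(n) for DFT bin n, assuming a per-microphone delay
    (in samples) derived from the CSP-based DOA index of the target speaker."""
    delays = np.arange(num_mics) * delay_samples
    return np.exp(-2j * np.pi * n * delays / N)

def mvbf_weights(R, d, loading=1e-3):
    """Minimum variance beamformer weights for one bin:
    w = R^-1 d / (d^H R^-1 d), keeping unit gain in the target direction d
    while minimizing the output power (ambient noise and interference)."""
    R = R + loading * np.trace(R).real / R.shape[0] * np.eye(R.shape[0])
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / (np.conj(d) @ Rinv_d)

def beamformer_output_power(X, w):
    """Power-spectrum output U_T(n) of the beamformer for one bin, given the
    per-microphone spectra X (shape: num_mics,) and the weights w."""
    u = np.conj(w) @ X
    return np.abs(u) ** 2
```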
  • the post-filter section ( 415 ) carries out the post-filtering processing for the power spectrum, U T , in order to obtain an output of the post-filter, Y T .
  • v T denotes a channel correlation metric
  • the power spectrum, Y T can be filtered per spectral bin as Zelinski's post-filter does.
  • This other embodiment is applied only to the system ( 402 ) described in FIG. 4B .
  • Zelinski's post-filter is described, for example, in the following non-patent literature: R. Zelinski, “A microphone array with adaptive post-filtering for noise reduction in reverberant rooms,” Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2578-2581, 1988. In this embodiment, the value in each frequency band can be obtained.
  • the transfer function H of Zelinski's post-filter is represented as the following equation (8).
  • Φ̂ i,j ( n ) is a smoothed auto- or cross-spectral density between the microphones on channels i and j for DFT bin n; and ℜ{·} is a function for extracting the real part of a complex number.
  • the suffix of the frame number, T, is omitted in equation (8).
  • Φ̂ i,j is calculated as a local average around frame T, according to the following equation (9).
  • î T is the DOA index for the target speaker, determined by CSP analysis, and M is the DFT size.
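  • Equations (8) and (9) are not reproduced above; the sketch below therefore implements the standard Zelinski post-filter transfer function (the real part of the averaged cross-spectral densities over microphone pairs divided by the averaged auto-spectral densities, smoothed over frames), which matches the description here but may differ in detail from the patent's exact equations.

```python
import numpy as np

def zelinski_postfilter(spectra, phi_prev=None, alpha=0.8):
    """Standard Zelinski post-filter gain H(n) for one frame.

    spectra:  array of shape (num_mics, num_bins), complex spectra S_{i,T}(n).
    phi_prev: previously smoothed spectral densities (or None for the first frame).
    alpha:    recursive smoothing constant, standing in for the local average
              around frame T of equation (9).
    Returns (H, phi): the per-bin gain clipped to [0, 1] and the updated densities.
    """
    M, _ = spectra.shape
    # instantaneous auto- and cross-spectral densities Phi_{i,j}(n)
    phi = np.einsum('in,jn->ijn', spectra, np.conj(spectra))
    if phi_prev is not None:
        phi = alpha * phi_prev + (1.0 - alpha) * phi
    num, den = 0.0, 0.0
    for i in range(M):
        den = den + phi[i, i].real
        for j in range(i + 1, M):
            num = num + phi[i, j].real
    num = num * 2.0 / (M * (M - 1))
    den = den / M
    H = num / np.maximum(den, 1e-12)
    return np.clip(H, 0.0, 1.0), phi
```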
  • the output, Y T , from the post-filter section ( 415 ) will be passed to the filter bank section ( 416 ) in order to carry out a filter bank processing.
  • the filter bank section ( 416 ) may calculate a filtering version of the output, Y T , using the filter bank, for example, Mel-filtering-bank, and the output is logarithmic-converted to obtain y t .
  • the obtained y t is log power spectrum, for example, log Mel-power spectrum.
  • the obtained y is actually pre-processed with the gain adaptation so as to maximize the total likelihood of the utterance. This is because a Gaussian mixture model (GMM) in the log-mel spectrum domain has dependency on the input gain.
  • GMM Gaussian mixture model
  • the obtained y t from the filter bank section ( 416 ) will be passed to the factorial modeling section ( 417 ).
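  • A Python/NumPy sketch of step 308 follows: a triangular Mel filter bank applied to the post-filtered power spectrum, followed by a logarithm; the number of filters, the Mel warping formula, and the flooring constant are ordinary choices assumed here rather than values given in the text.

```python
import numpy as np

def mel_filterbank(num_filters, num_bins, sample_rate):
    """Triangular Mel filter-bank matrix B of shape (num_filters, num_bins),
    where num_bins is the length of the one-sided power spectrum."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), num_filters + 2)
    bins = np.floor((num_bins - 1) * mel_to_hz(mel_pts) / (sample_rate / 2.0)).astype(int)
    B = np.zeros((num_filters, num_bins))
    for d in range(num_filters):
        lo, ctr, hi = bins[d], bins[d + 1], bins[d + 2]
        for n in range(lo, ctr):
            B[d, n] = (n - lo) / max(ctr - lo, 1)   # rising slope of the triangle
        for n in range(ctr, hi):
            B[d, n] = (hi - n) / max(hi - ctr, 1)   # falling slope of the triangle
    return B

def log_mel(power_spectrum, B, floor=1e-10):
    """Log power spectrum y_t (e.g. log-Mel) from the post-filtered power spectrum Y_T."""
    return np.log(np.maximum(B @ power_spectrum, floor))
```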
  • V T the channel correlation metric
  • the DOA & TF calculation section ( 421 ) calculates a cross-spectrum-based metric, H T , for all the possible microphone pairs (l, m).
  • the cross-spectrum-based metric, H T can be calculated according to the equation (8) mentioned above.
  • H T is the same as H(n); the bin index (n) is omitted here.
  • the filter bank ( 420 ) may calculate a filtered version of the output, H T , using a filter bank, for example, a Mel filter bank. Accordingly, the obtained cross-spectrum-based metric, v T , is a Mel-filtered version of H T .
  • the output, v T , from the filter bank ( 420 ) can be calculated according to the following equation (14).
  • v d = max( 0, Σ n H ( n ) · B d,n / Σ n′ B d,n′ )  (14)
  • H(n) is calculated by the equation (8) mentioned above, and B d,n is a distribution of the d-th filter in the n-th bin.
  • T the suffix of the frame number
  • d the suffix of the filter bank
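  • For illustration, equation (14) can be computed as in the short sketch below, assuming the post-filter transfer function H(n) and a filter-bank matrix B (e.g. the Mel bank sketched above) are already available.

```python
import numpy as np

def mel_filtered_transfer_metric(H, B):
    """Equation (14): v_d = max(0, sum_n H(n) B_{d,n} / sum_n' B_{d,n'}).

    H: shape (num_bins,), post-filter transfer function per DFT bin.
    B: shape (num_filters, num_bins), filter-bank weights B_{d,n}.
    """
    return np.maximum(0.0, (B @ H) / B.sum(axis=1))
```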
  • the factorial modeling section ( 417 ) is one key feature of an embodiment of the present invention.
  • a factorial model comprising two factors is introduced.
  • the factorial model is a probability model which is the product of a first normal distribution and a second normal distribution.
  • the factorial model is represented as the following equation (15).
  • T the suffix, T, is omitted.
  • the first normal distribution is represented as a model, p(z|y).
  • p(z|y) is a model which has learned features of clean speeches. The clean speeches may be obtained in a quiet room.
  • p(z|y) may be a probabilistic distribution of the estimated clean speech z based on the output y from the filter bank section ( 416 ).
  • p(z|y) is trained in advance as a Gaussian Mixture Model, using the clean speech data ( 471 ).
  • the second normal distribution is represented as a model having a mean in the probability distribution of the enhanced speech signals.
  • the second normal distribution model may be probabilistic distribution of estimated clean speech z based on the confidence metric calculated with the filtering version, e, of the aliasing metric and the channel correlation metric, v.
  • the second normal distribution model is designed as a set of Gaussian distribution each associated with the components of the first normal distribution model.
  • the second normal distribution model has higher probability of z at the current y. Its variance is designed to be small when the confidence metric is high, and to be large when the confidence metric is low. This controls the product distribution shifted more to the model-based value when the confidence is low and more to y (pass-through) when the confidence is high. Further, the band with higher confidence contributes more for the total probability.
  • the product model, p(z|e,v,y), can be a Gaussian mixture model (GMM), because the product of the two Gaussian distributions, i.e. the first normal distribution and the second normal distribution, is also a Gaussian distribution.
  • GMM Gaussian mixture model
  • the first normal distribution model is given as the following equation (16).
  • k denotes each index in the mixed normal distribution; N denotes a normal distribution; μ denotes a mean vector; and Σ denotes a variance-covariance matrix (a diagonal covariance matrix may be used).
  • λ k , μ x,k and Σ x,k are given at each k-th Gaussian.
  • γ k ( y ) is a posterior probability that the k-th normal distribution is selected when y is observed.
  • the posterior probability, γ k ( y ), is given as the following equation (17).
  • γ k ( y ) = λ k · N( y ; μ x,k , Σ x,k ) / Σ k′ λ k′ · N( y ; μ x,k′ , Σ x,k′ )  (17)
  • λ k is the prior probability of the clean speech.
  • the variance of the second normal distribution is created by scaling each component in the variance-covariance matrix for the clean speech model.
  • the scaling is set to a smaller value in a case where the aliasing metric, e, has a value closer to zero, or the channel correlation metric, v, or a cross-spectrum-based metric is close to one.
  • the variance, Σ z,k , is designed as a scaled version of Σ x,k .
  • the scaling is performed with the parameters e, v, or a combination of these.
  • the variance, Σ z,k , can be calculated, according to the following equations (19), (20), (21) and (22), by scaling the k-th Gaussian at the d-th band in the speech model.
  • in equations (19), (20), (21) and (22), the scaling parameters each denote a constant, and a very small value is used in order to avoid zero.
  • Z k is a normalization constant for setting the integral of the probability distribution to one.
  • μ x,k is the mean of the clean speech model
  • Σ x,k is the variance of the clean speech model.
  • the mean, μ x,k , and the variance, Σ x,k , are given in advance.
  • the posterior probability, γ k ( y ), of the k-th normal distribution is expanded to γ k ′( y, e, v ).
  • the expanded posterior probability, γ k ′( y, e, v ), is given as the following equation (26).
  • γ k ′( y, e, v ) = λ k · N( y ; μ z,k ′, Σ z,k ′ ) / Σ k″ λ k″ · N( y ; μ z,k″ ′, Σ z,k″ ′ )  (26)
  • λ k is the prior probability of the clean speech model.
  • the prior probability, λ k , is given in advance.
  • the variance, Σ z,k ′, used for the posterior probability, γ′, becomes smaller than the original variance, Σ x,k , for the d-th band in a case where the aliasing metric, e, for the d-th frequency band has a value closer to zero, or the channel correlation metric, v, or a cross-spectrum-based metric is close to one.
  • in a case where the aliasing metric, e, for the d-th frequency band has a value closer to zero, or the channel correlation metric, v, or a cross-spectrum-based metric is close to one, the variance, Σ k,d , becomes smaller, and the variance, Σ z,k,d ′, becomes smaller than the original variance, Σ x,k,d .
  • the distribution of the product probability, p(z|e,v,y), is shifted from the model-estimated value toward the y from the filter bank section ( 416 ) in a case where the aliasing metric, e, has a value closer to zero, or the channel correlation metric, v, or a cross-spectrum-based metric is close to one. This is because the second normal distribution is a model with a higher probability around z = y.
  • in a case where the aliasing metric, e, for the d-th frequency band has a value closer to one, or the channel correlation metric, v, or a cross-spectrum-based metric is close to zero, the d-th band Gaussian has a larger variance; thus the contribution of such a frequency band becomes low in the estimation of the posterior probability.
  • the average vector, μ z,k,d ′, shifts to μ x,k,d , and the distribution of the product probability, p(z|e,v,y), shifts toward the model-estimated value.
  • the final estimated output, ẑ, from the factorial modeling section ( 417 ) can be obtained using the minimum mean square error (MMSE) estimation.
  • the final estimated output, ẑ, can be calculated according to the following equation (27).
  • the final estimated output, ẑ, will be passed to the ASR or logger section ( 418 ).
  • the ASR section ( 418 ) may output the final estimated output, ẑ, as a recognized result of the speech.
  • the logger section ( 418 ) may store the final estimated output, ẑ, into a storage, such as the disk ( 108 ) described in FIG. 1 .
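  • Because equations (15) and (18) to (25) are not reproduced above, the sketch below is only an illustration of the behaviour described for the factorial modeling section ( 417 ), assuming diagonal covariances per Mel band: the second, observation-centred Gaussian is given a small variance when the per-band confidence is high (high when the aliasing metric e is close to zero and the correlation metric v is close to one, e.g. conf = (1 − e)·v), the two Gaussians are multiplied, the posterior of equation (26) is evaluated, and the MMSE estimate of equation (27) is the posterior-weighted mean. The confidence-to-variance mapping (the constant kappa) is an assumption, not the patent's equations.

```python
import numpy as np

def factorial_mmse(y, conf, weights, means, variances, kappa=0.05):
    """MMSE estimate z_hat of the clean log-Mel speech (sketch of steps 312-313).

    y:         observed log-Mel vector from the filter bank section (416), shape (D,).
    conf:      per-band confidence in [0, 1], e.g. (1 - e) * v,            shape (D,).
    weights:   clean-speech GMM priors lambda_k,                            shape (K,).
    means:     clean-speech GMM means mu_{x,k},                             shape (K, D).
    variances: clean-speech GMM diagonal variances Sigma_{x,k},             shape (K, D).
    kappa:     assumed constant mapping confidence to the variance of the
               second (observation-centred) Gaussian.
    """
    # second normal distribution: mean y; variance small when confidence is high,
    # large when it is low (assumed mapping)
    var_obs = variances * (1.0 - conf + kappa) / (conf + kappa)
    # product of the two Gaussians, per component k and band d
    var_z = 1.0 / (1.0 / variances + 1.0 / var_obs)
    mu_z = var_z * (means / variances + y / var_obs)
    # posterior gamma'_k(y, e, v) of equation (26), evaluated under the product model
    log_resp = (np.log(weights)
                - 0.5 * np.sum(np.log(2.0 * np.pi * var_z)
                               + (y - mu_z) ** 2 / var_z, axis=1))
    log_resp -= log_resp.max()
    resp = np.exp(log_resp)
    resp /= resp.sum()
    # MMSE estimate of equation (27): posterior-weighted mean of the product model
    return resp @ mu_z
```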
  • FIG. 6 illustrates experimental results according to one embodiment of the present invention.
  • two omni-directional microphones were placed on the table between two subject speakers, A and B.
  • the distance between the microphones was 12 cm.
  • the beamformer operated at a 22.05-kHz sampling frequency.
  • the two subject speakers alternately read 100 sentences written in Japanese and the speeches were recorded. Using the recorded speeches as test data, mixed speech data was generated as the evaluation data.
  • the mixed speech data simulates the simultaneous utterance between the two subject speakers.
  • part of the speech segments obtained from subject speaker A was extracted, scaled by 50%, and then superimposed continuously on the speech segments obtained from subject speaker B.
  • the obtained one hundred utterances were used as the target of the ASR.
  • the speeches after the superposition were input to the adaptive beamformer and the post-filter and, after the processing, utterance splitting was performed.
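  • As a small illustration of the evaluation-data generation described above (the 50% scaling factor comes from the text; the segment boundaries and sample handling are illustrative assumptions):

```python
import numpy as np

def superimpose(target_speech, interfering_segment, start_sample, scale=0.5):
    """Scale a segment from the interfering speaker by 50% and superimpose it on the
    target speaker's recording, simulating a simultaneous (mixed-speech) utterance."""
    mixed = target_speech.astype(float)
    end = min(start_sample + len(interfering_segment), len(mixed))
    mixed[start_sample:end] += scale * interfering_segment[: end - start_sample]
    return mixed
```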
  • Table 1 ( 601 ) shows the Character Error Rate (CER) %. The speech recognition accuracy was evaluated by the CER.
  • Case 1 is a baseline of the evaluation, as a reference. Cases 2 to 4 are comparative examples. Case 5 is the Example according to an embodiment of the present invention.
  • Case 1 was a baseline using the single microphone nearest to the subject speaker. The result of the CER, 62.1%, is very high.
  • Case 2 is the simple MVBF system. It showed much improvement for the mixed speech, but little for the alternating speech. The MVBF achieved some speech separation in the mixed speech segments, but it did not sufficiently suppress the interfering speaker's speech. The effect of the MVBF was observed, but the result of the CER, 39.7%, is still high.
  • Case 3 uses Zelinski's post-filter, which was applied on top of case 2. The effect of Zelinski's post-filter was observed, but the result of the CER, 20.8%, is still not sufficient.
  • Case 4: the output of case 3 was completely replaced with the estimation value of the clean speech model, p(z|y).
  • Case 5 was the system performing factorial modeling, according to an embodiment of the present invention.
  • the output of the case 3 is set to v.
  • the output of case 3 was partially replaced by amending the data having a low degree of reliability with the data having high reliability.
  • the CER was further reduced compared to case 4.
  • the present invention may be a method, a system, and/or a computer program product.
  • the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
  • the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
  • the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • a non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
  • RAM random access memory
  • ROM read-only memory
  • EPROM or Flash memory erasable programmable read-only memory
  • SRAM static random access memory
  • CD-ROM compact disc read-only memory
  • DVD digital versatile disk
  • memory stick a floppy disk
  • a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon
  • a computer readable storage medium is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
  • Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
  • the network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
  • Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages.
  • the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
  • These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
  • the computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the block may occur out of the order noted in the figures.
  • two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

Methods and systems are provided for separating a target speech from a plurality of other speeches having different directions of arrival. One of the methods includes obtaining speech signals from speech input devices disposed apart in predetermined distances from one another, calculating a direction of arrival of target speeches and directions of arrival of other speeches other than the target speeches for each of at least one pair of speech input devices, calculating an aliasing metric, wherein the aliasing metric indicates which frequency band of speeches is susceptible to spatial aliasing, enhancing speech signals arrived from the direction of arrival of the target speech signals, based on the speech signals and the direction of arrival of the target speeches, to generate the enhanced speech signals, reading a probability model, and inputting the enhanced speech signals and the aliasing metric to the probability model to output target speeches.

Description

BACKGROUND
Technical Field
This invention relates generally to an extraction of target speeches and, more particularly, to an extraction of target speeches from a plurality of speeches coming from different directions of arrival.
Description of the Related Art
Automatic speech recognition (ASR) is now being widely used in many business solutions. Call-center monitoring is a good example. The agent's speech and the customer's speech on the telephone line are recorded separately by a logger and also transcribed separately. The agent's speech is usually used for checking the agent's performance, while the customer's speech is mainly used to detect unhappy customers who should be brought to a supervisor's attention. The customer's speech may also be further analyzed for the customer's potential needs.
Face-to-face conversations often occur in sales settings or in automobiles. In sales, conversations take place between an agent and a customer over a desk or a counter. In automobiles, conversations take place between a driver and a passenger while driving.
There is a significant need to monitor face-to-face conversations, for example in the financial industry, similar to call-center monitoring. Accordingly, transcription of such conversations is now commonly performed.
SUMMARY
According to one aspect of the present invention, an embodiment of the present invention provides a computer-implemented method for extracting target speeches from a plurality of speeches coming from different directions of arrival. The method comprises obtaining speech signals from each of speech input devices disposed apart in predetermined distances from one another; for each pair of the speech input devices, calculating, based on the speech signals, a direction of arrival of target speeches and directions of arrival of other speeches other than the target speeches; for each pair of the speech input devices, calculating an aliasing metric, based on the direction of arrival of the target speeches and the directions of arrival of the other speeches, where the aliasing metric indicates which frequency band of speeches is susceptible to spatial aliasing; using an adaptive beamformer, enhancing the speech signals arrived from the direction of arrival of the target speech signals, based on the speech signals and the direction of arrival of the target speeches, to generate the enhanced speech signals; reading a probability model which is the product of a first normal distribution and a second normal distribution, where the first normal distribution is a model which has learned features of clean speeches and the second normal distribution is a model which has a mean in the probability distribution of the enhanced speech signals and is made so as to have a variance smaller than that of the first normal distribution in a case where the aliasing metric is close to zero, and inputting the enhanced speech signals and the aliasing metric to the probability model to output target speeches.
According to another aspect of the present invention, a system such as a computer system comprising a computer readable storage medium storing a program of instructions executable by the system to perform one or more methods described herein may be provided.
According to another aspect of the present invention, a computer program product comprising a computer readable storage medium storing a program of instructions executable by the system to perform one or more methods described herein also may be provided.
BRIEF DESCRIPTION OF THE DRAWINGS
The disclosure will provide details in the following description of preferred embodiments with reference to the following figures.
FIG. 1 illustrates an exemplified basic block diagram of a computer hardware used in an embodiment of the present invention.
FIG. 2 illustrates examples of microphones placed between two speakers, according to an embodiment of the present invention.
FIGS. 3A to 3D illustrate one embodiment of a flowchart of an overall process for extracting target speeches from a plurality of speeches coming from different directions of arrival, according to an embodiment of the present invention.
FIG. 4A illustrates one embodiment of a block diagram of the system, according to an embodiment of the present invention.
FIG. 4B illustrates another embodiment of a block diagram of the system, according to an embodiment of the present invention.
FIG. 5 illustrates an example of spatial aliasing and of the aliasing metric, according to an embodiment of the present invention.
FIG. 6 illustrates experimental results according to one embodiment of the present invention.
DETAILED DESCRIPTION
The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
With reference now to FIG. 1, FIG. 1 illustrates an exemplified basic block diagram of a computer hardware used in an embodiment of the present invention.
A computer (101) may be, for example, but is not limited to, a desktop, a laptop, a notebook, a tablet or a server computer. The server computer may be, for example, but is not limited to, a workstation, a rack-mount type server, a blade type server, or a mainframe server and may run, for example, a hypervisor for creating and running one or more virtual machines. The computer (101) may comprise one or more CPUs (102) and a main memory (103) connected to a bus (104). The CPU (102) may be preferably based on a 32-bit or 64-bit architecture. The CPU (102) may be, for example, but is not limited to, the Power® series of International Business Machines Corporation; the Core I™ series, the Core 2™ series, the Atom™ series, the Xeon™ series, the Pentium® series, or the Celeron® series of Intel Corporation; or the Phenom™ series, the Athlon™ series, the Turion™ series, or Sempron™ of Advanced Micro Devices, Inc. (“Power” is registered trademark of International Business Machines Corporation in the United States, other countries, or both; “Core i”, “Core 2”, “Atom”, and “Xeon” are trademarks, and “Pentium” and “Celeron” are registered trademarks of Intel Corporation in the United States, other countries, or both; “Phenom”, “Athlon”, “Turion”, and “Sempron” are trademarks of Advanced Micro Devices, Inc. in the United States, other countries, or both).
A display (106) such as a liquid crystal display (LCD) may be connected to the bus (104) via a display controller (105). The display (106) may be used to display, for management of the computer(s), information on a computer connected to a network via a communication line and information on software running on the computer using an appropriate graphics interface. A disk (108) such as a hard disk or a solid state drive, SSD, and a drive (109) such as a CD, a DVD, or a BD (Blu-ray disk) drive may be connected to the bus (104) via an SATA or IDE controller (107). Moreover, a keyboard (111) and a mouse (112) may be connected to the bus (104) via a keyboard-mouse controller (110) or USB bus (not shown).
An operating system, programs providing Windows®, UNIX® Mac OS®, Linux®, or a Java® processing environment, Java® applications, a Java® virtual machine (VM), and a Java® just-in-time (JIT) compiler, such as J2EE®, other programs, and any data may be stored in the disk (108) to be loadable to the main memory. (“Windows” is a registered trademark of Microsoft corporation in the United States, other countries, or both; “UNIX” is a registered trademark of the Open Group in the United States, other countries, or both; “Mac OS” is a registered trademark of Apple Inc. in the United States, other countries, or both; “Linux” is a registered trademark of Linus Torvalds in the United States, other countries, or both; and “Java” and “J2EE” are registered trademarks of Oracle America, Inc. in the United States, other countries, or both).
The drive (109) may be used to install a program, such as the computer program of an embodiment of the present invention, readable from a CD-ROM, a DVD-ROM, or a BD to the disk (108) or to load any data readable from a CD-ROM, a DVD-ROM, or a BD into the main memory (103) or the disk (108), if necessary.
A communication interface (114) may be based on, for example, but is not limited to, the Ethernet® protocol. The communication interface (114) may be connected to the bus (104) via a communication controller (113), physically connects the computer (101) to a communication line (115), and may provide a network interface layer to the TCP/IP communication protocol of a communication function of the operating system of the computer (101). In this case, the communication line (115) may be a wired LAN environment or a wireless LAN environment based on wireless LAN connectivity standards, for example, but is not limited to, IEEE® 802.11a/b/g/n (“IEEE” is a registered trademark of Institute of Electrical and Electronics Engineers, Inc. in the United States, other countries, or both).
Hereinafter, an embodiment of the present invention will be described with reference to the following FIGS. 2 to 6.
As stated above, there is a significant need to monitor face-to-face conversations.
Suppression of unwanted speech is the key factor in splitting a conversation into two tracks, since a conversation usually proceeds alternately. However, it is difficult to completely suppress unwanted speech with a small number of microphones, because spatial aliasing between the two speakers often causes post-filtering based on the correlations among multiple channels to become inaccurate.
Further, there are commercially available microphone arrays that have 4 to 16 elements. However, they are expensive, and it is still difficult even with such arrays to completely shut out sounds from non-subject speakers.
Accordingly, the idea of an embodiment of the present invention is based on an extension of the post-filtering approach in a probabilistic framework that integrates the aliasing metric and a speech model.
With reference now to FIG. 2, FIG. 2 illustrates two examples of microphones placed between two speakers.
FIG. 2 illustrates two scenarios: the upper part (201) shows that two speech input devices are installed, and the lower part (231) shows that three or more speech input devices are installed. The speech input device may be, for example, a microphone. In the following, the term, “a microphone” is used instead of the speech input device, but this does not mean that the speech input device is limited to a microphone.
In the upper part (201), two microphones (221-1, 221-2) are placed between a target speaker (211) and an interfering speaker (212). The target speaker (211) may be, for example, but is not limited to, an agent in a company. The interfering speaker (212) may be, for example, but is not limited to, a customer of the agent.
In the lower part (231), three or more microphones (251-1, 251-2, . . . , 251-n) are placed between a target speaker (241) and an interfering speaker (242). Suitable intervals between the microphones may be determined in a manner similar to that mentioned above.
With reference now to FIGS. 3A to 3D, FIGS. 3A to 3D illustrate one embodiment of a flowchart of an overall process for extracting target speeches from a plurality of speeches coming from different directions of arrival.
FIG. 3A illustrates one embodiment of a flowchart of an overall process. FIG. 3B illustrates a detail of the steps 306 to 308 described in FIG. 3A. FIG. 3C illustrates a detail of the steps 309 to 310 described in FIG. 3A. FIG. 3D illustrates a detail of the step 311 described in FIG. 3A.
A system such as the computer (101) performs each of the steps described in FIGS. 3A to 3D. The system may be implemented as a single computer or as plural computers.
In step 301, the system starts the process mentioned above.
In step 302, the system obtains speech signals from each of speech input devices disposed apart in predetermined distances from one another.
In step 303, the system performs a discrete Fourier transform (DFT) for the obtained speech signals to obtain a complex spectrum.
In step 304, for each of all possible pairs of the speech input devices, the system calculates, based on the obtained speech signals, a direction of arrival (DOA) of target speeches and directions of arrival (DOA) of other speeches other than the target speeches.
In an optional step 305, for each of all possible pairs of the speech input devices, the system may calculate, based on the obtained speech signals, a channel correlation metric which represents a degree of correlation between the speech input devices, a cross-spectrum-based metric between the speech input devices, or a combination of these.
The channel correlation can be calculated using any method known in the art, for example using the cross-power spectrum phase (CSP) analysis. If the number of speech input devices is three or more, the correlation metrics for all pairs of the speech input devices are averaged and then used in a post filter in step 307 or input to the probability model described in step 312.
The cross-spectrum-based metric can be calculated using any method known in the art. For example, the cross-spectrum-based metric can be calculated as the transfer function described in the following non-patent literature: R. Zelinski, "A microphone array with adaptive post-filtering for noise reduction in reverberant rooms," Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2578-2581, 1988.
In step 306, the system enhances, based on the speech signals and the direction of arrival of the target speeches, the speech signals arriving from the direction of arrival of the target speeches, using an adaptive beamformer, to generate the enhanced speech signals.
The output of the step 306 may be used in step 307 for obtaining a power spectrum from the output, directly used in step 308 (see step 322) for performing a filter bank for the output, or directly input to the probability model described in step 312 (see step 321).
In an optional step 307, the system may obtain a power spectrum from the enhanced speech signal, using the post filtering, for example, Zelinski's post-filter.
The output of the step 307 may be used in step 308 for performing a filter bank for the output, or directly input to the probability model described in step 312 (see step 323).
In an optional step 308, the system may perform a filter bank, for example, the Mel-filter bank, for the power spectrum to obtain a log power spectrum, for example, a log-Mel power spectrum. The output of the filter bank may be further logarithmic-converted.
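Purely as an illustrative aside (not part of the disclosed embodiments), the filter bank processing of step 308 could be sketched in Python roughly as follows; the triangular Mel filter design, the function names, and the constants are assumptions made for illustration only.

import numpy as np

def mel_filter_bank(num_filters, n_fft, sample_rate):
    # Build triangular filters B[d, n] spaced evenly on the Mel scale (illustrative design).
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), num_filters + 2)
    bin_points = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    B = np.zeros((num_filters, n_fft // 2 + 1))
    for d in range(1, num_filters + 1):
        left, center, right = bin_points[d - 1], bin_points[d], bin_points[d + 1]
        for n in range(left, center):
            B[d - 1, n] = (n - left) / max(center - left, 1)
        for n in range(center, right):
            B[d - 1, n] = (right - n) / max(right - center, 1)
    return B

def log_mel_power(power_spectrum, B):
    # Step 308 (sketch): band-pass the power spectrum with the filter bank, then take the logarithm.
    return np.log(B @ power_spectrum + 1e-10)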
In step 309, for each pair of the speech input devices, the system calculates an aliasing metric, based on the direction of arrival of the target speeches and the directions of arrival of the other speeches. The aliasing metric indicates which frequency band of speeches is susceptible to spatial aliasing. If the number of the speech input devices is more than or equal to three, the calculated aliasing metrics are averaged and then processed using a filter bank in step 310, or directly input to the probability model described in step 312 (see step 331).
The output of the step 309 may be used in step 310 for performing a filter bank to obtain a filtering version of the aliasing metric, or directly input to the probability model described in step 312 (see step 331).
In an optional step 310, the system may perform a filter bank, for example, the Mel-filter bank, for the aliasing metric to obtain a filtering version of the aliasing metric, for example, the Mel-filtering version of the aliasing metric.
In an optional step 311, the system may perform a filter bank, for example, the Mel-filter bank, for the cross-spectrum-based metric to obtain a filtering version of the channel correlation metric, for example, the Mel-filtering version of the channel correlation metric. Step 311 must be performed when Zelinski's post-filter is used as the post-filtering in step 307.
In step 312, the system reads, into a memory, a probability model which is the product of a first normal distribution and a second normal distribution. The first normal distribution is a model which has learned features of clean speeches. The second normal distribution is a model which has a mean in the probability distribution of the enhanced speech signals. The details of the first normal distribution and the second normal distribution will be explained below by referring to FIGS. 4A and 4B.
In step 313, the system inputs the enhanced speech signals and the aliasing metric to the probability model to output target speeches.
The second normal distribution is made so as to have a variance smaller than that of the first normal distribution in a case where the aliasing metric is close to zero.
Further, the second normal distribution may be made so as to have a variance smaller than that of the first normal distribution in a case where the channel correlation metric or the cross-spectrum-based metric is close to one.
Further, the second normal distribution may be made so as to have a variance larger than that of the first normal distribution in a case where the aliasing metric is close to one.
Further, the second normal distribution may be made so as to have a variance larger than that of the first normal distribution in a case where the channel correlation metric or the cross-spectrum-based metric is close to zero.
Due to the modification of the variance of the second normal distribution, a probability model having natural continuity across the frequency bands of the speech can be realized.
In step 314, the system judges whether the time frame currently being processed is the last frame. If the judgment is positive, the system proceeds to the final step 315. Meanwhile, if the judgment is negative, the system proceeds back to step 302 and then repeats steps 302 to 314.
In step 315, the system terminates the process mentioned above.
Please note that steps 306 to 308 and steps 309 and 310 can be performed simultaneously or in parallel.
Further, please note that steps 306 to 308, steps 309 and 310, and step 311 can be performed simultaneously or in parallel.
With reference now to FIGS. 4A and 4B, FIGS. 4A and 4B illustrate embodiments of a block diagram of the system.
FIG. 4A and FIG. 4B each describes a system according to an embodiment of the present invention.
Each of the systems (401, 402) can be used for extracting target speeches from a plurality of speeches coming from different directions of arrival. Each of the systems (401, 402) may be the computer (101) described in FIG. 1.
The system (401) comprises discrete Fourier transform (DFT) sections (491, 492, . . . , 493), a direction of arrival (DOA) & channel correlation (CC) calculation section (411), an aliasing metric section (412), a filter bank section (413), a minimum variance beamformer (MVBF) section (414), a post filter section (415), a filter bank section (416), a factorial modeling section (417), and an ASR or logger section (418).
The system (402) comprises the common sections (491, 492, . . . , 493 and 412 to 418) as described in FIG. 4A. The system (402) further comprises a DOA & Transfer function (TF) calculation section (421) instead of DOA & CC calculation section (411) and further comprises an additional filter bank section (422). As stated below, when Zelinski's post-filter is used in the post filter (415), the system (402) is selected.
In the following, the common sections (491, 492, . . . , 493 and 412 to 418) comprised in both of the systems (401, 402), the DOA & CC calculation section (411) comprised in the system (401), and the DOA & TF calculation section (421) and the additional filter bank section (422) comprised in the system (402) will be explained.
Each of the common sections (491, 492, . . . , 493 and 412 to 418), the DOA & CC calculation section (411), the DOA & TF calculation section (421), and the additional filter bank section (422) may perform the steps described in FIG. 3A, as mentioned below.
The discrete Fourier transform (DFT) sections (491, 492, . . . , 493) may perform the steps 302 and 303.
The DOA & CC calculation section (411) may perform step 304 and calculate a channel correlation metric as described in step 305. The DOA & TF calculation section (421) may perform step 304 and calculate a cross-spectrum-based metric as described in step 305.
The minimum variance beamformer (MVBF) section (414) may perform step 306.
The post filter section (415) may perform the step 307.
The filter bank section (416) may perform step 308.
The aliasing metric section (412) may perform the step 309.
The filter bank section (413) may perform step 310.
The filter bank section (422) may perform step 311.
The factorial modeling section (417) may perform the steps 312 and 313.
In the following, the processing details carried out by each section (412 to 418 and 491 to 493), the DOA & CC calculation section (411), the DOA & TF calculation section (421), and the additional filter bank section (422) will be described.
Let us suppose that plural microphones (481, 482, . . . , 483) are disposed apart in predetermined distances from one another between a target speaker and an interfering speaker.
Each of the microphones (481, 482, . . . , 483) receives speech signals from the target speaker and the interfering speaker. Each of the microphones (481, 482, . . . , 483) transmits the speech signals, sm,T, to the system (401). Here, m denotes the index of the m-th microphone, and T denotes the time-frame number index. Accordingly, the speech signal, sm,T, may be a time-domain signal in one frame at the m-th microphone, for all m.
Each of the DFT sections (491, 492, . . . , 493) may receive speech signals, sm,T, from the corresponding microphones (481, 482, . . . , 483). The number of DFT sections (491, 492, . . . , 493) may correspond to those of the microphones (481, 482, . . . , 483).
Each of the DFT sections (491, 492, . . . , 493) then performs a discrete Fourier transform (DFT) on the speech signals, sm,T, of the m-th microphone to obtain a complex spectrum, Sm,T. The complex spectrum, Sm,T, can be expressed as Sm,T(n), which is the complex spectrum observed at the m-th microphone in the time frame T and the n-th DFT bin.
Each of the DFT sections (491, 492, . . . , 493) may transmit the complex spectrum, Sm,T, to the DOA & CC calculation section (411) or the DOA & TF calculation section (421).
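As a purely illustrative aside (not part of the embodiment), the framing and DFT performed by each DFT section could be sketched in Python as follows; the frame length, hop size, and window choice are assumptions for illustration only.

import numpy as np

def complex_spectra(s_m, frame_len=512, hop=256):
    # Split the time-domain signal s_m of one microphone into overlapping frames and
    # apply a windowed DFT, yielding S_{m,T}(n) for every time frame T and DFT bin n.
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(s_m) - frame_len + 1, hop):
        frames.append(np.fft.rfft(s_m[start:start + frame_len] * window))
    return np.array(frames)  # shape: (number of frames T, frame_len // 2 + 1 bins)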
In the following, common processes performed by the DOA & CC calculation section (411) and the DOA & TF calculation section (421) will be described.
The DOA & CC calculation section (411) or the DOA & TF calculation section (421) may estimate the DOA and calculate a gain for a post filter, using, for example, a CSP analysis. The DOA & CC calculation section (411) or the DOA & TF calculation section (421) carries out the CSP analysis for each complex spectrum, Sm,T, and may calculate, for each frame, a CSP coefficient in order to estimate the directions of arrival (DOA) and to calculate a gain for a post filter. The CSP coefficient φ may be calculated for all the possible microphone pairs (l, m), according to the following equation (1).
\varphi_{T,l,m}(i) = \mathrm{IDFT}\left[ \frac{W_T(n) \cdot S_{l,T}(n) \cdot S_{m,T}(n)^{*}}{|S_{l,T}(n)| \cdot |S_{m,T}(n)|} \right]   (1)
where φT(i) denotes a CSP coefficient; i denotes a time-domain index; WT(n) denotes a weight for each DFT bin; n denotes the DFT bin number; and * denotes a complex conjugate.
Accordingly, if two microphones are used, the equation (1) mentioned above may be rewritten as the following equation (1a).
\varphi_{T}(i) = \mathrm{IDFT}\left[ \frac{W_T(n) \cdot S_{1,T}(n) \cdot S_{2,T}(n)^{*}}{|S_{1,T}(n)| \cdot |S_{2,T}(n)|} \right]   (1a)
The CSP coefficient is a representation of the cross-power spectrum phase analysis in the time domain and denotes a correlation coefficient corresponding to a delay of i samples.
In one embodiment, the CSP coefficient, φT, may be a moving average over a few frames back and forth in order to obtain a stable expression. In another embodiment, the CSP coefficient, φT, may be given as φT(îT), which is the CSP-target, i.e., the CSP coefficient in the direction of the target speaker.
In one embodiment, WT(n) is simply set to one when the weight is not used, as in a normal CSP analysis. In another embodiment, a weighted CSP, i.e., an arbitrary weight value, may be used as WT(n). The weighted CSP can be calculated, for example, according to an embodiment of the invention described in U.S. Pat. No. 8,712,770.
The value of i maximizing φ gives the target speaker direction, îT, and the interfering speaker direction, ĵT. The target speaker direction, îT, corresponds to the direction of arrival of the target speeches. The range where the target speaker may exist is limited to either the left or the right side. The interfering speaker direction, ĵT, corresponds to the directions of arrival of the other speeches other than the target speeches. The range where the interfering speaker may exist is limited to the side opposite to the target speaker.
A DOA index, îT, of the target speaker can be estimated, according to the following equation (2), as the point which gives a peak on the side of the target speaker. The DOA index, îT, of the target speaker may be calculated for all possible pairs of the microphones.
\hat{i}_T = \underset{0 < i < i_{\max}}{\arg\max}\ \bar{\varphi}_T(i)   (2)
A DOA index, ĵT, of the interfering speaker can be estimated in a manner similar to that used for estimating the DOA index, îT, of the target speaker. The DOA index, ĵT, of the interfering speaker may be calculated for all possible pairs of the microphones.
The DOA index, îT, can be used as DOA in the MVBF section (414) and, therefore, will be passed to the MVBF section (414).
The DOA indexes, îT and ĵT, can be used in the aliasing metric section (412) and, therefore, will be passed to the aliasing metric section (412).
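As a purely illustrative aside (not part of the embodiment), the CSP analysis of equation (1) and the peak search of equation (2) could be sketched in Python roughly as follows; the function names and the small regularization constant are assumptions for illustration only.

import numpy as np

def csp_coefficient(S_l, S_m, weights=None):
    # Equation (1) (sketch): whitened cross-spectrum transformed back to the time domain.
    cross = S_l * np.conj(S_m) / (np.abs(S_l) * np.abs(S_m) + 1e-10)
    if weights is not None:
        cross = cross * weights          # optional weighted CSP, W_T(n)
    return np.fft.irfft(cross)           # phi_T(i): correlation versus a delay of i samples

def estimate_target_doa(phi, i_max):
    # Equation (2) (sketch): the target DOA index is the delay maximizing phi within the
    # search range assigned to the target-speaker side; negative delays (the opposite
    # side) wrap around to the end of the array and would be searched there instead.
    return int(np.argmax(phi[1:i_max])) + 1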
In the following, the processes performed by the DOA & CC calculation section (411) will be first described.
The DOA & CC calculation section (411) may calculate a channel correlation metric, vT, for all the possible microphone pairs (l, m). The channel correlation metric represents a degree of correlation between the microphones.
The channel correlation metric, v, can be calculated according to the following equation (3) when the number of microphones is three or more. In that case, the CSP-target is set to the average of the φl,m(îl,m) calculated for all the possible microphone pairs (l, m).
v = \max\left(0,\ \frac{2}{M(M-1)} \sum_{l<m} \bar{\varphi}_{l,m}(\hat{i}_{l,m})\right)   (3)
where v denotes the channel correlation metric calculated over all the possible microphone pairs (l, m), and M denotes the number of microphones. The suffix of the frame number, T, is omitted in the equation (3).
The channel correlation metric, v, can be calculated according to the following equation (3a), when the number of microphones is two.
v_T = \max\left(0,\ \bar{\varphi}_T(\hat{i}_T)\right)   (3a)
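Purely for illustration (not part of the embodiment), the channel correlation metric of equations (3) and (3a) could be sketched as follows, assuming the CSP coefficients and target DOA indexes of all microphone pairs are already available; the names are illustrative.

import numpy as np

def channel_correlation(phi_pairs, doa_pairs):
    # Equations (3)/(3a) (sketch): average the CSP value at the estimated target DOA
    # over all microphone pairs and clip the result at zero. With a single pair the
    # average reduces to equation (3a).
    values = [phi[i_hat] for phi, i_hat in zip(phi_pairs, doa_pairs)]
    return max(0.0, float(np.mean(values)))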
The process performed by the DOA & TF calculation section (421) will be later described after the explanation of the post-filtering processing performed by the post-filter section (415).
The aliasing metric section (412) may calculate an aliasing metric, ET, based on the direction of arrival of the target speeches and the directions of arrival of the other speeches. The aliasing metric can be calculated according to the following equations (4) and (5). When the number of microphones is three or more, the average of the aliasing metrics over all the possible microphone pairs (l, m) is used.
E_{l,m}(n) = \cos\left(2\pi \cdot n \cdot (\hat{i}_{l,m} - \hat{j}_{l,m})/N\right)   (4)
E(n) = \frac{2}{M(M-1)} \sum_{l<m} E_{l,m}(n)   (5)
where N denotes the total number of DFT bins; M denotes the number of microphones; îl,m denotes the DOA index for the target speaker as seen from the microphone pair (l, m); and ĵl,m denotes the DOA index for the interfering speaker as seen from the microphone pair (l, m). The suffix of the frame number, T, is omitted in the equations (4) and (5).
With reference now to the upper part (501) in FIG. 5, sound waves (531, 532, 533) are shown. In a case where El,m(n) (see 541) is large, the sound wave (531) in the target-speaker direction (522) has a phase similar to the one in the interfering-speaker direction (521) for the n-th DFT bin. Because the MVBF and the post-filter work based on the phase information, they confuse the sound from the interfering-speaker side (511) with the sound from the target-speaker side (512). That means that E(n) can be treated as a confidence metric of the MVBF and the post-filter. In a case where E(n) is large, the n-th DFT bin has lower confidence in the output of the MVBF and the post-filter.
With reference now to the lower part (551) in FIG. 5, an example of the aliasing metric is shown by the dashed line. The vertical axis denotes the aliasing metric E(n) and the horizontal axis denotes the DFT bin number. This indicates that lower-confidence regions are observed at regular intervals in frequency, depending on the directions of the interfering speaker and the target speaker.
With reference now back to FIGS. 4A and 4B, the aliasing metric, E(n), will be passed to the filter bank section (413) in order to carry out filter bank processing. In the following, d denotes an index of the filter bank.
The filter bank section (413) may calculate a filtering version, ed, of the aliasing metric, E(n), using the filter bank, for example, Mel-filtering-bank. The aliasing metric, E(n), is reduced to the lower dimensional signal, ed.
The filtering version, ed, of the aliasing metric, E(n), can be calculated according to the following equation (6). The filtering version, ed, may be a Mel-band-pass filtered version of the aliasing metric, E(n).
e_d = \sum_n \max(0,\ E(n)) \cdot B_{d,n} \Big/ \sum_n B_{d,n}   (6)
where Bd,n is a distribution of the d-th filter in the n-th bin.
The output, ed, will be passed to the factorial modeling section (417).
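Purely for illustration (not part of the embodiment), the aliasing metric of equation (4) and its filtered version of equation (6) could be sketched as follows; the averaging over microphone pairs of equation (5) is omitted for brevity, and the names are illustrative.

import numpy as np

def aliasing_metric(i_hat, j_hat, n_bins):
    # Equation (4) (sketch): E(n) = cos(2*pi*n*(i_hat - j_hat)/N) for one microphone pair,
    # with N taken here as the number of DFT bins evaluated.
    n = np.arange(n_bins)
    return np.cos(2.0 * np.pi * n * (i_hat - j_hat) / n_bins)

def filtered_aliasing_metric(E, B):
    # Equation (6) (sketch): Mel-band-pass filtered version e_d of the aliasing metric,
    # where B[d, n] is the d-th filter of the filter bank.
    E_pos = np.maximum(0.0, E)
    return (B @ E_pos) / (B.sum(axis=1) + 1e-10)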
The MVBF section (414) enhances the speech signals arriving from the direction of arrival of the target speeches, based on the speech signals and the direction of arrival of the target speeches, to generate the enhanced speech signals. In detail, the MVBF section (414) receives the DOA index, îT, and then carries out the MVBF in order to obtain an output of the adaptive beamformer, UT. The MVBF minimizes ambient noise while maintaining a constant gain in the target direction. The output of the adaptive beamformer, UT, is a power spectrum. The MVBF is described, for example, in the following non-patent literature: F. Asano, H. Asoh, and T. Matsui, "Sound source localization and separation in near field," IEICE Trans., E83-A, No. 11, pp. 2286-2294, 2000. The power spectrum, UT, will be passed to the post-filter section (415).
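As a purely illustrative aside, the following sketch shows a generic minimum variance (MVDR) beamformer of the kind the MVBF section may use; it is not the implementation of the cited literature, and the spatial covariance estimation and the loading constant are assumptions for illustration only.

import numpy as np

def mvdr_power_spectrum(S, steering, R):
    # S:        (num_mics, num_bins) complex spectra of the current frame.
    # steering: (num_mics, num_bins) steering vectors toward the target DOA.
    # R:        (num_bins, num_mics, num_mics) spatial covariance matrices, assumed to be
    #           estimated (e.g., averaged) over preceding frames.
    # Per bin: w = R^-1 d / (d^H R^-1 d); the beamformer output power is U_T(n) = |w^H x|^2.
    num_mics, num_bins = S.shape
    U = np.zeros(num_bins)
    for n in range(num_bins):
        d = steering[:, n][:, None]
        w = np.linalg.solve(R[n] + 1e-6 * np.eye(num_mics), d)   # diagonal loading
        w = w / (d.conj().T @ w)                                  # distortionless constraint
        U[n] = np.abs((w.conj().T @ S[:, n][:, None]).item()) ** 2
    return U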
The post-filter section (415) carries out the post-filtering processing for the power spectrum, UT, in order to obtain an output of the post-filter, YT.
In one embodiment of the post-filtering processing, the power spectrum, YT, can be calculated according to the following equation (7). This embodiment is only applied for the system (401) described in FIG. 4A. In this embodiment, the value common to all frequencies can be obtained.
Y_T(n) = v_T \cdot U_T(n)   (7)
where vT denotes a channel correlation metric.
In another embodiment of the post-filtering processing, the power spectrum, UT, can be filtered per spectral bin, as Zelinski's post-filter does, to obtain YT. This embodiment is applied only to the system (402) described in FIG. 4B. Zelinski's post-filter is described, for example, in the following non-patent literature: R. Zelinski, "A microphone array with adaptive post-filtering for noise reduction in reverberant rooms," Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2578-2581, 1988. In this embodiment, the value in each frequency band can be obtained.
The transfer function H of Zelinski's post-filter is represented as the following equation (8).
H(n) = \left[ \frac{2}{M(M-1)} \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \Re\{\hat{\phi}_{l,m}(n)\} \right] \Big/ \left[ \frac{1}{M} \sum_{l=1}^{M} \hat{\phi}_{l,l}(n) \right]   (8)
where \hat{\phi}_{i,j}(n) is a smoothed auto- or cross-spectral density between the microphones on channels i and j for DFT bin n, and \Re\{\cdot\} is a function for extracting the real part of a complex number. The suffix of the frame number, T, is omitted in the equation (8).
In the equation (8), \hat{\phi}_{i,j} is calculated as a local average around frame T, according to the following equation (9).
\hat{\phi}_{(T),i,j}(n) = \frac{1}{2L+1} \sum_{l=-L}^{L} \phi_{(T+l),i,j}(n)   (9)
The cross-spectral density is calculated after steering, according to the following equation (10).
\phi_{(T),1,2}(n) = S_{(T),1}(n) \cdot \left\{ S_{(T),2}(n) \cdot e^{j\tau} \right\}^{*}   (10)
where ST,i is the complex spectrum of the observation at microphone i, and τ is given by the following equation (11).
\tau = 2\pi \cdot \hat{i}_T \cdot n / M   (11)
where îT is the DOA index for the target speaker, determined by CSP analysis, and M is the DFT size.
The output is then calculated, according to the following equations (12) and (13).
H_T'(n) = \max(H_T(n),\ 0.0)   (12)
Y_T(n) = H_T'(n) \cdot U_T(n)   (13)
Please note that only a non-negative value is taken, since UT is the power spectrum of the MVBF output.
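Purely for illustration (not part of the embodiment), the post-filtering of equations (8), (12) and (13) could be sketched as follows, assuming the smoothed and steered auto-/cross-spectral densities of equations (9) and (10) have already been computed; the names are illustrative.

import numpy as np

def zelinski_post_filter(phi_hat, U):
    # phi_hat: (M, M, num_bins) smoothed auto-/cross-spectral densities after steering.
    # U:       (num_bins,) power spectrum output of the MVBF for the current frame.
    M = phi_hat.shape[0]
    num = 0.0
    for l in range(M - 1):
        for m in range(l + 1, M):
            num = num + np.real(phi_hat[l, m])       # numerator of equation (8)
    num = num * 2.0 / (M * (M - 1))
    den = np.mean([np.real(phi_hat[l, l]) for l in range(M)], axis=0)  # denominator of (8)
    H = np.maximum(num / (den + 1e-10), 0.0)         # equations (8) and (12)
    return H * U                                      # equation (13): Y_T(n)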
The output, YT, from the post-filter section (415) will be passed to the filter bank section (416) in order to carry out a filter bank processing.
The filter bank section (416) may calculate a filtering version of the output, YT, using a filter bank, for example, a Mel-filter bank, and the output is logarithmic-converted to obtain yT. The obtained yT is a log power spectrum, for example, a log-Mel power spectrum.
The obtained yT is actually pre-processed with gain adaptation so as to maximize the total likelihood of the utterance. This is because a Gaussian mixture model (GMM) in the log-Mel spectrum domain depends on the input gain.
The obtained yT from the filter bank section (416) will be passed to the factorial modeling section (417).
The following descriptions relating to each of the DOA & TF calculation section (421) and the filter bank (422) described in FIG. 4B are applied only for the system (402).
In a case where Zelinski's post-filter is used, the channel correlation metric, vT, must be calculated as a cross-spectrum-based metric.
The DOA & TF calculation section (421) calculates a cross-spectrum-based metric, HT, for all the possible microphone pairs (l, m). The cross-spectrum-based metric, HT, can be calculated according to the equation (8) mentioned above. HT is the same as H(n); the bin index (n) is omitted here.
The filter bank section (422) may calculate a filtering version of the output, HT, using a filter bank, for example, a Mel-filter bank. Accordingly, the obtained cross-spectrum-based metric, vT, is a Mel-filtered version of HT. The output, vT, from the filter bank section (422) can be calculated according to the following equation (14).
v_d = \max\left(0,\ \sum_n H(n) \cdot B_{d,n} \Big/ \sum_n B_{d,n}\right)   (14)
where H(n) is calculated by the equation (8) mentioned above, and Bd,n is a distribution of the d-th filter in the n-th bin. The suffix of the frame number, T, is omitted in the equation (14). Further, the suffix of the filter bank, d, will be omitted in the following sections for simplicity.
The factorial modeling section (417) is one key feature of an embodiment of the present invention. In the factorial modeling section (417), a factorial model comprising two factors is introduced. The factorial model is a probability model which is the product of a first normal distribution and a second normal distribution. The factorial model is represented as the following equation (15). Hereinafter, the suffix, T, is omitted.
p(z \mid y, e, v) \propto p(z \mid y) \cdot p(z \mid e, v)   (15)
where y denotes the output from the filter bank section (416); e denotes the output from the filter bank section (413); v denotes the channel correlation metric from the DOA & CC calculation section (411) or, in the system (402), the filtered cross-spectrum-based metric from the filter bank section (422); and z denotes the output of the factorial modeling section (417). The first normal distribution is represented as a model, p(z|y). The first normal distribution, p(z|y), is a model which has learned features of clean speeches. The clean speeches may be obtained in a quiet room. For example, the first normal distribution, p(z|y), may be the probabilistic distribution of the estimated clean speech z based on the output y from the filter bank section (416). The first normal distribution, p(z|y), is trained in advance as a Gaussian mixture model, using clean speech data (471).
The second normal distribution is represented as a model, p(z|e,v). The second normal distribution is a model having a mean in the probability distribution of the enhanced speech signals. In detail, the second normal distribution may be the probabilistic distribution of the estimated clean speech z based on the confidence metric calculated from the filtering version, e, of the aliasing metric and the channel correlation metric, v. The second normal distribution is designed as a set of Gaussian distributions, each associated with a component of the first normal distribution model. The second normal distribution has a higher probability of z around the current y. Its variance is designed to be small when the confidence metric is high, and to be large when the confidence metric is low. This shifts the product distribution more toward the model-based value when the confidence is low, and more toward y (pass-through) when the confidence is high. Further, a band with higher confidence contributes more to the total probability.
The distribution of the product probability, p(z|e,v,y), can be a Gaussian mixture model (GMM), because the product of two Gaussian distributions, i.e., the first normal distribution and the second normal distribution, is also a Gaussian distribution.
The first normal distribution model is given as the following equation (16).
p(z \mid y) = \sum_{k}^{K} \rho_k(y) \cdot N(z;\ \mu_{x,k},\ \Sigma_{x,k})   (16)
where k denotes each index in the mixed normal distribution; N denotes a normal distribution; μ denotes a mean vector; and Σ denotes a variance-covariance matrix, for which a diagonal covariance matrix may be used. μx,k, Σx,k and γk are given for each k-th Gaussian. ρk(y) is the posterior probability that the k-th normal distribution is selected when y is observed. The posterior probability, ρk(y), is given as the following equation (17).
\rho_k(y) = \gamma_k \cdot N(y;\ \mu_{x,k},\ \Sigma_{x,k}) \Big/ \sum_{k} \gamma_k \cdot N(y;\ \mu_{x,k},\ \Sigma_{x,k})   (17)
where γk is the prior probability of the k-th component of the clean speech model.
The second normal distribution is given as the following equation (18).
p(z \mid e, v) = N(z;\ y,\ \psi(e, v))   (18)
where ψ is created by scaling each component of the variance-covariance matrix of the clean speech model. The scaling is set to a smaller value in a case where the aliasing metric, e, has a value closer to zero, or where the channel correlation metric, v, or the cross-spectrum-based metric is close to one.
The variance, ψ, is designed as a scaled version of Σ. The scaling is performed with the parameters e, v, or a combination of these. For example, the variance, ψ, can be calculated according to the following equations (19), (20), (21) and (22), by scaling the k-th Gaussian at the d-th band in the speech model. In the following equations (19), (20), (21) and (22), α, β and γ each denote a constant, and ε is a very small value in order to avoid zero.
\psi_{k,d} = \Sigma_{x,k,d} \cdot \beta \cdot (e_d + (1 - v) + \varepsilon)   (19)
\psi_{k,d} = \Sigma_{x,k,d} \cdot \beta \cdot \left(1 - \sqrt{v(1 - e_d)} + \varepsilon\right)   (20)
\psi_{k,d} = \Sigma_{x,k,d} \cdot \beta \cdot (e_d + \varepsilon)   (21)
\psi_{k,d} = \Sigma_{x,k,d} \cdot \beta \cdot \left(1 - 1/\left(1 + \exp(-\alpha(e_d - \gamma))\right)\right)^{-1}   (22)
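As a purely illustrative aside, the scaling of equation (19) could be sketched as follows (the alternatives of equations (20) to (22) can be written analogously); the parameter values and names are assumptions for illustration only.

import numpy as np

def scaled_variance(sigma_x_d, e_d, v, beta=1.0, eps=1e-3):
    # Equation (19) (sketch): inflate the clean-speech variance of the d-th band when the
    # aliasing metric e_d is large or the channel correlation v is small, and leave it
    # nearly unchanged when e_d is near zero and v is near one.
    return sigma_x_d * beta * (e_d + (1.0 - v) + eps)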
Accordingly, the distribution of the product probability, p(z|e,v,y), can be expressed as the following equation (23).
p(z \mid y, e, v) = \sum_{k}^{K} Z_k^{-1}\, \rho_k'(y, e, v) \cdot N(z;\ \mu_{x,k},\ \Sigma_{x,k}) \cdot N(z;\ y,\ \psi_k(e, v)) = \sum_{k}^{K} Z_k^{-1}\, \rho_k'(y, e, v) \cdot N(z;\ \mu_{z,k}',\ \Sigma_{z,k}')   (23)
where Zk is a normalization constant for setting the integral of the probability distribution to one.
The means, μz,k′, and the variances, Σz,k′, of the distribution of the product probability, p(z|e,v,y), are given by the following equations (24) and (25), respectively.
\mu_{z,k}' = \Sigma_{z,k}' \left( \Sigma_{x,k}^{-1} \mu_{x,k} + \psi_k^{-1} y \right)   (24)
\Sigma_{z,k}' = \left( \Sigma_{x,k}^{-1} + \psi_k^{-1} \right)^{-1}   (25)
where μx,k is the mean of the clean speech model and Σx,k is the variance of the clean speech model. The mean, μx,k, and the variance, Σx,k, are given in advance.
Further, the posterior probability, ρk(y), of the k-th normal distribution is expanded to ρk′(y,e,v). The expanded posterior probability, ρk′(y,e,v), is given as the following equation (26).
ρ k ( y , e , v ) = γ k · N ( y ; μ z , k , Σ z , k ) / k γ k · N ( y ; μ z , k , Σ z , k ) ( 26 )
where γk is the prior probability of the clean speech model. The prior probability, γk, is given in advance.
The variance, Σz,k′, used for the posterior probability, ρ′, becomes smaller than the original variance, Σx,k, for the d-th band in a case where the aliasing metric, e, for the d-th frequency band has a value closer to zero, or where the channel correlation metric, v, or the cross-spectrum-based metric is close to one.
According to the equation (26), as stated above, in a case where the aliasing metric, e, for the d-th frequency band has a value closer to zero, or where the channel correlation metric, v, or the cross-spectrum-based metric is close to one, the variance, ψk,d, becomes smaller and the variance, Σ′z,k,d, becomes smaller than the original variance, Σx,k,d. This makes the d-th band Gaussian more sensitive, so that the contribution of that frequency band becomes larger in the estimation of the posterior probability. Accordingly, a frequency band having high reliability can be actively utilized as a key. Further, for such a frequency band, d, the mean vector, μz,k,d′, shifts toward yd, and the distribution of the product probability, p(z|e,v,y), is shifted closer to the second normal distribution. That is, the distribution of the product probability, p(z|e,v,y), is shifted from the model-estimated value toward the y from the filter bank section (416) in a case where the aliasing metric, e, has a value closer to zero, or where the channel correlation metric, v, or the cross-spectrum-based metric is close to one. This is because the second normal distribution is a model with a higher probability around z = y.
Meanwhile, the confidence or reliability of a frequency band becomes smaller in the estimation of the posterior probability in a case where the aliasing metric, e, for the d-th frequency band has a value closer to one, or where the channel correlation metric, v, or the cross-spectrum-based metric is close to zero. This gives the d-th band Gaussian a larger variance, so that the contribution of that frequency band becomes low in the estimation of the posterior probability. Further, for such a frequency band, d, the mean vector, μz,k,d′, shifts toward μx,k,d, and the distribution of the product probability, p(z|e,v,y), is shifted closer to the first normal distribution, i.e., the distribution of the speech model (471). This compensates only for the degraded part.
The final estimated output, ẑ, from the factorial modeling section (417) can be obtained using the minimum mean square error (MMSE). The final estimated output, ẑ, can be calculated according to the following equation (27).
\hat{z} = \int z \cdot p(z \mid y, e, v)\, dz \approx \sum_{k}^{K} \rho_k'(y, e, v) \cdot \mu_{z,k}'   (27)
The final estimated output, ẑ, will be passed to the ASR or logger section (418). The ASR section (418) may output the final estimated output, ẑ, as a recognized result of the speech. The logger section (418) may store the final estimated output, ẑ, into a storage, such as the disk (108) described in FIG. 1.
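Purely for illustration (not part of the embodiment), the combination of equations (24) to (27) could be sketched as follows, assuming diagonal covariances; the variable names are illustrative.

import numpy as np

def factorial_mmse(y, psi, mu_x, sigma_x, gamma):
    # y:       (D,) log-Mel observation from the filter bank section (416).
    # psi:     (K, D) scaled variances of the second normal distribution, equations (19)-(22).
    # mu_x:    (K, D) means of the clean-speech GMM; sigma_x: (K, D) diagonal variances.
    # gamma:   (K,) prior weights of the clean-speech GMM.
    # Product-of-Gaussians parameters, equations (24) and (25):
    sigma_z = 1.0 / (1.0 / sigma_x + 1.0 / psi)
    mu_z = sigma_z * (mu_x / sigma_x + y / psi)
    # Posterior weights, equation (26), evaluated with diagonal Gaussians in the log domain:
    log_lik = -0.5 * np.sum(np.log(2.0 * np.pi * sigma_z) + (y - mu_z) ** 2 / sigma_z, axis=1)
    log_post = np.log(gamma) + log_lik
    log_post -= np.max(log_post)
    rho = np.exp(log_post)
    rho /= rho.sum()
    # MMSE estimate, equation (27): posterior-weighted sum of the product-Gaussian means.
    return rho @ mu_z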
With reference now to FIG. 6, FIG. 6 illustrates experimental results according to one embodiment of the present invention.
In a small, quiet meeting room, two omni-directional microphones were placed on the table between two subject speakers, A and B. The distance between the microphones was 12 cm. The beamformer operated at a 22.05-kHz sampling frequency.
The two subject speakers alternately read 100 sentences written in Japanese, and the speeches were recorded. Using the recorded speeches as test data, mixed speech data was generated as the evaluation data. The mixed speech data simulates simultaneous utterances by the two subject speakers. In detail, part of the speech segments obtained from the subject speaker A was extracted, scaled by 50%, and then superimposed continuously onto the speech segments obtained from the subject speaker B. The resulting one hundred utterances were used as the target of the ASR. The speeches after the superposition were input to the adaptive beamformer and the post-filter and, after the processing, the utterance split was performed.
In this test data, there was almost no complete silence during the speaking of the subject speaker B. This means that a mixed-voice state continues while the subject speaker B is speaking.
Therefore, only the speech segments of the subject speaker B were cut out in order to focus on the performance in the simultaneous speech sections. Accordingly, the evaluation using ASR was performed only for speaker B.
The experimental results are shown in Table 1 (601).
Table 1 (601) shows the Character Error Rate (CER) in %. The speech recognition accuracy was evaluated by the CER.
Case 1 is a baseline of the evaluation, as a reference. Cases 2 to 4 are comparative examples. Case 5 is the Example according to an embodiment of the present invention.
Case 1: Case 1 was a baseline using the single microphone nearest to the subject speaker. The resulting CER, 62.1%, is very high.
Case 2: Case 2 is the simple MVBF system. It showed much improvement for the mixed speech, but little for the alternating speech. The MVBF achieved some speech separation in the mixed speech segments, but it did not sufficiently suppress the interfering speaker's speech. The effect of the MVBF was observed, but the resulting CER, 39.7%, is still high.
Case 3: Case 3 uses Zelinski's post-filter, which was further applied to the case 2 system. The effect of Zelinski's post-filter was observed, but the resulting CER, 20.8%, is still not sufficient.
Case 4: The output of the case 3 was completely replaced with the estimation value of a clean speech model, p(z|y).
Case 5: Case 5 was the system performing the factorial modeling according to an embodiment of the present invention. The output of the case 3 was set to v. Using the factorial modeling, the output of the case 3 was partially replaced by amending the data having a low degree of reliability with the data having high reliability. The CER was further reduced compared to case 4.
The present invention may be a method, a system, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
The expression "comprise(s)/comprising a/one" should be understood as "comprise(s)/comprising at least one".
The expression "comprise(s)/comprising" should be understood as "comprise(s)/comprising at least".
The expression "/" should be understood as "and/or".

Claims (20)

What is claimed is:
1. A method for extracting target speeches from a plurality of speeches originating from different directions of arrival, the method comprising:
obtaining speech signals from each of a multiple of speech input devices disposed apart in predetermined distances from one another;
calculating, based on the speech signals, a direction of arrival of target speeches and directions of arrival of other speeches other than the target speeches for each of at least one pair of speech input devices;
calculating, for each of the at least one pair of speech input devices, an aliasing metric, based on the direction of arrival of the target speeches and the directions of arrival of the other speeches, wherein the aliasing metric indicates which frequency band of speeches is susceptible to spatial aliasing;
enhancing, using an adaptive beamformer, speech signals arriving from the direction of arrival of the target speech signals, based on the speech signals and the direction of arrival of the target speeches, to generate the enhanced speech signals;
reading a probability model, which is the product of a first normal distribution and a second normal distribution, wherein the first normal distribution is a model which has learned features of clean speeches and the second normal distribution is a model which has a mean in the probability distribution of the enhanced speech signals and is made so as to have a variance smaller than that of the first normal distribution in a case where the aliasing metric is close to zero; and
inputting the enhanced speech signals and the aliasing metric to the probability model to output target speeches.
2. The method according to claim 1, the method further comprising calculating, for each of the at least one pair of the speech input devices, a channel correlation metric which represents a degree of correlation between the speech input devices or a cross-spectrum-based metric between the speech input devices, based on the obtained speech signals and
wherein the second normal distribution is made so as to have a variance smaller than that of the first normal distribution in a case where the channel correlation metric or the cross-spectrum-based metric is close to one.
3. The method according to claim 2, wherein a cross-spectrum-based metric as the channel correlation metric is processed by a filter bank and then input to the probability model, when use is made of a post filter after the use of the adaptive beamformer in the enhancement of the speech signals.
4. The method according to claim 2, wherein if the number of the speech input devices is more than or equal to three, the correlation metrics for all pairs of the speech input devices are averaged and then input to the probability model or processed using a filter bank or a post filter.
5. The method according to claim 2, wherein if the number of the speech input devices is more than or equal to three, the cross-spectral densities for all pairs of the speech input devices are averaged and input to the probability model or processed using a filter bank or a post filter.
6. The method according to claim 2, wherein the following are repeated for each frame of the speech signals: obtaining speech signals, calculating a direction of arrival of target speeches and directions of arrival of other speeches, calculating a channel correlation metric or the cross-spectrum-based metric, calculating an aliasing metric, generating enhanced speech signals, reading a probability model, and inputting the enhanced speech signals and the aliasing metric to the probability model to output target speeches.
7. The method according to claim 1, wherein the second normal distribution is made so as to have a variance larger than that of the first normal distribution in a case where the aliasing metric is close to one.
8. The method according to claim 1, the method further comprising calculating, for each of the at least one pair of the speech input devices, a channel correlation metric which represents a degree of correlation between the speech input devices or a cross-spectrum-based metric between the speech input devices, based on the obtained speech signals, and
wherein the second normal distribution is made so as to have a variance larger than that of the first normal distribution in a case where the channel correlation metric or the cross-spectrum-based metric between the speech input devices is close to zero.
9. The method according to claim 1, wherein use is made of a filter bank after the use of the adaptive beamformer in the enhancement of the speech signals.
10. The method according to claim 9, wherein the speech signals are logarithmic-converted after the use of the filter bank in the enhancement of the speech signals.
11. The method according to claim 1, wherein use is made of a post filter after the use of the adaptive beamformer in the enhancement of the speech signals.
12. The method according to claim 11, wherein use is made of a filter bank after the use of the post filter in the enhancement of the speech signals.
13. The method according to claim 12, wherein the speech signals are logarithmic-converted after the use of the filter bank in the enhancement of the speech signals.
14. The method according to claim 1, wherein the calculated aliasing metric is processed by a filter bank and then input to the probability model.
15. The method according to claim 1, wherein if the number of the speech input devices is more than or equal to three, the calculated aliasing metrics are averaged and then input to the probability model or processed using a filter bank.
16. The method according to claim 1, wherein the following are repeated for each frame of the speech signals: obtaining speech signals, calculating a direction of arrival of target speeches and directions of arrival of other speeches, calculating an aliasing metric, generating enhanced speech signals, reading a probability model, and inputting the enhanced speech signals and the aliasing metric to the probability model to output target speeches.
17. A system, comprising:
a processor; and
a memory storing a program, which, when executed on the processor, performs an operation for separating a target speech from a plurality of other speeches having different directions of arrival, the operation comprising:
obtaining speech signals from each of a multiple of speech input devices disposed apart in predetermined distances from one another;
calculating, based on the speech signals, a direction of arrival of target speeches and directions of arrival of other speeches other than the target speeches for each of at least one pair of speech input devices;
calculating, for each of the at least one pair of speech input devices, an aliasing metric, based on the direction of arrival of the target speeches and the directions of arrival of the other speeches, wherein the aliasing metric indicates which frequency band of speeches is susceptible to spatial aliasing;
enhancing, using an adaptive beamformer, the speech signals arriving from the direction of arrival of the target speeches, based on the speech signals and the direction of arrival of the target speeches, to generate enhanced speech signals;
reading a probability model which is the product of a first normal distribution and a second normal distribution, wherein the first normal distribution is a model which has learned features of clean speeches and the second normal distribution is a model which has a mean in the probability distribution of the enhanced speech signals and is made so as to have a variance smaller than that of the first normal distribution in a case where the aliasing metric is close to zero; and
inputting the enhanced speech signals and the aliasing metric to the probability model to output target speeches.
18. The system according to claim 17, the operation further comprising calculating, for each of the at least one pair of the speech input devices, a channel correlation metric which represents a degree of correlation between the speech input devices or a cross-spectrum-based metric between the speech input devices, based on the obtained speech signals, and
wherein the second normal distribution is made so as to have a variance smaller than that of the first normal distribution in a case where the channel correlation metric or the cross-spectrum-based metric is close to one.
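Claims 8, 18, and 20 recite a channel correlation metric or a cross-spectrum-based metric between the two devices of a pair without fixing a particular estimator. A minimal sketch, assuming magnitude-squared coherence as the cross-spectrum-based measure:

```python
import numpy as np
from scipy.signal import coherence

def channel_correlation_metric(channel_a, channel_b, sample_rate, nperseg=512):
    """Cross-spectrum-based metric for one microphone pair.

    Assumption: magnitude-squared coherence serves as the per-band measure.
    Values near 1 mean both channels observe the same (target-dominated)
    signal; values near 0 suggest diffuse noise or competing talkers.
    """
    freqs, cxy = coherence(channel_a, channel_b, fs=sample_rate, nperseg=nperseg)
    return freqs, np.clip(cxy, 0.0, 1.0)
```

Under this reading, bands where the metric is close to one keep a small variance in the second normal distribution (claim 18), while bands close to zero are given a larger variance (claim 8).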
19. A computer program product for separating a target speech from a plurality of other speeches having different directions of arrival, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, wherein the computer readable storage medium is not a transitory signal per se, the program instructions executable by a computer to cause the computer to perform a method comprising:
obtaining speech signals from each of a plurality of speech input devices disposed apart at predetermined distances from one another;
calculating, based on the speech signals, a direction of arrival of target speeches and directions of arrival of other speeches other than the target speeches for each of at least one pair of speech input devices;
calculating, for each of the at least one pair of speech input devices, an aliasing metric, based on the direction of arrival of the target speeches and the directions of arrival of the other speeches, wherein the aliasing metric indicates which frequency band of speeches is susceptible to spatial aliasing;
enhancing, using an adaptive beamformer, the speech signals arriving from the direction of arrival of the target speeches, based on the speech signals and the direction of arrival of the target speeches, to generate enhanced speech signals;
reading a probability model which is the product of a first normal distribution and a second normal distribution, wherein the first normal distribution is a model which has learned features of clean speeches and the second normal distribution is a model which has a mean in the probability distribution of the enhanced speech signals and is made so as to have a variance smaller than that of the first normal distribution in a case where the aliasing metric is close to zero, and
inputting the enhanced speech signals and the aliasing metric to the probability model to output target speeches.
20. The computer program product according to claim 19, the method further comprising calculating, for each of the at least one pair of the speech input devices, a channel correlation metric which represents a degree of correlation between the speech input devices or a cross-spectrum-based metric between the speech input devices, based on the obtained speech signals, and
wherein the second normal distribution is made so as to have a variance smaller than that of the first normal distribution in a case where the channel correlation metric or the cross-spectrum-based metric is close to one.
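The independent claims combine the enhanced speech signals and the aliasing metric in a probability model that is the product of a clean-speech normal distribution and a second normal distribution centred on the enhanced observation. The sketch below shows one way such a product can be evaluated per band: the second Gaussian's variance grows with the aliasing metric, so aliased bands lean on the clean-speech prior while reliable bands follow the beamformer output. The feature domain, the variance schedule (`var_floor`, `var_scale`), and the single-Gaussian prior are illustrative assumptions; the patent's model may, for instance, use a mixture learned from clean speech.

```python
import numpy as np

def fuse_with_clean_speech_prior(enhanced_feats, aliasing_metric,
                                 clean_mean, clean_var,
                                 var_floor=1e-3, var_scale=10.0):
    """Combine beamformed features with a clean-speech Gaussian prior.

    enhanced_feats  : per-band features of the beamformed (enhanced) frame
    aliasing_metric : per-band value in [0, 1]; ~0 = reliable, ~1 = aliased
    clean_mean, clean_var : parameters of the first (clean-speech) Gaussian
    """
    # Second Gaussian: centred on the enhanced observation, variance widened
    # in bands that the aliasing metric flags as unreliable.
    obs_var = var_floor + var_scale * aliasing_metric * clean_var

    # Mode of the product of the two Gaussians = precision-weighted mean.
    precision = 1.0 / clean_var + 1.0 / obs_var
    fused = (clean_mean / clean_var + enhanced_feats / obs_var) / precision
    return fused
```

The precision-weighted mean is simply the peak of the Gaussian product: where the aliasing metric is near zero, the observation variance is small and the output follows the enhanced signal; where it is near one, the output is pulled toward the clean-speech mean.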
US15/077,523 2016-03-22 2016-03-22 Extraction of target speeches Expired - Fee Related US9640197B1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US15/077,523 US9640197B1 (en) 2016-03-22 2016-03-22 Extraction of target speeches
US15/440,773 US9818428B2 (en) 2016-03-22 2017-02-23 Extraction of target speeches

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/440,773 Continuation US9818428B2 (en) 2016-03-22 2017-02-23 Extraction of target speeches

Publications (1)

Publication Number Publication Date
US9640197B1 true US9640197B1 (en) 2017-05-02

Family

ID=58629323

Family Applications (2)

Application Number Title Priority Date Filing Date
US15/077,523 Expired - Fee Related US9640197B1 (en) 2016-03-22 2016-03-22 Extraction of target speeches
US15/440,773 Expired - Fee Related US9818428B2 (en) 2016-03-22 2017-02-23 Extraction of target speeches

Country Status (1)

Country Link
US (2) US9640197B1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10999132B1 (en) * 2017-10-25 2021-05-04 Amazon Technologies, Inc. Detecting degraded network monitoring agents
CN108899044B (en) * 2018-07-27 2020-06-26 苏州思必驰信息科技有限公司 Voice signal processing method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6339758B1 (en) * 1998-07-31 2002-01-15 Kabushiki Kaisha Toshiba Noise suppress processing apparatus and method
US8712770B2 (en) 2007-04-27 2014-04-29 Nuance Communications, Inc. Method, preprocessor, speech recognition system, and program product for extracting target speech by removing noise
US8503697B2 (en) 2009-03-25 2013-08-06 Kabushiki Kaisha Toshiba Pickup signal processing apparatus, method, and program product
US8762137B2 (en) 2009-11-30 2014-06-24 International Business Machines Corporation Target voice extraction method, apparatus and program product
US20120140948A1 (en) 2010-07-02 2012-06-07 Panasonic Corporation Directional microphone device and directivity control method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Delikaris-Manias, S. et al., "Cross Pattern Coherence Algorithm for Spatial Filtering Applications Utilizing Microphone Arrays" IEEE Transactions on Audio, Speech, and Language Processing vol. 21, No. 11, Nov. 2013. (pp. 2356-2367).
Dmour, M.A., "Mixture of Beamformers for Speech Separation and Extraction" Engineering thesis and dissertation collection, Oct. 2010. (pp. 1-211).
Himawan, I. et al., "Microphone Array Beamforming Approach to Blind Speech Separation" Machine Learning for Multimodal Interaction, vol. 4892, Jun. 2007. (pp. 295-305).
Maganti, H.K. et al., "Speech Enhancement and Recognition in Meetings With an Audio-Visual Sensor Array" IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, No. 8, Nov. 2007. (pp. 2257-2269).

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2566755A (en) * 2017-09-25 2019-03-27 Cirrus Logic Int Semiconductor Ltd Talker change detection
US10580411B2 (en) 2017-09-25 2020-03-03 Cirrus Logic, Inc. Talker change detection
GB2566755B (en) * 2017-09-25 2021-04-14 Cirrus Logic Int Semiconductor Ltd Talker change detection
US11894008B2 (en) * 2017-12-12 2024-02-06 Sony Corporation Signal processing apparatus, training apparatus, and method
US11295740B2 (en) * 2019-08-22 2022-04-05 Beijing Xiaomi Intelligent Technology Co., Ltd. Voice signal response method, electronic device, storage medium and system
CN110931036A (en) * 2019-12-07 2020-03-27 杭州国芯科技股份有限公司 Microphone array beam forming method
US11443748B2 (en) * 2020-03-03 2022-09-13 International Business Machines Corporation Metric learning of speaker diarization
US11651767B2 (en) 2020-03-03 2023-05-16 International Business Machines Corporation Metric learning of speaker diarization
CN113095258A (en) * 2021-04-20 2021-07-09 深圳力维智联技术有限公司 Directional signal extraction method, system, device and storage medium

Also Published As

Publication number Publication date
US20170278524A1 (en) 2017-09-28
US9818428B2 (en) 2017-11-14

Similar Documents

Publication Publication Date Title
US9818428B2 (en) Extraction of target speeches
US11003983B2 (en) Training of front-end and back-end neural networks
JP4906908B2 (en) Objective speech extraction method, objective speech extraction apparatus, and objective speech extraction program
EP3289586B1 (en) Impulsive noise suppression
JP4875656B2 (en) Signal section estimation device and method, program, and recording medium
EP3807878B1 (en) Deep neural network based speech enhancement
US9984680B2 (en) Method for improving acoustic model, computer for improving acoustic model and computer program thereof
US10984814B2 (en) Denoising a signal
US9601124B2 (en) Acoustic matching and splicing of sound tracks
US10152507B2 (en) Finding of a target document in a spoken language processing
US20190385590A1 (en) Generating device, generating method, and non-transitory computer readable storage medium
Abutalebi et al. Speech enhancement based on β-order mmse estimation of short time spectral amplitude and laplacian speech modeling
US10586529B2 (en) Processing of speech signal
JP5726790B2 (en) Sound source separation device, sound source separation method, and program
JP6724290B2 (en) Sound processing device, sound processing method, and program
US12094481B2 (en) ADL-UFE: all deep learning unified front-end system
US10540990B2 (en) Processing of speech signals
US20170213548A1 (en) Score stabilization for speech classification
JP6125953B2 (en) Voice section detection apparatus, method and program
US10170103B2 (en) Discriminative training of a feature-space transform

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW YORK

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:FUKUDA, TAKASHI;ICHIKAWA, OSAMU;REEL/FRAME:038071/0987

Effective date: 20160311

STCF Information on status: patent grant

Free format text: PATENTED CASE

FEPP Fee payment procedure

Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

LAPS Lapse for failure to pay maintenance fees

Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STCH Information on status: patent discontinuation

Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362

FP Lapsed due to failure to pay maintenance fee

Effective date: 20210502