WO1996007177A1 - Apparatus and method for detecting speech in the presence of other sounds - Google Patents


Info

Publication number
WO1996007177A1
Authority
WO
WIPO (PCT)
Prior art keywords
likelihood
speech
sites
power
parameter
Prior art date
Application number
PCT/US1995/010942
Other languages
English (en)
Inventor
Brant M. Helf
Original Assignee
Picturetel Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Picturetel Corporation filed Critical Picturetel Corporation
Publication of WO1996007177A1

Links

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04M: TELEPHONIC COMMUNICATION
    • H04M3/00: Automatic or semi-automatic exchanges
    • H04M3/42: Systems providing special services or facilities to subscribers
    • H04M3/56: Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00: Television systems
    • H04N7/14: Systems for two-way working
    • H04N7/15: Conference systems
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/90: Pitch determination of speech signals

Definitions

  • This invention relates generally to apparatus and methods for detecting speech in the presence of other sounds and more particularly to video-conferencing networks having controllers for automatically directing audio and video signals to remote conferencing sites.
  • Multi-point video-conferencing networks are being used more often to enable the exchange of information among people at different remote conferencing sites.
  • Each site typically includes a television camera with at least one microphone and a video monitor with at least one loudspeaker.
  • The sites communicate through a controller, sometimes referred to as a bridge. All remote sites are connected to the controller like spokes in a rimless wheel.
  • Each remote site transmits audio and video signals; however, each remote site typically receives only one video signal from the controller.
  • When a participant at one site speaks, the video signals originating from that site are directed, via the controller, to the video monitors at the other sites.
  • The video signal being viewed on the video monitor at the site of the speaking participant typically originates from the site of the previous speaking participant.
  • audio signals can be mixed so that each site can receive an audio signal that is the sum of audio signals from the several sites; the speaking participant's audio signals being excluded from the mix at his/her site. Nevertheless, it is generally not desirable to simply add all incoming audio signals to provide the outgoing audio mix. It is generally better to exclude from the mixed audio signal those signals which contain only background noise. Also, it may be desirable to limit the number of included audio channels in which speech is present, since otherwise the resulting outgoing mix of too many speaking participants may be confusing.
  • It is desirable that the controller distinguish between speech and unwanted audio sounds, such as steady-state background noise from fans and air conditioning, sporadic sounds such as paper shuffling, chair squeaking, and footsteps, as well as ringing from inadvertently struck objects.
  • a programmed controller having a likelihood of speech estimation software module for generating a likelihood of speech parameter, lsp, in response to audio signals.
  • The likelihood of speech parameter is a function of the periodicity and power history of the audio signals. More particularly, the likelihood of speech parameter is generated as a function of the degree to which the periodicity and power of the audio signal are consistent with the pitch characteristics of speech; however, audio signals exhibiting rapid decay in power are attributed to ringing produced by sharply struck objects and are rejected as not being speech pitch.
  • the audio signals are sequentially windowed to produce a series of frames.
  • Each one of the frames is processed to detect a periodicity characteristic indicative of speech pitch.
  • the power in each one of the frames is also determined.
  • The likelihood of speech parameter, lsp, is determined as a function of both the detection of a periodicity characteristic of speech pitch and a determination of whether the power decays from previously processed frames to the current frame at a relatively high rate; a rate more associated with the ringing of a sharply struck object than with speech.
  • the likelihood of speech estimation module also determines power characteristics in the audio signals to supplement the use of periodicity detection. With such an arrangement, consideration is given to cases where pitch is not present, or not strongly present. Also, such power detection characteristic is used because pitch detection requires a relatively long period of audio signal analysis since detection of some pitch periodicity requires processing of several frames.
  • Sometimes the pitch characteristic is not readily detectable, as when the speech is in the presence of relatively loud noise or has been significantly corrupted by the audio digitization and coding (i.e., compression) process.
  • the likelihood of speech estimation module determines power characteristics in a plurality of different frequency bands.
  • The power characteristic for each band is determined with a relatively short time delay in accordance with the "absolute" power level and the ratio of the "absolute" power level to that band's background noise level.
  • a video-conferencing network includes a plurality of conferencing sites. Each site is in communications with a controller, or bridge, over a communications path. Each one of such sites has a camera and at least one microphone for producing video and audio signals representative of events taking place at such one of the sites, and a monitor and loudspeaker for reproducing video and audio signals directed to such one of the sites.
  • The controller, or bridge, is located at a central site. The controller is responsive to the audio and video signals produced at the sites and, in response thereto, automatically directs at least a portion of such audio signals originating at the plurality of sites, and a selected one of the video signals originating from the plurality of sites, to the monitors and loudspeakers at the sites.
  • the controller includes a plurality of likelihood of speech estimation software modules; one such module for each remote site. Each such module determines the likelihood of speech parameter, lsp, from the audio signals produced at the corresponding site.
  • a video controller module is included in the controller for directing one of the produced video signals to the sites, selectively, in accordance with the likelihood of speech parameter, lsp, determined for each of the sites.
  • An audio mixer module is also included in the controller for combining the audio signals produced at the plurality of sites and for directing such combined audio signals to the loudspeakers at the sites, selectively, in accordance with the likelihood of speech parameter, lsp, determined for each of the sites.
  • the likelihood of speech estimation modules determine the likelihood of speech parameters, lsp, for each site as a function of a determined periodicity characteristic indicative of speech pitch (i.e., a pitch parameter, pitch_par) from the audio signals produced at such site.
  • The pitch parameter for each site has a value which varies from 1.0 (i.e., indicating the highest likelihood of speech) to 0.0 (i.e., indicating the lowest likelihood of speech).
  • each likelihood of speech estimation module also determines such estimate as a function of the time history of the power in the audio signals. Such time history is used to assess whether the rate of decay of power is at a relatively high rate; i.e., a decay rate associated more with a sharply struck object that rings than with that of speech.
  • each one of the likelihood of speech estimation modules provides a parameter which is a function of the power characteristics of the audio signals. More particularly, they provide a relatively short delay determined power level parameter which is a function of the absolute power level of the audio signal and the ratio of the absolute power level to background noise level. Such a parameter is provided for each one of a plurality of different frequency bands.
  • A low frequency band of processed audio signals is used to generate an "instantaneous" low pass filter parameter, "instantaneous_lpf_par".
  • Such parameter is processed by a slow decay module to produce a low pass filter parameter, lpf_par, having a relatively fast rise time; but which is decayed slowly by the module.
  • The resulting, slowly decaying low pass filter parameter, lpf_par, has a value which varies from 1.0 (i.e., indicating the highest likelihood of speech) to 0.0 (i.e., indicating the lowest likelihood of speech).
  • A high pass filter band of processed audio signal is similarly used to provide an "instantaneous" high pass filter parameter, "instantaneous_hpf_par".
  • Such parameter is processed by a second slow decay module to produce a fast rise time and slowly decaying high pass filter parameter, hpf_par.
  • The high pass filter parameter, hpf_par, has a value which varies from 1.0 (i.e., indicating the highest likelihood of speech) to 0.0 (i.e., indicating the lowest likelihood of speech).
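The fast rise and slow decay described above can be sketched as a simple per-update rule. This is a minimal illustration, not the patent's implementation; the decay step size is an assumed value, since the text does not give the exact decay constants for lpf_par and hpf_par:

```python
def slow_decay(prev_par, instantaneous_par, decay_step=0.01):
    """Fast-rise / slow-decay smoothing of an instantaneous parameter.

    decay_step is an illustrative assumption; the text states only that
    the parameter rises quickly and is decayed slowly toward 0.0."""
    if instantaneous_par >= prev_par:
        # Rise quickly: jump straight up to the new instantaneous value.
        return min(1.0, instantaneous_par)
    # Decay slowly: step down by a small fixed amount per update, but
    # never below the current instantaneous value or 0.0.
    return max(0.0, instantaneous_par, prev_par - decay_step)
```

Repeated calls with a low instantaneous value then walk the parameter down gradually, while one strong value pulls it up at once.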
  • the pitch_par, lpf_par and hpf_par are combined to produce the likelihood of speech parameter, lsp, for the likelihood of speech estimation module.
  • The likelihood of speech parameter, lsp, for each module has a value which varies from 1.0 (i.e., indicating the highest likelihood of speech) to 0.0 (i.e., indicating the lowest likelihood of speech).
  • the likelihood of speech parameter produced by each of the modules is used by the video controller module and the audio mixer module for directing audio and video signals to the sites selectively in accordance with the likelihood of speech parameter, lsp, determined for each of the sites.
  • FIG. 1 is a block diagram of a multi-point video conferencing network according to the invention
  • FIG. 2 is a block diagram of an audio/video controller, or bridge, located at a central site of the video conferencing network of FIG. 1;
  • FIG. 3 is a functional block diagram useful in understanding an exemplary one of a plurality of likelihood of speech estimation modules used in the audio/video controller of FIG. 2;
  • FIG. 4 is a functional block diagram of a video controller module used in the audio/video controller of FIG. 2;
  • FIG. 5 is a functional block diagram of an audio mixer module used in the audio/video controller of FIG. 2;
  • FIG. 6 is a flow diagram useful in understanding the operation of a video source determination module used in the video controller module of FIG. 4;
  • FIG. 7 is a flow diagram useful in understanding the operation of an audio mix determination module used in the audio mixer module of FIG. 5;
  • FIG. 8 is a flow diagram useful in understanding the operation of an audio mixer used in the audio mixer module of FIG. 5, and FIGs. 9A and 9B are flow diagrams useful in understanding the computation of the pitch parameter, pitch_par, by a pitch parameter computation module used in the likelihood of speech estimation module of FIG. 3.
  • A multi-point video-conferencing network 10 is shown in FIG. 1.
  • Such network 10 includes a plurality of, here at least three, conferencing sites, 12 1 - 12 N, as shown. Each site is in communications with an audio/video controller 22, or bridge, over a communications path.
  • Each one of such sites 12 1 - 12 N has a television camera 14 and at least one microphone 16 for producing video and audio signals representative of events taking place at such one of the sites, and a television monitor 18 and loudspeaker 20 for reproducing audio and video signals directed to such one of the sites 12 1 - 12 N.
  • The audio/video controller 22, or bridge, responsive to the audio and video signals (i.e., AUDIO 1 - AUDIO N and VIDEO 1 - VIDEO N, respectively) produced at the sites 12 1 - 12 N, is located at a central site 24 for automatically mixing, then directing, at least a portion of the audio signals produced at the plurality of sites 12 1 - 12 N, and directing a selected one of the video signals (i.e., VIDEO) produced by the plurality of sites 12 1 - 12 N, to the loudspeakers 20 and television monitors 18 at the other sites.
  • The audio/video controller 24 (FIG. 1) includes a conventional microprocessor suitably programmed, in a manner to be described. Suffice it to say here that the program has a plurality of likelihood of speech estimation software modules 26 1 - 26 N (FIG. 2); each one of such modules processes the audio signal that it receives from the corresponding one of the remote sites 12 1 - 12 N, respectively, as indicated. Each one of the likelihood of speech estimation modules 26 1 - 26 N is identical and determines an estimate of the likelihood of speech parameter, lsp, at the remote site associated therewith, in response to the audio signals received from such remote site.
  • A video controller module 30 is included in the audio/video controller 24 for directing one of the produced video signals to the sites selectively in accordance with the likelihood of speech parameters, lsp 1 - lsp N, determined by the likelihood of speech estimation modules 26 1 - 26 N, respectively. (The details of the video controller module 30 will be discussed in connection with FIG. 4.)
  • An audio mixer module 32 is also included in the controller 24 for combining the audio signals produced at the plurality of sites 12 1 - 12 N and for directing such combined audio signals (AUDIO MIX) to the loudspeakers 20 at the sites selectively in accordance with the likelihood of speech parameter, lsp, determined by each of the likelihood of speech estimation modules 26 1 - 26 N.
  • Each likelihood of speech estimation module 26 1 - 26 N determines the estimate of the likelihood of speech, lsp, as a function of a determined periodicity characteristic indicative of speech pitch; i.e., a pitch parameter, pitch_par.
  • Each likelihood of speech estimation module 26 1 - 26 N also determines the lsp as a function of the time history of the power in the audio signals. Such time history is used to assess whether the rate of decay of power is relatively high; i.e., a decay rate more associated with a sharply struck object that rings than with speech.
  • Each one of the likelihood of speech estimation modules 26 1 - 26 N computes a parameter which is a function of the power characteristics in the audio signals. More particularly, a power level parameter is determined with relatively short delay. Such parameter is a function of the "absolute" power level of the audio signal and the ratio of the "absolute" power level to the background noise level. Such parameter is provided for each one of a plurality of different frequency bands. A low frequency band of processed audio signals is used to generate an "instantaneous" low pass filter parameter, "instantaneous_lpf_par". Such parameter is processed by a slow decay module to produce a low pass filter parameter, lpf_par, having a relatively fast rise time, but which is decayed slowly by the module.
  • The resulting, slowly decaying low pass filter parameter, lpf_par, has a value which varies from 1.0 (i.e., indicating the highest likelihood of speech) to 0.0 (i.e., indicating the lowest likelihood of speech).
  • a high pass filter band of processed audio signal is similarly used to provide an "instantaneous" high pass filter parameter, "instantaneous_hpf_par”.
  • a second slow decay processor produces a fast rise time, and slow decaying high pass filter parameter, hpf_par.
  • The high pass filter parameter, hpf_par, has a value which varies from 1.0 (i.e., indicating the highest likelihood of speech) to 0.0 (i.e., indicating the lowest likelihood of speech).
  • the pitch_par, lpf_par and hpf_par are combined to produce the likelihood of speech parameter, lsp, for the likelihood of speech estimation module.
  • The lsp for each module has a value which varies from 1.0 (i.e., indicating the highest likelihood of speech) to 0.0 (i.e., indicating the lowest likelihood of speech).
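The text states that pitch_par, lpf_par and hpf_par are combined into lsp but does not give the combination rule. One simple combination consistent with the stated 0.0 to 1.0 range, shown purely as a hypothetical sketch, is to take the maximum of the three parameters:

```python
def likelihood_of_speech(pitch_par, lpf_par, hpf_par):
    # Hypothetical combination rule: the source says only that the three
    # parameters are "combined" into an lsp in [0.0, 1.0]. Taking the
    # maximum lets any single strong speech cue drive the estimate up.
    lsp = max(pitch_par, lpf_par, hpf_par)
    # Clamp into the stated range for safety.
    return min(1.0, max(0.0, lsp))
```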
  • The likelihood of speech parameters produced by each of the modules 26 1 - 26 N are used by the video controller module 30 and the audio mixer module 32 for directing video and audio signals to the sites selectively in accordance with the likelihood of speech parameter, lsp, determined for each of the sites 12 1 - 12 N.
  • the likelihood of speech estimation modules also determine the power characteristics in the audio signals produced in a plurality of different frequency bands.
  • The power characteristic for each band is determined in accordance with a relatively short delay (i.e., current) "absolute" power level and the relative power between the "absolute" power level and the background noise.
  • the use of high and low pass filter processing with fast response is used because many words begin with a phoneme that does not contain pitch. They are also used as a "back up” because pitch detection is sometimes unreliable because of acoustic noise, coder-decoder (i.e., codec) noise, or the temporary absence of phonemes with a pitch structure or characteristic.
  • the audio and video signals are directed to and from the controller 24 as compressed digital data.
  • The audio signals may be high pass filtered by a third order elliptical high pass filter, not shown, to attenuate frequencies below 60 Hz by at least 20 dB, with frequencies above 100 Hz having ripples of less than 2 dB.
  • The decompressed audio signals have a sampling rate of, here, either 16 kHz or 8 kHz.
  • The samples are processed by three software modules included in each of the likelihood of speech estimation modules.
  • An exemplary one of the likelihood of speech estimation modules 26 1 - 26 N, here module 26 1, is shown to include: a high frequency speech parameter detection module 40; a low frequency speech parameter detection module 42; and a pitch parameter detection module 44.
  • The high frequency speech parameter module provides high pass filtering of the samples to pass only those frequency components having frequencies greater than, or equal to, here, 2500 Hz, when either 8 kHz or 16 kHz sampling is used.
  • filtering is performed by a 5th order elliptical infinite impulse response (IIR) digital filtering module 46.
  • Such filtering module 46 provides a relatively very low processing time delay to the passed signals.
  • The low frequency speech parameter detection module 42 provides low pass filtering, via low pass filtering module 48, of the samples to pass only those frequency components having frequencies less than 2500 Hz when 16 kHz sampling is used. If 8 kHz sampling is used, low pass filtering is eliminated by the low pass filtering module 48.
  • the signals passed by both the high and low pass filtering modules 46, 48 are then each separately processed in short term power detection modules 50, 52, respectively, as shown, to determine the short delay power levels thereof; "hpf_power" and "lpf_power”, respectively.
  • the passed signals are squared and averaged over time, here preferably in a relatively short time interval, for example over a 4 millisecond (ms) time interval in this illustrated embodiment.
  • a new power level is produced, or computed, every 2 ms.
  • the computed power levels from modules 50, 52 may be considered as representing the "absolute" power level of the high and low frequency components in the audio signals.
  • the computed power levels are also averaged over a short time interval, here 20 ms.
  • a new average power level is computed every 2 ms.
  • The computed minimum power levels, "min_hpf_power" and "min_lpf_power", each represent the background noise for the high and low frequency components, respectively, in the audio signals.
  • The background noise for each frequency band is determined every 2 ms by first computing the average power level over the last 20 ms; the smallest of the computed values of average power over the past 10 seconds is then used as the current estimate of the background noise level.
  • The average background noise level is computed, or produced, by the background noise modules 54, 56 every 2 ms; however, such level is the minimum short term average power over, here, a 10 second interval.
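The power averaging and background-noise tracking described above (a 20 ms average updated every 2 ms, with the background noise taken as the minimum average over the past 10 seconds) can be sketched as:

```python
from collections import deque

class BandPowerTracker:
    """Tracks short-term power and a background-noise estimate for one
    frequency band, following the update cadence stated in the text."""

    def __init__(self, update_ms=2, avg_window_ms=20, noise_window_s=10):
        samples_per_avg = avg_window_ms // update_ms          # 20 ms / 2 ms
        avgs_per_window = (noise_window_s * 1000) // update_ms  # 10 s / 2 ms
        self.recent_powers = deque(maxlen=samples_per_avg)
        self.recent_avgs = deque(maxlen=avgs_per_window)

    def update(self, power):
        # "power": the squared-and-averaged signal power for the latest
        # 2 ms step.  Returns (20 ms average, background noise estimate).
        self.recent_powers.append(power)
        avg = sum(self.recent_powers) / len(self.recent_powers)
        self.recent_avgs.append(avg)
        # Background noise = smallest 20 ms average seen in the last 10 s.
        return avg, min(self.recent_avgs)
```

The deque maxlen values implement both sliding windows without explicit timestamp bookkeeping.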
  • The present power levels, "hpf_power" and "lpf_power", of the signals passed by the high and low pass filters, respectively, are determined by combining the "absolute" power from the short term power detection modules 50, 52 and the background noise level from the background noise level modules 54, 56 in combination modules 58, 60, respectively, as shown and in a manner to be described. More particularly, the instantaneous low frequency parameter, "instantaneous_lpf_par", and the instantaneous high frequency parameter, "instantaneous_hpf_par", are computed as:
  • instantaneous_lpf_par = min{1.0, min(4, 0.0002 * sqrt(lpf_power)) * min(1.0, 0.004 * lpf_power / min_lpf_power)}
  • instantaneous_hpf_par = min{1.0, min(4, 0.003 * sqrt(hpf_power)) * min(1.0, 0.004 * hpf_power / min_hpf_power)}; where the constants 0.0002 and 0.003 are experimentally determined and vary with the number of bits in each digital sample. Here 16 bits are used.
  • The constant for the low pass filtered components (i.e., 0.0002) is smaller than that for the high pass filtered frequency components because most speech sounds have more "absolute" power at the lower frequencies. Background acoustic noise also usually has higher power density at low frequencies than at high frequencies.
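The two formulas can be written directly in code; the small guard against division by zero is an added assumption, not part of the stated formulas:

```python
import math

def instantaneous_band_par(power, min_power, scale):
    """Per-band instantaneous speech parameter, per the stated formula:
    min{1.0, min(4, scale*sqrt(power)) * min(1.0, 0.004*power/min_power)}.
    scale is 0.0002 for the low band and 0.003 for the high band when
    16-bit samples are used."""
    absolute_term = min(4.0, scale * math.sqrt(power))
    # Guard against a zero noise estimate (an added assumption).
    snr_term = min(1.0, 0.004 * power / max(min_power, 1e-12))
    return min(1.0, absolute_term * snr_term)

def instantaneous_lpf_par(lpf_power, min_lpf_power):
    return instantaneous_band_par(lpf_power, min_lpf_power, 0.0002)

def instantaneous_hpf_par(hpf_power, min_hpf_power):
    return instantaneous_band_par(hpf_power, min_hpf_power, 0.003)
```

Both the absolute power term and the power-to-noise ratio term must be large for the parameter to approach 1.0.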
  • the pitch parameter processing software module 44 is provided to detect a periodicity characteristic indicative of a speech pitch characteristic in the audio signals.
  • A useful difference between speech signals and many other sounds is that speech is sometimes nearly periodic. Vowels and other nearly periodic phonemes arise from repeated pressure pulses produced by the vocal cords.
  • Such periodicity is used herein with the above described low frequency parameter, "lpf_par” and high frequency parameter, "hpf_par” to estimate the likelihood of speech and provide the likelihood of speech parameter, lsp.
  • The audio signals fed to each one of the likelihood of speech estimation modules 26 1 - 26 N are, after low pass filtering by filter 48, sequentially windowed by a pitch frame module 70 (within a periodicity detection module 72) to produce a series of overlapping frames.
  • Each one of the frames produced by module 70 is processed to detect for periodicity by a periodicity detection module 72.
  • a determination is made of the degree of speech periodicity in the audio signal. Such determination is made by performing the autocorrelation process on the windowed audio signals in the frame.
  • The autocorrelation process is performed in an autocorrelation module 74 which produces a discrete autocorrelation function, R(τ), where τ indexes time delay in the autocorrelation process.
  • The maximum of the autocorrelation function, R(τ)_max, within the range of periodicity associated with speech is detected.
  • The degree to which the periodicity and power of the audio signal are consistent with the pitch characteristics of speech is determined; however, audio signals exhibiting rapid decay in frame-to-frame power are considered as being attributable to sharply struck objects which ring and are rejected as not being speech pitch.
  • The pitch parameter processing module 44 produces a pitch parameter, "pitch_par", here once every 16 ms. Each time such pitch parameter, "pitch_par", is produced, the latest 32 ms of audio signal is windowed, or framed, using pitch frame module 70, by a function, or suitable amplitude envelope, which has zero amplitude at the beginning and end of the window. Such a frame is sometimes referred to hereinafter as a pitch frame.
  • a pitch frame is provided once every 16 ms; thus a sequence of 32 ms, overlapping frames is produced, one frame overlapping the next by 16 ms.
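The overlapping 32 ms pitch frames, produced every 16 ms with a zero-endpoint amplitude envelope, can be sketched as follows; the choice of a Hann window is an assumption, since the text requires only that the envelope be zero at the window ends:

```python
import numpy as np

def pitch_frames(samples, sample_rate=16000, frame_ms=32, hop_ms=16):
    """Yield overlapping, windowed pitch frames: 32 ms long, one every
    16 ms, each frame overlapping the next by 16 ms."""
    frame_len = sample_rate * frame_ms // 1000   # 512 samples at 16 kHz
    hop = sample_rate * hop_ms // 1000           # 256 samples at 16 kHz
    window = np.hanning(frame_len)               # zero at both endpoints
    for start in range(0, len(samples) - frame_len + 1, hop):
        yield window * samples[start:start + frame_len]
```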
  • autocorrelation is a process performed in the time domain; however, it can be computed efficiently in the frequency domain.
  • the frequency spectrum of each pitch frame is calculated using conventional Fast Fourier Transform (FFT) processing in the autocorrelation module 74.
  • FFT Fast Fourier Transform
  • The frequency spectrum H(ω) is multiplied by its complex conjugate H*(ω) to produce H(ω) · H*(ω), where ω is proportional to frequency. While still in the frequency domain, peaks in the produced H(ω) · H*(ω) are heightened relative to the background.
  • R(0) represents the zero shift power, i.e., the power in the windowed audio signal frame.
  • The autocorrelation function R(τ) is then searched for its maximum, R(τ)_max, in a range of delays from, here, 4.375 to 13.375 ms.
  • The ratio R(τ)_max/R(0) is determined and stored.
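The FFT-based autocorrelation and the search over the 4.375 to 13.375 ms delay range can be sketched as below. The spectral peak-heightening step mentioned in the text is omitted, and the zero-padding to twice the frame length (to avoid circular wrap-around) is an implementation assumption:

```python
import numpy as np

def pitch_ratio(frame, sample_rate=16000, lo_ms=4.375, hi_ms=13.375):
    """Return (R(tau)_max / R(0), best delay in ms) over the speech
    pitch range, computing the autocorrelation via the FFT."""
    n = len(frame)
    spectrum = np.fft.rfft(frame, 2 * n)           # zero-padded FFT
    power_spectrum = spectrum * np.conj(spectrum)  # H(w) * H*(w)
    r = np.fft.irfft(power_spectrum)[:n]           # autocorrelation R(tau)
    lo = int(sample_rate * lo_ms / 1000)
    hi = int(sample_rate * hi_ms / 1000)
    tau = lo + int(np.argmax(r[lo:hi + 1]))
    return r[tau] / r[0], tau / sample_rate * 1000.0
```

For a windowed 200 Hz tone (5 ms period) the ratio is high and the detected delay falls near 5 ms.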
  • A speech pitch detection module 82 sets a periodicity detection flag when: (1) the ratio R(τ)_max/R(0) for each of the past 3 frames is greater than, here, 0.7; (2) the pitch period, τ, has a standard deviation less than, here, 0.56 ms over the last three frames; (3) the average of the instantaneous low pass filter parameters, "instantaneous_lpf_par", for the past three frames exceeds 0.05; and (4) the power, R(0), over the last 10 pitch frames does not fall off rapidly, in a manner now to be described.
  • The power for each pitch frame, R(0), hereinafter referred to as P_0, is stored in memory. If there is a strong downward trend in power (i.e., in P_0), the periodicity flag is not set, indicating that the sound is not likely speech.
  • the older frames, j, with which the current frame is compared, are allowed more leeway for some reduction in power.
  • Where the current frame power level is P_0 and the power level n frames ago is P_-n, if any one of the following inequalities is satisfied, then the current power is considered to be from a non-speech source:
  • P_-1 > 2.75 * P_0; P_-2 > (2.75 * 1.07) * P_0; P_-3 > (2.75 * 1.07^2) * P_0; P_-4 > (2.75 * 1.07^3) * P_0
  • If any one of these inequalities is satisfied, the periodicity detection flag is not set; i.e., the last criterion, criterion (4), set forth above is not met. Otherwise, the periodicity detection flag is set if the first three criteria described above are also met. Thus, the periodicity detection flag is set only if all four of the criteria set forth above are satisfied.
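The four criteria, including the power fall-off test, can be sketched as follows. Extending the stated inequality pattern (given through P_-4) across the 10-frame window with the same 1.07 growth factor, and using the population standard deviation for criterion (2), are assumptions:

```python
import statistics

def power_falls_off(powers):
    """powers: recent frame powers [..., P_-2, P_-1, P_0].  Returns True
    if an older frame exceeds 2.75 * 1.07**(n-1) times the current
    power, i.e. the sound decays like a struck object that rings."""
    p0 = powers[-1]
    for n in range(1, min(len(powers), 10)):
        if powers[-1 - n] > 2.75 * (1.07 ** (n - 1)) * p0:
            return True
    return False

def periodicity_flag(ratios, periods_ms, lpf_pars, powers):
    """The four criteria, evaluated over the most recent pitch frames."""
    return (
        all(r > 0.7 for r in ratios[-3:])                 # criterion (1)
        and statistics.pstdev(periods_ms[-3:]) < 0.56     # criterion (2)
        and sum(lpf_pars[-3:]) / 3 > 0.05                 # criterion (3)
        and not power_falls_off(powers[-10:])             # criterion (4)
    )
```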
  • a pitch parameter computation module 86 determines the pitch parameter, "pitch_par" as follows:
  • The pitch parameter, "pitch_par", is allowed to rise rapidly but decay slowly. However, the processor decays the pitch parameter, "pitch_par", more slowly than the low pass filter parameter, "lpf_par", and the high pass filter parameter, "hpf_par".
  • In the absence of periodicity detection, the pitch parameter computation module 86 decays the pitch parameter, pitch_par, by subtracting 0.0016 from the value it held on the previous frame; however, the pitch parameter, pitch_par, is not allowed to go below zero. Thus, in the absence of periodicity detection, the pitch parameter, "pitch_par", decays from 1 to 0 in 10 seconds.
  • The pitch parameter computation module 86 then upper-limits the intermediate value "temp" to 1.0. This processing slightly dampens increases in the pitch parameter, "pitch_par", before affecting the current value of the pitch parameter, "pitch_par".
  • The rise time is also slightly dampened so that a single isolated instance of periodicity does not produce a pitch parameter, "pitch_par", of 1. Also, an average ratio of R(τ)_max/R(0) over the last three pitch frames below 0.8 can prevent the pitch parameter, "pitch_par", from reaching 1.
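A sketch of the pitch_par update: the decay step of 0.0016 per 16 ms frame is from the text, but the exact dampened-rise rule via the intermediate value "temp" is not fully specified there, so the rise step below is an illustrative assumption:

```python
def update_pitch_par(prev_pitch_par, flag_set, avg_ratio):
    """One pitch_par update per 16 ms pitch frame.

    Decay: subtract 0.0016 per frame (so 1 -> 0 in 10 s), floored at 0.
    Rise (assumed form): step halfway toward a target scaled by the
    3-frame average R(tau)_max/R(0) ratio, so a single periodic frame
    cannot reach 1 and an average ratio below 0.8 caps the parameter."""
    if not flag_set:
        return max(0.0, prev_pitch_par - 0.0016)
    target = min(avg_ratio / 0.8, 1.0)
    temp = prev_pitch_par + 0.5 * (target - prev_pitch_par)
    return min(1.0, max(temp, prev_pitch_par))
```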
  • Each one of the likelihood of speech estimation modules 26 1 - 26 N determines the likelihood of speech parameter, lsp 1 - lsp N, respectively, from the audio signals produced at the corresponding site 12 1 - 12 N, respectively.
  • A new speech likelihood parameter, lsp, is produced every 2 ms as a seven-bit digital word.
  • The video controller module 30 (FIG. 4) is included in the audio/video controller 24 for directing one of the produced video signals, VIDEO 1 - VIDEO N, to the sites 12 1 - 12 N, selectively in accordance with the likelihood of speech parameters, lsp 1 - lsp N, determined for each of the sites.
  • An audio mixer module 32 (FIG. 5) is also included in the audio/video controller 24 for combining the audio signals AUDIO 1 - AUDIO N produced at the plurality of sites and for directing combinations of the audio signals into AUDIO MIX signals for the loudspeakers at the sites selectively in accordance with the likelihood of speech parameters, lsp 1 - lsp N, determined for each of the sites.
  • the video controller module 30, FIG. 4 determines the remote site that is to supply the video signal for the other sites.
  • A video source determination module 90 makes this selection and provides a video output index parameter, "video_out_index", to a video switcher 92 and to the audio mixer module 32 (FIG. 5).
  • The likelihood of speech parameter, lsp, for each site i is sampled by the video source determination module 90 (FIG. 4). If the lsp for a site is greater than 0.9, then an activity_time parameter for such site, in the video source determination module 90, is incremented by 2 ms; otherwise, the activity_time parameter for such site is set to zero.
  • a counter used by the video source determination module 90 keeps track of the time duration the site presently supplying the video signal has been supplying such video signal.
  • The time duration is designated as a video source timer parameter, "video_source_timer". If the video source timer indicates that the site currently supplying the video has been doing so for less than a predetermined minimum dwell time, here 1000 ms, that site is allowed to continue supplying the video in order to prevent too frequent changes in the video broadcast. Also, if the site currently supplying the video has an lsp greater than 0.8, that site is allowed to continue supplying the video, again to prevent too frequent changes in the video broadcast. Other sites are not considered in these cases.
  • Otherwise, the video source determination module 90 finds the site with the largest lsp among only those sites having an "activity_time" greater than 1500 ms. That site becomes the new video signal broadcaster. If there is no site with an "activity_time" greater than 1500 ms, the video signal broadcasting site is left unchanged.
  • When the video source determination module 90 selects a new site for broadcasting the video signal, the video source timer is cleared to zero, and the selected site is designated by the video output index parameter, "video_out_index".
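The video source selection described above can be sketched as a small state update run every 2 ms. This is a minimal illustration of the thresholds and timers named in the text (0.9 activity threshold, 1000 ms dwell, 0.8 hold, 1500 ms activity requirement); the function and variable names are illustrative, not from the patent.

```python
TICK_MS = 2                    # lsp sampling interval
MIN_DWELL_MS = 1000            # minimum time a site keeps the video
ACTIVITY_THRESHOLD_MS = 1500   # activity needed to become the broadcaster
LSP_ACTIVE = 0.9               # lsp above which a site accrues activity time
LSP_HOLD = 0.8                 # lsp above which the current broadcaster is kept

def update_video_source(lsp, activity_time, video_out_index, video_source_timer):
    """One 2 ms tick of the video source determination.

    lsp           -- likelihood-of-speech value per site
    activity_time -- per-site activity timers in ms (updated in place)
    Returns the new (video_out_index, video_source_timer)."""
    # Accrue or reset activity time for every site.
    for i, p in enumerate(lsp):
        activity_time[i] = activity_time[i] + TICK_MS if p > LSP_ACTIVE else 0

    video_source_timer += TICK_MS

    # Keep the current broadcaster during the minimum dwell time,
    # or while it still has a high likelihood of speech.
    if video_source_timer < MIN_DWELL_MS or lsp[video_out_index] > LSP_HOLD:
        return video_out_index, video_source_timer

    # Otherwise pick the most likely talker among sufficiently active sites.
    candidates = [i for i in range(len(lsp))
                  if activity_time[i] > ACTIVITY_THRESHOLD_MS]
    if candidates:
        new_index = max(candidates, key=lambda i: lsp[i])
        if new_index != video_out_index:
            video_out_index = new_index
            video_source_timer = 0   # cleared when the video site changes
    return video_out_index, video_source_timer
```

Note that when no site exceeds the activity requirement, the current broadcaster is retained even if its own lsp has fallen, matching the "left unchanged" behavior described above.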
  • The audio mixer module 32 responds to the likelihood of speech parameters, lsp1 - lspN, produced for each of the sites 121 - 12N, and to the "video_out_index" produced by the video source determination module 90 (FIG. 4), to determine weighting factors W1 - WN for an audio mixer 94. More particularly, every 2 ms, for each remote site, a parameter, "modified_lsp", is computed by an audio mix determination module 96 (FIG. 5).
  • The "modified_lsp" parameter for each site is initialized to equal the corresponding likelihood of speech parameter, lsp, for that site. Then, for the site currently broadcasting the video, the "modified_lsp" is set to 1.2, or such other value large enough to ensure that the audio from that site is included in the audio mix broadcast to the other sites receiving the video. Then, for each of the sites that were included in the audio mix 2 ms earlier, 0.1 is added to the "modified_lsp" for that site. This has the effect of providing hysteresis; i.e., making it less likely that the set of sites included in the mix will change.
  • The audio mix determination module 96 then examines the "modified_lsp" for all sites and determines, here, the three sites having the largest values of "modified_lsp"; those sites are included in the audio mix (weight W = 1) and the remaining sites are excluded (weight W = 0).
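The modified_lsp computation and three-site selection above can be sketched as follows. The constants 1.2, 0.1, and 3 come from the text; the function name and data layout (a list of lsp values, a set of previously mixed site indices) are illustrative assumptions.

```python
VIDEO_BONUS = 1.2     # forces the video-broadcasting site into the mix
HYSTERESIS = 0.1      # favors sites already in the mix
MAX_TALKERS = 3       # number of sites included in the audio mix

def select_mix_sites(lsp, video_out_index, previous_mix):
    """Return the set of site indices whose audio is mixed this 2 ms tick.

    lsp             -- likelihood-of-speech value per site
    video_out_index -- index of the site currently broadcasting video
    previous_mix    -- set of sites that were in the mix 2 ms earlier."""
    modified_lsp = list(lsp)                      # initialize from lsp
    modified_lsp[video_out_index] = VIDEO_BONUS   # video site always included
    for i in previous_mix:                        # hysteresis for prior members
        modified_lsp[i] += HYSTERESIS
    # Pick the three sites with the largest modified_lsp.
    ranked = sorted(range(len(lsp)),
                    key=lambda i: modified_lsp[i], reverse=True)
    return set(ranked[:MAX_TALKERS])
```

Because lsp values lie below 1.0, the 1.2 bonus guarantees the video broadcaster always ranks first, and the 0.1 hysteresis bonus keeps marginal sites from flickering in and out of the mix.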
  • The audio mixer 94 mixes, or scales, the audio signals AUDIO1 - AUDION fed thereto from the remote sites 121 - 12N by the weights W1 - WN, respectively, to produce an AUDIO MIX output signal for the sites. If the audio weight, Wn, for a site 12n is 0, and the weight Wn for that site was 0 two milliseconds earlier, then all 2 ms of the audio signals from site 12n are excluded from the AUDIO MIX output signal.
  • If the audio weight, Wn, for site 12n is 1, and the weight Wn for that site was 1 two milliseconds earlier, then all 2 ms of the audio signals from site 12n are included in the AUDIO MIX output signal. If the audio weight, Wn, for a site 12n is now 1 but was 0 two milliseconds earlier, then the audio signals from that site are included in the AUDIO MIX output signal, but only after being scaled in level by a function which goes smoothly from 0.0 to 1.0 over the course of the 2 ms. This ensures that no clicks resulting from weight discontinuities will be audible.
  • Here, such function is a simple ramp which increases linearly from 0 to 1 over the 2 ms. If the audio weight, Wn, for a site 12n is presently 0 but was 1 two milliseconds earlier, then the audio from that site is included in the AUDIO MIX output signal, but only after being scaled by a function which goes smoothly from 1.0 to 0.0 over the course of the next two milliseconds, again ensuring that no clicks resulting from weight discontinuities are audible.
  • Here, such function is a simple ramp which decreases linearly from 1 to 0 over the 2 ms. The process is summarized by the flow diagram in FIG. 8.
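The weight-ramping behavior above amounts to a per-block gain with a linear crossfade on transitions. This is a minimal sketch; the function name and the representation of a 2 ms block as a list of samples are illustrative assumptions.

```python
def apply_weight(samples, w_now, w_prev):
    """Scale one 2 ms block of audio samples by the site weight.

    w_now, w_prev -- this block's and the previous block's weights (0 or 1)."""
    n = len(samples)
    if w_now == w_prev:
        # Steady state: weight 0 excludes the block, weight 1 passes it whole.
        return [s * w_now for s in samples]
    # Transition: ramp linearly from w_prev to w_now across the block,
    # so no click from a weight discontinuity is audible.
    return [s * (w_prev + (w_now - w_prev) * (k + 1) / n)
            for k, s in enumerate(samples)]
```

At a 16 kHz sample rate, for example, a 2 ms block holds 32 samples, so the ramp spreads the gain change over 32 steps rather than applying it as a single discontinuity.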
  • The video switcher 92 (FIG. 4) is of any conventional design and selects, as the video broadcasting site, the site designated by the "video_out_index" parameter.
  • The video switcher 92 may be a selector which couples one of the video signals VIDEO1 - VIDEON to all other sites selectively in accordance with the "video_out_index" parameter. It is noted, however, that the site broadcasting the video receives on its video monitor the video broadcast from the last broadcasting site. It is further understood that, for any site, the audio mix received at that site (i.e., the loudspeaker signal) does not include the audio signal originating from that site.

Abstract

A videoconferencing network (10) comprising a plurality of conference sites (121-12N). A controller (22), located at a central site, is responsive to the audio (16) and video (14) signals produced at the sites and automatically routes at least a portion of the audio signals from the plurality of sites, together with a selected one of the video signals from the plurality of sites, to the television monitors (18) and loudspeakers (20) installed at the sites. An audio/video controller module (24) includes a video controller that directs one of the produced video signals to the sites in accordance with a likelihood of speech parameter determined for each of the sites, and an audio mixer module that combines the audio signals produced at the plurality of sites and directs the combined audio signals to the loudspeakers at the sites in accordance with the likelihood of speech parameter determined for each of the sites.
PCT/US1995/010942 1994-08-31 1995-08-30 Apparatus and method for detecting speech in the presence of other sounds WO1996007177A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US29892894A 1994-08-31 1994-08-31
US08/298,928 1994-08-31

Publications (1)

Publication Number Publication Date
WO1996007177A1 true WO1996007177A1 (fr) 1996-03-07

Family

ID=23152594

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1995/010942 WO1996007177A1 (fr) 1994-08-31 1995-08-30 Apparatus and method for detecting speech in the presence of other sounds

Country Status (1)

Country Link
WO (1) WO1996007177A1 (fr)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4847829A (en) * 1985-04-08 1989-07-11 Datapoint Corporation Video conferencing network
US5263019A (en) * 1991-01-04 1993-11-16 Picturetel Corporation Method and apparatus for estimating the level of acoustic feedback between a loudspeaker and microphone
US5271057A (en) * 1991-10-09 1993-12-14 Bell Communications Research, Inc. Audio processing system for teleconferencing system
US5317672A (en) * 1991-03-05 1994-05-31 Picturetel Corporation Variable bit rate speech encoder
US5434913A (en) * 1993-11-24 1995-07-18 Intel Corporation Audio subsystem for computer-based conferencing system


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002049356A2 (fr) * 2000-12-12 2002-06-20 Intel Corporation Automatic multi-camera video composition
WO2002049356A3 (fr) * 2000-12-12 2002-11-28 Intel Corp Automatic multi-camera video composition
US6577333B2 (en) 2000-12-12 2003-06-10 Intel Corporation Automatic multi-camera video composition
CN100474919C (zh) * 2000-12-12 2009-04-01 Intel Corporation Method of automatically selecting a video source and automatic video switching system
WO2003010619A2 (fr) * 2001-07-23 2003-02-06 Sergei Vladislavovich Rudakov Audio and video communication device for computer users
WO2003010619A3 (fr) * 2001-07-23 2003-03-20 Sergei Vladislavovich Rudakov Audio and video communication device for computer users

Similar Documents

Publication Publication Date Title
US5878391A (en) Device for indicating a probability that a received signal is a speech signal
US5642464A (en) Methods and apparatus for noise conditioning in digital speech compression systems using linear predictive coding
US5991277A (en) Primary transmission site switching in a multipoint videoconference environment based on human voice
EP0707763B1 (fr) Reduction de bruits de fond pour l'amelioration de la qualite de voix
EP1585947B1 (fr) Systeme de detection et de localisation sonore
US6154721A (en) Method and device for detecting voice activity
US6411927B1 (en) Robust preprocessing signal equalization system and method for normalizing to a target environment
US4230906A (en) Speech digitizer
KR100283421B1 (ko) Speech rate conversion method and apparatus therefor
KR100302370B1 (ko) Speech interval detection method and system, and speech rate conversion method and system using the same
RU2206174C2 (ru) Method for controlling sound reproduction volume and devices for implementing it
Hellwarth et al. Automatic conditioning of speech signals
MXPA05002290A (es) Controlling loudness of speech in signals that contain speech and other types of audio material
JPH07193548A (ja) Noise reduction processing method
JP2792853B2 (ja) Method and apparatus for transmitting audio signals
US5430826A (en) Voice-activated switch
JPH0713586A (ja) Speech discrimination device and sound reproduction device
WO1996007177A1 (fr) Apparatus and method for detecting speech in the presence of other sounds
JP2910417B2 (ja) Speech/music discrimination device
US6115589A (en) Speech-operated noise attenuation device (SONAD) control system method and apparatus
US20050267745A1 (en) System and method for babble noise detection
US6633847B1 (en) Voice activated circuit and radio using same
US3437757A (en) Speech analysis system
JPH0689095A (ja) Acoustic signal selection device
EP0583172A1 (fr) Speech decoding apparatus and decoding method

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): CA JP

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: CA