WO2018049391A1 - Method and apparatus for exemplary segment classification - Google Patents

Method and apparatus for exemplary segment classification

Info

Publication number
WO2018049391A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
energy
calculation module
kurtosis
silence
Prior art date
Application number
PCT/US2017/051160
Other languages
English (en)
Inventor
Fathy Yassa
Ben Reaves
Nima Ferdosi
Original Assignee
Speech Morphing Systems, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US15/262,963 (US9767791B2)
Application filed by Speech Morphing Systems, Inc.
Priority to EP17849768.1A (EP3510592A4)
Publication of WO2018049391A1
Priority to IL265310A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/09 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being zero crossing rates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/93 Discriminating between voiced and unvoiced parts of speech signals

Definitions

  • Speech segmentation is the process of identifying the boundaries between words, syllables, or phonemes in spoken natural language.
  • the meaning of a complex spoken sentence (which often has never been heard or uttered before) can be understood only by decomposing the complex spoken sentence into smaller lexical segments (roughly, the words of the language), associating a meaning to each segment, and then combining those meanings according to the grammar rules of the language.
  • the recognition of each lexical segment in turn requires its decomposition into a sequence of discrete phonetic segments and mapping each segment to one element of a finite set of elementary sounds (roughly, the phonemes of the language).
  • Voice activity detection (VAD), also known as speech activity detection, is a technique used in speech processing in which the presence or absence of human speech is detected.
  • the main uses of VAD are in speech coding and speech recognition.
  • VAD can facilitate speech processing, and can also be used to deactivate some processes during the non-speech sections of an audio session: VAD can avoid unnecessary coding/transmission of silence packets in Voice over Internet Protocol applications, saving on computation and on network bandwidth.
  • aspects of the exemplary embodiments relate to systems and methods designed to segment speech by detecting the pauses between the words and/or phrases, i.e. to determine whether a particular time interval contains speech or non-speech, e.g. a pause.
  • FIG. 1 illustrates a block diagram of a system for classifying input audio, according to an exemplary embodiment.
  • FIG. 2 illustrates a flow diagram of a method of identifying lexical boundaries between silence and non-silence in input audio, according to an exemplary embodiment.
  • FIG. 3 illustrates a graph of the sliding energy ratio as a function of time, according to an exemplary embodiment.
  • FIG. 4 illustrates a block diagram of the blocks over which energy is calculated, according to an exemplary embodiment.
  • FIG. 5 is a flow diagram of a method of identifying audio transitions within a super segment, according to an exemplary embodiment.
  • FIG. 6 is a flow diagram of a method of extracting feature vectors for each block, according to an exemplary embodiment.
  • FIG. 7 illustrates a flow diagram of a method of making a preliminary
  • FIG. 8 illustrates a flow diagram of a method of classifying the audio contained within a segment, according to an exemplary embodiment.
  • FIG. 1 illustrates a block diagram of a system for classifying input audio, according to an exemplary embodiment.
  • the classification system in FIG. 1 may be implemented as a computer system
  • computer system 110 is a computer comprising several modules, i.e. computer components embodied as either software modules, hardware modules, or a combination of software and hardware modules, whether separate or integrated, working together to form an exemplary computer system.
  • the computer components may be implemented as a Field Programmable Gate Array (FPGA) or Application Specific Integrated Circuit (ASIC), which performs certain tasks.
  • a unit or module may advantageously be configured to reside on the addressable storage medium and configured to execute on one or more processors or microprocessors.
  • a unit or module may include, by way of example, components, such as software components, object oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.
  • Input 115 is a module configured to receive input audio from any audio source and output the received audio to the divider 117.
  • the audio source may be live audio, for example received from a microphone; or recorded audio, for example received from a file, etc.
  • the divider 117 is a module configured to receive the output of the input 115 and divide said input audio into blocks of samples (for example, 500 samples per block), i.e. the blocked audio. Audio may be silence, speech, music, background/ambient noise, or any combination thereof.
  • the energy calculator 120 is a module configured to receive the blocked audio output from the divider 117, calculate the energy of the waveform of the input audio and output the calculated energy of the waveform to the energy ratio calculator 125.
  • the energy ratio calculator 125 is a module configured to calculate the ratio of the energy of N1 contiguous blocks of audio to the energy of N2 contiguous blocks of audio contained within the N1 contiguous blocks, and to output the energy ratio to the silence boundary locator 130.
  • the silence boundary locator 130 is a module configured to receive the energy ratio calculated by the energy ratio calculator 125, identify the lexical boundaries between the silence and non-silence portions of the input audio, and output the non-silence portions of the input audio to the kurtosis calculator 135.
  • the kurtosis calculator 135 is a module configured to receive the non-silence audio portions from the silence boundary locator 130, calculate the kurtosis over a sliding window in the non-silence audio portion and identify the transitions between the different types of audio within each non-silence audio portion.
  • the non-silence audio portion between each transition is designated a super segment.
  • Each super segment retains the block divisions from the divider 117.
  • the kurtosis calculator 135 is further configured to calculate the kurtosis for each individual block within a super segment.
  • the zero crossing calculator 140 is a module configured to receive the block divided super segments from the kurtosis calculator 135 and calculate the number of zero crossings in each individual block within said super segments.
  • a zero-crossing is a point at which the sign of a mathematical function changes (e.g. from positive to negative), represented by a crossing of the axis (zero value) in the graph of the function, and is a commonly used term in electronics, mathematics, sound, and image processing.
  • the silence ratio calculator 142 is a module configured to receive the block divided super segments from the kurtosis calculator 135 and calculate the silence ratio over each individual block.
  • the silence ratio of a block is the ratio of the number of samples that are smaller than a given threshold to the length of the block.
  • the weighted averager 145 is a module configured to create a weighted average of the parameters to create the feature vector of each individual block.
  • the distance 150 is a module configured to determine, for each block, the
  • the preliminary decider 155 is a module configured to make a preliminary classification of the audio contained within each block, i.e. whether the block is primarily speech, music or background/ambient noise or silence.
  • the polar 160 is a module configured to poll the individual blocks within a segment to make a final classification of the audio within said segment.
  • the pause detector 170 is a module configured to obtain segments, determine whether a segment's length is longer than a given threshold, and divide such segments into smaller segments.
  • FIG. 2 illustrates a flow diagram of a method of identifying lexical boundaries between silence and non-silence in input audio, according to an exemplary embodiment.
  • the audio portions between these initial lexical boundaries are designated super segments, which may contain multiple types of audio.
  • the input 115 receives input audio from an audio source.
  • the audio source may be live audio, for example received from a microphone, recorded audio, for example audio recorded in a file, synthesized audio, etc.
  • the divider 117 divides the input audio into blocks of samples, for example 500 samples per block.
  • the length of the blocks is a function of the sample rate.
  • the energy calculator 120 calculates the energy over N1 contiguous blocks.
  • N1 is 6 blocks.
  • the energy calculator 120 calculates the energy over N2 contiguous blocks where said contiguous blocks are wholly contained within the N1 contiguous blocks.
  • N2 is 4 blocks.
  • the energy ratio calculator 125 calculates the ratio of the energy of the N1 blocks over the energy of the N2 blocks.
  • the energy ratio calculator 125 calculates the energy ratio over a sliding frame.
  • the silence boundary locator 130 identifies the lexical boundaries to and from silence by the energy ratio of the blocks. When there is a transition from silence to non-silence, the moving energy ratio will spike sharply. When there is a transition from non-silence to silence, the moving energy ratio will decline abruptly. The silence boundary locator 130 will assign each super segment a label, silence or non-silence, based upon these transitions.
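The blocking and sliding energy ratio steps above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the energy definition (sum of squared samples, averaged per block), the placement of the N2 blocks at the start of the N1 window, and the `spike` and `dip` thresholds are all assumptions that the text does not specify, and every function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def split_into_blocks(samples: Sequence[float], block_size: int = 500) -> List[List[float]]:
    """Divide audio into consecutive fixed-size blocks; a trailing partial block is dropped."""
    n_blocks = len(samples) // block_size
    return [list(samples[i * block_size:(i + 1) * block_size]) for i in range(n_blocks)]


def block_energy(block: Sequence[float]) -> float:
    """Energy taken as the sum of squared samples (an assumed definition)."""
    return sum(x * x for x in block)


def sliding_energy_ratio(blocks: Sequence[Sequence[float]], n1: int = 6, n2: int = 4,
                         eps: float = 1e-12) -> List[float]:
    """Ratio of the mean block energy over N1 blocks to that over N2 blocks inside them."""
    energies = [block_energy(b) for b in blocks]
    ratios = []
    for start in range(len(energies) - n1 + 1):
        mean_n1 = sum(energies[start:start + n1]) / n1
        # Assumption: the N2 blocks are the leading blocks of the N1 window.
        mean_n2 = sum(energies[start:start + n2]) / n2
        # eps avoids division by zero on digital silence.
        ratios.append((mean_n1 + eps) / (mean_n2 + eps))
    return ratios


def find_transitions(ratios: Sequence[float], spike: float = 2.0,
                     dip: float = 0.8) -> List[Tuple[int, str]]:
    """Flag window positions whose ratio departs clearly from 1 (thresholds are hypothetical)."""
    events = []
    for i, r in enumerate(ratios):
        if r > spike:
            events.append((i, "spike"))
        elif r < dip:
            events.append((i, "dip"))
    return events
```

With this window layout, a quiet passage followed by a loud one produces ratios well above 1 as the loud blocks enter the trailing end of the window, and the reverse ordering produces ratios below 1; a different placement of the N2 blocks would change which direction of transition yields which deviation.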
  • FIG. 3 illustrates a graph of the sliding energy ratio as a function of time.
  • the time index 310 illustrates a relatively uniform energy ratio of approximately
  • the time index 320 illustrates a spike in the energy ratio, i.e. the energy ratio is substantially greater than 1. This spike in the energy ratio indicates a transition from non-silence to silence.
  • the time index 330 illustrates a uniform energy ratio of approximately 1, once again indicating a lack of transitions between silence and non-silence.
  • Time index 340 represents a dip in the energy ratio, i.e. the energy ratio is substantially less than 1, indicating transition from silence to non-silence.
  • FIG. 4 illustrates a block diagram of the blocks over which energy calculator 120 calculates the energy.
  • Blocks 405 illustrate the blocks in the input audio.
  • Blocks 410 illustrate the N1 contiguous blocks over which energy calculator 120 first calculates the energy.
  • Blocks 420 illustrate the N2 contiguous blocks over which energy calculator 120 also calculates the energy.
  • FIG. 5 is a flow diagram of the method of identifying audio transitions within a super segment.
  • the kurtosis calculator 135 receives a blocked super segment of non-silence audio.
  • the kurtosis calculator 135 calculates the kurtosis over a sliding window over N3 blocks in the super segment. Calculating kurtosis is within the skill of one schooled in the art of signal processing.
  • the kurtosis calculator 135 identifies the transitions between the different types of audio contained within the N3 blocks of the super segment, i.e. speech, music, and background/ambient noise. A rapid change in the kurtosis over a sliding window of N3 blocks indicates a change in the audio as shown in Table 1.
  • the audio between transitions is designated a segment.
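The sliding-window kurtosis step can be sketched as below. This is an illustration under stated assumptions: kurtosis is computed as the fourth central moment divided by the squared variance (a Gaussian signal gives roughly 3), the window length `n3` and the `min_jump` threshold are hypothetical values, and "rapid change" is interpreted as a large difference between consecutive window values.

```python
from typing import List, Sequence


def kurtosis(samples: Sequence[float]) -> float:
    """Population kurtosis m4 / m2**2; returns 0.0 for a constant signal."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((x - mean) ** 2 for x in samples) / n
    m4 = sum((x - mean) ** 4 for x in samples) / n
    return m4 / (m2 * m2) if m2 > 0 else 0.0


def sliding_kurtosis(blocks: Sequence[Sequence[float]], n3: int = 3) -> List[float]:
    """Kurtosis of the samples in each window of n3 contiguous blocks."""
    values = []
    for start in range(len(blocks) - n3 + 1):
        window = [x for block in blocks[start:start + n3] for x in block]
        values.append(kurtosis(window))
    return values


def transition_indices(values: Sequence[float], min_jump: float = 1.0) -> List[int]:
    """Window positions where kurtosis changes by at least min_jump (hypothetical threshold)."""
    return [i for i in range(1, len(values))
            if abs(values[i] - values[i - 1]) >= min_jump]
```

The audio between two consecutive flagged positions would then be treated as one segment.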
  • FIG. 6 is a flow diagram of a method of extracting the feature vectors for each block.
  • the feature vectors are used to classify the type of audio contained within each block in the segment.
  • the kurtosis calculator 135 calculates the kurtosis for each individual block within a segment.
  • the zero crossing calculator 140 calculates the zero crossing rate for each individual block.
  • the zero-crossing rate is the rate of sign-changes along a signal, i.e., the rate at which the signal changes from positive to negative or back.
  • the silence ratio calculator 142 calculates the silence ratio for each individual block.
  • the silence ratio of an audio segment is the ratio of the samples that are smaller than a given threshold to the length of the segment.
  • the energy calculator 120 calculates the energy of each individual block.
  • the four parameters, energy, kurtosis, zero crossing rate and silence ratio, once weighted, will collectively represent the feature vector of a block.
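The four per-block parameters can be sketched as follows. The formulas are common textbook definitions chosen for illustration, not taken from the source: energy as the sum of squares, kurtosis as m4 / m2², zero crossing rate as the fraction of adjacent sample pairs whose sign differs, and silence ratio with a hypothetical amplitude threshold of 0.01.

```python
from typing import Dict, Sequence


def energy(block: Sequence[float]) -> float:
    """Sum of squared samples."""
    return sum(x * x for x in block)


def kurtosis(block: Sequence[float]) -> float:
    """Population kurtosis m4 / m2**2; returns 0.0 for a constant block."""
    n = len(block)
    mean = sum(block) / n
    m2 = sum((x - mean) ** 2 for x in block) / n
    m4 = sum((x - mean) ** 4 for x in block) / n
    return m4 / (m2 * m2) if m2 > 0 else 0.0


def zero_crossing_rate(block: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose sign differs."""
    if len(block) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(block, block[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(block) - 1)


def silence_ratio(block: Sequence[float], threshold: float = 0.01) -> float:
    """Fraction of samples whose magnitude is below the threshold (hypothetical value)."""
    return sum(1 for x in block if abs(x) < threshold) / len(block)


def block_features(block: Sequence[float]) -> Dict[str, float]:
    """Unweighted parameters of one block; weighting is applied in a later step."""
    return {
        "energy": energy(block),
        "kurtosis": kurtosis(block),
        "zero_crossing_rate": zero_crossing_rate(block),
        "silence_ratio": silence_ratio(block),
    }
```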
  • FIG. 7 illustrates a flow diagram of a method of making a preliminary
  • the weighted averager 145 calculates a multiplication factor, i.e. a weight, to apply to each parameter.
  • the applied weight is one over one standard deviation of each parameter.
  • the weighted parameters are the elements in the feature vector of each block.
  • distance 150 calculates the Euclidean distances between the feature vector and a centroid of pure speech, a centroid of pure music, a centroid of pure background/ambient noise, and a centroid of pure silence. Calculating a Euclidean distance is within the ordinary skill of one schooled in the art of signal processing.
  • the preliminary decider 155 classifies the block based upon the centroid with the lowest Euclidean distance from the feature vector, e.g. if the music centroid has the lowest Euclidean distance from the feature vector, the block will be initially classified as music.
  • each block is assigned a value corresponding to its preliminary classification.
  • An example of such values is illustrated in Table 2.
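The weighting and nearest-centroid steps can be sketched as below. The centroid coordinates in the usage note are invented purely for illustration; the source does not give them, nor does it say over which set of blocks the standard deviation is computed (here it is taken across all blocks supplied). Function names are hypothetical.

```python
import math
import statistics
from typing import Dict, List, Sequence


def inverse_std_weights(feature_rows: Sequence[Sequence[float]]) -> List[float]:
    """One weight per parameter: 1 / standard deviation of that parameter across blocks."""
    weights = []
    for column in zip(*feature_rows):
        std = statistics.pstdev(column)
        # A constant parameter carries no spread; leave it unscaled.
        weights.append(1.0 / std if std > 0 else 1.0)
    return weights


def apply_weights(row: Sequence[float], weights: Sequence[float]) -> List[float]:
    """Weighted parameters form the feature vector of a block."""
    return [value * weight for value, weight in zip(row, weights)]


def nearest_centroid(vector: Sequence[float],
                     centroids: Dict[str, Sequence[float]]) -> str:
    """Label of the centroid with the smallest Euclidean distance to the vector."""
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))
```

For example, with made-up two-dimensional centroids `{"speech": [1, 1], "music": [5, 5], "noise": [9, 9], "silence": [0, 0]}`, the vector `[4.5, 5.2]` lies closest to the music centroid, so the block would receive the preliminary label music.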
  • FIG. 8 illustrates a flow diagram of a method of classifying the audio contained within a segment.
  • the polar 160 averages the assigned numbers of each block within the segment.
  • the polar 160 determines which block classifications deviate by more than a given threshold, e.g. one standard deviation from the average.
  • At step 830, the blocks whose classifications deviate by more than the given threshold are identified as potentially misclassified, i.e. outliers.
  • At step 840, the classification of the outliers is reassigned to the current average.
  • the average of all blocks, including the reclassified outliers, is recalculated.
  • the average is re-calculated with all outliers excluded.
  • the polar 160 determines the difference between the recalculated average and each of the assigned audio values in Table 2. The segment is classified according to the audio with the lowest difference to the recalculated average.
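The polling procedure can be sketched as follows. The numeric values assigned to each class stand in for Table 2, which is not reproduced in this text, so they are placeholders only; the threshold is taken as one standard deviation, as in the example given above. The `exclude_outliers` flag covers the variant in which the average is recalculated without the outliers.

```python
import statistics
from typing import Dict, Sequence

# Placeholder values standing in for Table 2 (not available in the text).
CLASS_VALUES: Dict[str, float] = {"silence": 0.0, "speech": 1.0, "music": 2.0, "noise": 3.0}


def classify_segment(block_labels: Sequence[str],
                     class_values: Dict[str, float] = CLASS_VALUES,
                     exclude_outliers: bool = False) -> str:
    """Poll the preliminary block labels of one segment and return a final label."""
    values = [class_values[label] for label in block_labels]
    average = statistics.fmean(values)
    threshold = statistics.pstdev(values)  # one standard deviation

    def is_outlier(value: float) -> bool:
        return abs(value - average) > threshold

    if exclude_outliers:
        # Variant: drop the outliers before recalculating the average.
        adjusted = [v for v in values if not is_outlier(v)]
    else:
        # Reassign each outlier to the current average, then recalculate.
        adjusted = [average if is_outlier(v) else v for v in values]

    recalculated = statistics.fmean(adjusted)
    # The class whose assigned value is closest to the recalculated average wins.
    return min(class_values, key=lambda name: abs(class_values[name] - recalculated))
```

With eight blocks labeled speech and two labeled noise, the two noise blocks fall more than one standard deviation from the average, are pulled back to it, and the segment as a whole is classified as speech under these placeholder values.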

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method and apparatus for segmenting speech by detecting the pauses between words and/or phrases, said method and apparatus making it possible to determine whether a particular time interval contains speech or non-speech, such as a pause.
PCT/US2017/051160 2016-09-12 2017-09-12 Method and apparatus for exemplary segment classification WO2018049391A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP17849768.1A EP3510592A4 (fr) 2016-09-12 2017-09-12 Method and apparatus for exemplary segment classification
IL265310A IL265310A (en) 2016-09-12 2019-03-12 Method and apparatus for exemplary segment classification

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US15/262,963 US9767791B2 (en) 2013-05-21 2016-09-12 Method and apparatus for exemplary segment classification
US15/262,963 2016-09-12

Publications (1)

Publication Number Publication Date
WO2018049391A1 (fr) 2018-03-15

Family

ID=61562835

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2017/051160 WO2018049391A1 (fr) 2016-09-12 2017-09-12 Method and apparatus for exemplary segment classification

Country Status (3)

Country Link
EP (1) EP3510592A4 (fr)
IL (1) IL265310A (fr)
WO (1) WO2018049391A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111276164A (zh) * 2020-02-15 2020-06-12 中国人民解放军空军特色医学中心 Adaptive voice activation detection device and method for high-noise environments on aircraft

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5617508A (en) * 1992-10-05 1997-04-01 Panasonic Technologies Inc. Speech detection device for the detection of speech end points based on variance of frequency band limited energy
US6629070B1 (en) * 1998-12-01 2003-09-30 Nec Corporation Voice activity detection using the degree of energy variation among multiple adjacent pairs of subframes
US20120253812A1 (en) * 2011-04-01 2012-10-04 Sony Computer Entertainment Inc. Speech syllable/vowel/phone boundary detection using auditory attention cues

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9324319B2 (en) * 2013-05-21 2016-04-26 Speech Morphing Systems, Inc. Method and apparatus for exemplary segment classification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3510592A4 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
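CN111276164A (zh) * 2020-02-15 2020-06-12 中国人民解放军空军特色医学中心 Adaptive voice activation detection device and method for high-noise environments on aircraft
CN111276164B (zh) * 2020-02-15 2021-08-03 中国人民解放军空军特色医学中心 Adaptive voice activation detection device and method for high-noise environments on aircraft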

Also Published As

Publication number Publication date
IL265310A (en) 2019-05-30
EP3510592A4 (fr) 2020-04-29
EP3510592A1 (fr) 2019-07-17

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 17849768; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 2017849768; Country of ref document: EP; Effective date: 20190412)