US20180137880A1 - Phonation Style Detection - Google Patents
Phonation Style Detection
- Publication number
- US20180137880A1 (Application No. US 15/434,164)
- Authority
- US
- United States
- Prior art keywords
- speech
- phonation
- signal
- energy
- speech activity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/84—Detection of presence or absence of voice signals for discriminating voice from noise
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/30—Authentication, i.e. establishing the identity or authorisation of security principals
- G06F21/31—User authentication
- G06F21/32—User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/70—Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer
- G06F21/71—Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer to assure secure computing or processing of information
- G06F21/72—Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer to assure secure computing or processing of information in cryptographic circuits
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/70—Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer
- G06F21/78—Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer to assure secure storage of data
- G06F21/79—Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer to assure secure storage of data in semiconductor storage media, e.g. directly-addressable memories
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L45/00—Routing or path finding of packets in data switching networks
- H04L45/02—Topology update or discovery
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L45/00—Routing or path finding of packets in data switching networks
- H04L45/22—Alternate routing
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/16—Implementing security features at a particular protocol layer
- H04L63/162—Implementing security features at a particular protocol layer at the data link layer
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/002—Countermeasures against attacks on cryptographic mechanisms
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/90—Pitch determination of speech signals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L2209/00—Additional information or applications relating to cryptographic mechanisms or cryptographic arrangements for secret or secure communication H04L9/00
- H04L2209/08—Randomization, e.g. dummy operations or using noise
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L2209/00—Additional information or applications relating to cryptographic mechanisms or cryptographic arrangements for secret or secure communication H04L9/00
- H04L2209/12—Details relating to cryptographic hardware or logic circuitry
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L45/00—Routing or path finding of packets in data switching networks
- H04L45/54—Organization of routing tables
Abstract
The invention provides a method for detecting phonation style in dynamic communication environments and for making software control decisions based on the detected style, enabling an audio message to be classified by phonation style such as, but not limited to: normal phonation, whispered phonation, softly spoken speech phonation, high-level phonation, babble phonation, and non-voice sounds. The purpose of the invention is to introduce phonation style as a way to control computer software.
Description
- The invention described herein may be manufactured and used by or for the Government for governmental purposes without the payment of any royalty thereon.
- Phonation is the rapid, periodic opening and closing of the glottis through separation and apposition of the vocal folds that, accompanied by breath under lung pressure, constitutes a source of vocal sound. Technology exists that detects sound created through phonation and attempts to understand and decode the sounds. Examples include Siri, automated phone menus, and Shazam. Limitations of the current technologies mean that they work well only when the speaker uses a normal phonation style (not loud, babbled, whispered, or pitch-altered speech) and they assume that the speaker wants to be heard and understood.
- Phonation style refers to different speaking styles, which may include normal phonation, whispered speech phonation, low-level speech phonation, high-level speech phonation, and babble phonation. Babble phonation occurs when more than one speaker is talking at the same time. The current state of the art has considered noise-degraded speech applications where the goal is to extract the target speech or suppress the interfering speech, but it lacks a decision-making process for addressing different phonation styles. For most speech recognition processes, the speech recognition algorithm tries to recognize whatever data it is given. Using a probabilistic approach, the algorithm tries to decipher the most likely sequence of spoken words. In constrained environments where the vocabulary is limited, the speech recognition algorithm can be successful even in noisy conditions due to prior knowledge about the potential sequence. For example, if the spoken words are ‘Find the nearest liquor store’, the speech recognizer only needs to recognize ‘liquor’ and it can then guess what the rest of the words are or, at least, get close to the lexical content. Detecting babble speech (where more than one speaker is speaking) allows the speech to be withheld from a speech recognizer, since the output would be unreliable.
- Current methods also experience issues with degraded speech, which can occur in almost any communication setting. Typically, speech degradation is assumed to be due to environmental noise or communication channel artifacts. However, speech degradation can also occur due to changes in phonation style. A speech processing algorithm that is trained on normally phonated speech but is given whispered or high-volume phonation style speech will quickly degrade and create nonsensical outputs. Instead of assuming a phonation style, a pre-processing technique is used to classify the speech as normally-phonated, whispered, low-level volume, high-level volume, or babble speech. Features are extracted and analyzed so as to make a decision. Many of the same features, such as spectral tilt, energy, and envelope shape, are standard for each phonation class. However, each phonation class may also have features specific to that class. Based on the outcome of the pre-processing, a decision tree is used to take the next appropriate step, as sketched below. For example, if speech recognition is the desired next step and the pre-processor indicated whispered speech, the data would be sent to a speech recognizer that is trained on whispered speech.
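- As a rough illustration of this routing step, the sketch below dispatches a detected phonation label to a style-specific recognizer. The function, the label strings, and the `recognizers` mapping are hypothetical conveniences for illustration; the disclosure does not prescribe any particular API.

```python
# Hypothetical sketch: route audio to a recognizer trained for the detected
# phonation style; all names here are illustrative, not part of the disclosure.

def recognize_by_style(audio, style, recognizers):
    """Send audio to the recognizer matching the detected phonation style.

    recognizers maps style labels (e.g. "normal", "whisper") to callables.
    Babble is rejected rather than recognized, since its output would be unreliable.
    """
    if style == "babble":
        return None  # more than one speaker: recognizer output would be unreliable
    recognizer = recognizers.get(style, recognizers["normal"])
    return recognizer(audio)
```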
- In an unconstrained, dynamically changing environment, speech recognizers have not succeeded in accurately recognizing the spoken dialogue. The task is complicated by multiple speakers, noisy environments, and unstructured lexical information. The current state of the art lacks an approach for uniquely classifying the various phonation styles; it also does not address how to make appropriate follow-on decisions. The current technology also assumes that there is only one speaker and that the speaker desires to be understood. When there are multiple speakers or the speaker wishes to obfuscate his/her communication, the output of speech recognizers quickly degrades into nonsensical lexical information.
- It is therefore an object of the invention to classify speech into several possible phonation styles.
- It is a further object of the invention to extract from speech signals characteristics which lead to the classification of several phonation styles.
- It is yet a further object of the present invention to make a determination as to which of several phonation styles is present based on the detection of the presence or absence of extracted speech characteristics.
- Briefly stated, the invention provides a method for detecting phonation style in dynamic communication environments and for making software control decisions based on the detected style, enabling an audio message to be classified by phonation style such as, but not limited to: normal phonation, whispered phonation, softly spoken speech phonation, high-level phonation, babble phonation, and non-voice sounds. The purpose of the invention is to introduce phonation style as a way to control computer software.
- In an embodiment of the invention, a method for phonation style detection, comprises detecting speech activity in a signal; extracting signal features from the detected speech activity; characterizing the extracted signal features; and performing a decision process on the characterized signal features which determines whether the detected speech activity is normally spoken speech, loudly spoken speech, softly spoken speech, whisper speech, babble, or non-voice sound.
- The above, and other objects, features and advantages of the invention will become apparent from the following description read in conjunction with the accompanying drawings, in which like reference numerals designate the same elements.
-
- [1] Stanley J. Wenndt, Edward J. Cupples; “Method and apparatus for detecting illicit activity by classifying whispered speech and normally phonated speech according to the relative energy content of formants and fricatives”; U.S. Pat. No. 7,577,564; Issued: Aug. 18, 2009
- [2] Darren M. Haddad, Andrew J. Noga; “Generalized Harmonicity Indicator”; U.S. Pat. No. 7,613,579; Issued: Nov. 3, 2009
- [3] Deller, Jr., J. R., Proakis, J. G., Hansen, J. H. L. (1993), “Discrete-Time Processing of Speech Signals,” New York, N.Y.: Macmillan Publishing Company, pp. 234-240.
- [4] Deller, Jr., J. R., Proakis, J. G., Hansen, J. H. L. (1993), “Discrete-Time Processing of Speech Signals,” New York, N.Y.: Macmillan Publishing Company, pp. 110-115.
- [5] Peter J. Watson, Angela H. Ciccia, Gary Weismer; “The relation of lung volume initiation to selected acoustic properties of speech”; Journal Acoustical Society of America 113 (5), May 2003; pages 2812-2819.
- [6] Zhi Tao; Xue-Dan Tan; Tao Han; Ji-Hua Gu; Yi-Shen Xu; He-Ming Zhao (2010), “Reconstruction of Normal Speech from Whispered Speech Based on RBF Neural Network,” 2010 Third International Symposium on Intelligent Information Technology and Security Informatics.
- [7] Krishnamachari, K. R.; Yantorno, R. E.; Lovekin, J. M.; Benincasa, D. S.; Wenndt, S. J., “Use of local kurtosis measure for spotting usable speech segments in co-channel speech,” 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, Proceedings.
- [8] Kizhanatham, A. R.; Yantorno, R. E.; Smolenski, B. Y., “Peak difference of autocorrelation of wavelet transform (PDAWT) algorithm based usable speech measure,” 2003 7th World Multiconference on Systemics, Cybernetics and Informatics Proceedings
- [9] Yantorno, R. E.; Smolenski, B. Y.; Chandra, N.; “Usable speech measures and their fusion,” Proceedings of the 2003 International Symposium on Circuits and Systems.
- [10] Krishnamurthy, N.; Hansen, J. H. L.; “Speech babble: Analysis and modeling for speech systems”; IEEE International Conference on Acoustics, Speech and Signal Processing, 2008, ICASSP 2008; pages 4505-4508
- [11] Hayakawa, Makoto; Fukumori, Takahiro; Nakayama, Masato; Nishiura, Takanobu, “Suppression of clipping noise in observed speech based on spectral compensation with Gaussian mixture models and reference of clean speech,” Proceedings of Meetings on Acoustics-ICA 2013.
- FIG. 1 depicts the present invention's process for feature extraction and phonation style classification.
- FIG. 2 depicts the present invention's decision process for detection and phonation classification based on extracted features.
- FIG. 3 depicts a measured time domain and spectrogram for normally-phonated speech.
- FIG. 4 depicts a measured time domain and spectrogram for low-level speech.
- FIG. 5 depicts a measured time domain and spectrogram for high-level speech.
- FIG. 6 depicts a measured time domain and spectrogram for whispered speech.
- FIG. 7 depicts a measured time domain and spectrogram for babble speech.
- The invention described herein allows an audio stream to be analyzed and classified based on the phonation style and then, depending on the application, a software control process is able to make appropriate control decisions. The goal of spoken language is to communicate. Communication occurs when the intended recipient of the spoken message receives the intended message from the speaker. While communication can occur based on body language, facial expressions, written words, and the spoken message, this invention addresses phonation style detection by using only the spoken message.
- For typical speech applications, normal phonation is assumed and the audio applications attempt to process all the incoming audio. Some audio applications try to detect aberrations to this assumption, for example, if the energy level drops significantly. In this case, the algorithm may label it as whispered speech. This invention is unique in that no assumptions are made as to the type of phonation style. Instead, a multi-feature set is used to classify the phonation style.
- People communicate in different phonation styles depending on the communication setting and lexical intent. If a person wishes to conceal the spoken message, then whispering may be employed. Whisper speech occurs when there is no vocal fold vibration and the speech is characterized by a noise-like or hissing quality. For low-level speech, vocal fold vibrations occur, but the sound level is reduced compared to normally phonated speech. If a person wishes to emphasize a point or needs to be heard above a noisy environment, then a high-level phonation style may be employed. Additionally, babble phonation may exist where more than one speaker is talking simultaneously. Other types of phonation are possible depending on the environment. For example, loud whispering may be used for dramatic purposes or emphasis of an opinion. This invention has applicability to any speech processing application.
- Referring to FIG. 1, which displays the invention's feature extraction and classifier approach to phonation style detection, and to FIG. 2, which provides the decision tree process in which the extracted features are analyzed: as in FIG. 1, the invention allows an audio stream to be classified based on the phonation style and then, as in FIG. 2, depending on the application, a software control process is able to make appropriate control decisions. The goal of spoken language is to communicate. Communication occurs when the intended recipient of the spoken message receives the intended message from the speaker. While communication can occur based on body language, facial expressions, written words, and the spoken message, this invention addresses phonation style detection by using only the spoken message.
- For typical speech applications, normal phonation is assumed and the audio applications attempt to process all the incoming audio. Some audio applications try to detect aberrations to this assumption, for example, if the energy level drops significantly. In this case, the algorithm may label it as whispered speech. Earlier approaches, such as [1], only assumed two phonation states: normally-phonated speech or whispered speech. This invention is unique in that no assumptions are made as to the type of phonation style. Instead, a multi-feature set is used to classify the phonation style (see FIG. 2).
- Referring to FIG. 2, the first step in classifying the signal is to decide if there is speech activity or not 110. FIG. 2 also shows how the invention delineates the phonation style once speech activity has been found.
- Referring back to FIG. 1, the signal is processed in small blocks of data that are typically 10-30 milliseconds in duration. Features such as energy, pitch extraction, autocorrelation, spectral tilt, and/or cepstral coefficients, coupled with classifiers and learning algorithms such as support vector machines, neural networks, and Gaussian mixture models, are typical approaches to detect speech activity. If speech activity is detected 110, and signal features are extracted 120, 130, 140, 150, 160 and characterized through a fusion and decision process 170 under software control 180, then a multi-step, multi-feature approach is used to determine the phonation style as depicted in FIG. 2.
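- As a minimal sketch of the block processing just described, the Python fragment below frames a 16 kHz mono signal into short overlapping blocks and computes two simple per-frame measures often fed to a speech activity classifier. The 25 ms frame length, 10 ms hop, and the particular features are illustrative choices, not values fixed by this disclosure.

```python
import numpy as np

def frame_signal(x, fs=16000, frame_ms=25, hop_ms=10):
    """Split signal x into overlapping short-time frames (10-30 ms is typical)."""
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])

def sad_features(frames):
    """Per-frame energy and zero-crossing rate, two common speech-activity cues."""
    energy = np.sum(frames.astype(float) ** 2, axis=1)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return energy, zcr
```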
- Referring again to FIG. 2, the first decision point is to measure whether there is harmonic information 210 in the signal. A phoneme, or a unit of a voice sound, has a fundamental frequency associated with it, which is determined by how many times the vocal folds open and close in one second (given in units of Hz). Harmonics are produced when the fundamental frequency resonates within a cavity; in the case of voice, that cavity is the vocal tract. Harmonics are integer multiples of the fundamental frequency: the fundamental frequency is the first harmonic, the next octave up is the second harmonic, and so on. There are many methods to determine the harmonics of a voice sample; one method is found in [2].
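- For illustration, one plausible way to test a frame for harmonic structure (the disclosure cites [2] but does not mandate a particular method) is to find the strongest spectral peak in the expected pitch range and measure how much energy lies at integer multiples of that candidate fundamental. The pitch range, window, and harmonic count below are assumed values, and with 10-30 ms frames the frequency resolution is coarse, so practical systems typically refine the estimate.

```python
import numpy as np

def harmonicity(frame, fs=16000, f0_lo=70.0, f0_hi=400.0, n_harmonics=5):
    """Rough harmonicity score: energy at integer multiples of the strongest
    in-range spectral peak, as a fraction of total spectral energy."""
    window = np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(frame * window))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    total = np.sum(spectrum ** 2)
    in_range = (freqs >= f0_lo) & (freqs <= f0_hi)
    if total == 0.0 or not np.any(in_range):
        return 0.0
    f0 = freqs[in_range][np.argmax(spectrum[in_range])]  # candidate fundamental
    bin_width = freqs[1] - freqs[0]
    harmonic_energy = 0.0
    for k in range(1, n_harmonics + 1):
        idx = int(round(k * f0 / bin_width))
        if idx < len(spectrum):
            harmonic_energy += spectrum[idx] ** 2
    return harmonic_energy / total
```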
- If insufficient harmonics are found in the decision block 210, then an energy measure 220 is used to decide if the signal is either non-voice sounds 230, softly spoken speech 250, or whisper 260. There are many ways to compute the energy of a signal. The easiest way is in the time domain, by summing the squared values of the signal: $E=\sum_{n=-\infty}^{+\infty} s^2(n)$ [3]. Since speech is time-varying and may change very rapidly, a windowing function is applied: $E(n)=\sum_{m=0}^{N-1}\left[w(m)\,s(n-m)\right]^2$, where $w(m)$ is a weighting function such as a rectangular, Hamming, or triangular window. The length of the window tends to be approximately 10-30 milliseconds due to the time-varying nature of speech signals. Additionally, the window length should encompass at least one pitch period, but not too many pitch periods. There are many variations of computing the energy, such as using the absolute value instead of the squared value, the log of the energy, or a restricted frequency range. In the frequency domain, $E=\sum_{f=f_1}^{f_2}\lvert X[f]\rvert^2$ [4], where $X[f]$ is the Fourier transform of the signal. The bandwidth can be the full bandwidth between $f_1=0$ and $f_2=f_s/2$, where $f_s$ is the sampling rate and $f_s/2$ is the Nyquist frequency; in this case the frequency domain energy calculation is equivalent to the time domain calculation. Alternatively, the bandwidth may be a reduced frequency range that avoids unreliable or noisy regions. The energy can also be computed using autocorrelation values. The important concept is not how the energy is calculated, but how to use the energy levels: higher energy coupled with harmonics is indicative of normally spoken speech, while lower energy is indicative of softly spoken or whispered speech.
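- The two energy definitions above translate directly into code. The sketch below computes the windowed time-domain energy and a band-limited frequency-domain energy; the Hamming window and the band edges are illustrative, and exact agreement between the two quantities additionally requires Parseval-style normalization of the FFT.

```python
import numpy as np

def short_time_energy(frame):
    """E = sum over the frame of [w(m) * s(m)]^2, here with a Hamming window."""
    w = np.hamming(len(frame))
    return np.sum((w * frame) ** 2)

def band_energy(frame, fs=16000, f1=0.0, f2=None):
    """Frequency-domain energy summed between f1 and f2 (defaults to fs/2),
    so a reduced band can be chosen to avoid noisy or unreliable regions."""
    if f2 is None:
        f2 = fs / 2.0
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    band = (freqs >= f1) & (freqs <= f2)
    return np.sum(spectrum[band] ** 2)
```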
- If the signal fails the harmonics test 210 but has higher energy 220, the phonation decision would be labeled Non-Voice sounds 230. If the signal is non-voice sounds, the next step may be to avoid these regions of the signal or to employ an additional step to classify the non-voice signal.
- If the signal fails the harmonics test 210 but has lower energy 220, the next step is to test for voicing 240. Voicing occurs as the vocal folds open and close due to air pressure building up behind the vocal folds and then being released [5]. The rate of this opening-and-closing vibration determines the fundamental frequency, or pitch period, of the speech signal. The bursts of air released as the vocal folds open and close become an excitation source for the vocal cavity. Vowels are examples of sustained voicing where the pitch period is quasi-periodic. For males, the vocal folds open and close at about 110 cycles per second; for females, at about 250 cycles per second. Unvoiced sounds occur in speech where the vocal folds are not vibrating. For unvoiced speech, such as the fricatives /s/ or /f/, the vocal folds are not vibrating and a constriction in the vocal tract gives the speech a noise-like, turbulent quality. Various combinations of these two main voicing states (voiced and unvoiced) allow other states to be reached, such as voiced fricatives (/z/) or whispered speech, where no vocal fold vibration occurs, even in the vowels. A voicing test along these lines is sketched below.
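- As one simple, illustrative voicing test (the disclosure does not fix a method), a frame can be checked for a strong normalized autocorrelation peak at a lag corresponding to a plausible pitch period, roughly 70-400 Hz here; the threshold is a placeholder.

```python
import numpy as np

def is_voiced(frame, fs=16000, f0_lo=70.0, f0_hi=400.0, threshold=0.3):
    """Return True when the frame shows quasi-periodicity in the pitch range."""
    frame = frame - np.mean(frame)
    if np.sum(frame ** 2) == 0.0:
        return False
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac / ac[0]                       # normalize so the lag-0 value is 1
    lag_min = int(fs / f0_hi)             # shortest pitch period of interest
    lag_max = min(int(fs / f0_lo), len(ac) - 1)
    if lag_min >= lag_max:
        return False
    peak = np.max(ac[lag_min:lag_max + 1])
    return peak > threshold               # strong periodic peak -> voiced
```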
- If there is voicing 240, albeit at low energy levels 220, then the signal would be labeled as Softly Spoken speech 250. A decision process for softly spoken speech may be to amplify the signal in order to make it more audible. If there is no voicing 240 with low energy levels 220, then the signal would be labeled as Whisper speech 260.
- Detecting whisper speech may have many applications for hearing-impaired people. An application may be to amplify the low-energy sounds or to convert the whispered speech to normally phonated speech [6]. In addition to having less energy overall, the energy in whispered speech is concentrated at higher frequencies compared to normal phonation. This combination of lower energy at higher frequencies makes whispered speech very troublesome for people with hearing loss, so detecting and converting whispered speech to normal speech could be very beneficial for the hearing impaired.
- If harmonics are found in the decision block 210, then a mixed excitation measure 270 could be used to decide if the signal is either babble 280, loudly spoken speech 300, or normally phonated speech 310. Mixed excitations are detected 270 when the pitch is changing rapidly, such as at a transition between phonemes, or when two or more speakers are talking at the same time. For single-talker mixed excitations, these regions will be short, with a duration of about 10-20 milliseconds. For multi-talker mixed excitation, these regions will be longer and occur when one speaker's speech is corrupted by another speaker's speech. U.S. Pat. No. 7,177,808 B addresses how to improve speaker identification when multiple speakers are present by finding the “usable” (single-talker excitation) speech and processing only the usable speech. As with the energy measurement, there are many ways to estimate the usable speech, using techniques such as kurtosis, the linear predictive residual, autocorrelation, and the wavelet transform, to name a few [7], [8], [9].
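- As a very rough illustration of one of the measures named above, the sketch below uses sample kurtosis as a single-talker indicator: voiced single-talker speech tends to be more peaked than overlapping-talker babble, which is closer to Gaussian. The usable-speech detectors in [7]-[9] are considerably more elaborate, and the threshold here is a placeholder.

```python
import numpy as np
from scipy.stats import kurtosis

def single_talker_score(frame):
    """Excess kurtosis of the frame samples; higher suggests a single talker."""
    return kurtosis(np.asarray(frame, dtype=float), fisher=True)

def is_mixed_excitation(frame, kurtosis_threshold=1.0):
    """Label a frame as mixed excitation when its distribution is near Gaussian.
    Real systems would calibrate this threshold on labeled data."""
    return single_talker_score(frame) < kurtosis_threshold
```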
- If harmonics are detected 210 in the signal, but the signal has mixed excitations 270, the phonation decision would be to label the signal as Babble (multiple speakers) sounds 280 [10]. Once again, what matters is not how the mixed excitations are estimated, but knowing whether one talker or multiple talkers are present. Applications for babble speech may be to avoid that region, label it as unreliable, or try to separate the speakers. If there is no mixed excitation detected 270, then one talker is present and the next step is to look for clipping 290.
- Clipping is a form of waveform distortion that occurs when the speaker speaks loudly enough to overdrive the audio amplifier, so that the voltage or current representing the speech exceeds its maximum capability. This typically occurs when the speaker is shouting. If too much clipping 290 occurs, the signal is labeled as Loudly Spoken 300. Applications for detecting Loudly Spoken speech may be to attenuate the signal, detect emotions, or mitigate the effects of clipping [11]. If there is little or no clipping, the signal is labeled as Normal Spoken speech 310. Typical applications for Normal Spoken speech include speech recognition, speaker identification, and language identification. The full decision cascade is sketched below.
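- Putting the decision points of FIG. 2 together, the sketch below mirrors that branching logic. It reuses the illustrative helpers defined earlier (harmonicity, short_time_energy, is_voiced, is_mixed_excitation), and every threshold is a placeholder that a real system would tune on data; none of these values comes from the disclosure.

```python
import numpy as np

def clipping_fraction(frame, full_scale=1.0, margin=0.99):
    """Fraction of samples at or near full scale, a simple clipping indicator."""
    return np.mean(np.abs(frame) >= margin * full_scale)

def classify_phonation(frame, fs=16000,
                       harmonic_thr=0.5, energy_thr=1e-3, clip_thr=0.01):
    """Illustrative mirror of the FIG. 2 decision tree for one frame."""
    if harmonicity(frame, fs) < harmonic_thr:        # insufficient harmonics 210
        if short_time_energy(frame) >= energy_thr:   # higher energy 220
            return "non-voice"                       # 230
        if is_voiced(frame, fs):                     # voicing 240
            return "softly spoken"                   # 250
        return "whisper"                             # 260
    if is_mixed_excitation(frame):                   # mixed excitation 270
        return "babble"                              # 280
    if clipping_fraction(frame) > clip_thr:          # clipping 290
        return "loudly spoken"                       # 300
    return "normally spoken"                         # 310
```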
- Referring to FIG. 3, normally-phonated speech is characterized by sustained energy and vowels. The graph clearly shows the onsets and offsets of speech, and the energy has a larger standard deviation due to the distinct classes of silence, unvoiced speech, and voiced speech. For the voiced speech there is strong harmonicity, which can be measured by a pitch estimator. Features for normally phonated speech may include total energy, standard deviation of the energy, spectral tilt, envelope shape, and harmonicity measures.
- Referring to FIG. 4, for low-level speech phonation the energy levels are lower for the voiced and unvoiced regions. There is still a glottal pulse, but the various regions of silence, voiced, and unvoiced speech are not as distinct, especially if some background noise is present. The dynamic range of the energy levels is reduced. The voiced speech regions may be shorter and the unvoiced regions may be longer. Features for low-level speech phonation may include the same features as normally phonated speech (total energy, standard deviation of the energy, spectral tilt, envelope shape, and harmonicity measures), but the typical values for these measurements will be different.
- Referring to FIG. 5, for high-level speech phonation the energy levels are higher for the voiced and unvoiced regions. The glottal pulse is very strong due to the strong airflow causing the vocal folds to abduct and adduct. The voiced speech regions may be longer and the unvoiced regions may be shorter. Features for high-level speech phonation may include the same features as normally phonated speech (total energy, standard deviation of the energy, spectral tilt, envelope shape, and harmonicity measures), but the typical values for these measurements will be different. Clipping may be an additional feature to detect high-level speech phonation.
- Referring to FIG. 6, whispered-speech phonation is characterized by lower volume levels (sound pressure) and the lack of a glottal excitation. Whisper speech occurs when there is no vocal fold vibration, and the speech is characterized by a noise-like or hissing quality. There will be few, if any, voiced speech regions. The same features of total energy, standard deviation of the energy, spectral tilt, envelope shape, and harmonicity measures can be used to detect whispered-speech phonation. However, there will be less energy in the lower frequency regions compared to normally phonated speech.
- Referring to FIG. 7, babble-speech phonation occurs when there is more than one speaker talking at the same time. This leads to fewer silence regions and to mixed excitation, where unvoiced and voiced speech overlap. There will be fewer unvoiced regions. The same features apply, but the reduced amount of silence and unvoiced regions provides an extra clue about the phonation style.
- Other phonation styles may exist, such as loud whispering for dramatic purposes, but the use of the same feature set, along with a unique feature for loud whispering, would still allow for successful detection of the new phonation style. The phonation style detection is not geared towards any one speech processing application but provides a decision point for how to proceed. A sketch of the shared feature set is given below.
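- As an illustration of that recurring feature set (total energy, standard deviation of the energy, spectral tilt, envelope shape, and harmonicity), the sketch below collects one vector per utterance from the short-time frames. It reuses the short_time_energy and harmonicity helpers defined earlier, and the spectral-tilt and envelope-shape measures are simple stand-in proxies rather than definitions from the disclosure.

```python
import numpy as np

def phonation_features(frames, fs=16000):
    """Per-utterance feature vector over short-time frames (illustrative proxies)."""
    energies = np.array([short_time_energy(f) for f in frames])
    harms = np.array([harmonicity(f, fs) for f in frames])
    # Spectral tilt proxy: log ratio of low-band to high-band energy of the mean spectrum.
    spectrum = np.mean(np.abs(np.fft.rfft(frames, axis=1)), axis=0)
    half = len(spectrum) // 2
    tilt = (np.log10(np.sum(spectrum[:half] ** 2) + 1e-12)
            - np.log10(np.sum(spectrum[half:] ** 2) + 1e-12))
    # Envelope shape proxy: dynamic range of the smoothed frame-energy contour.
    envelope = np.convolve(energies, np.ones(5) / 5.0, mode="same")
    env_range = float(np.max(envelope) - np.min(envelope))
    return np.array([np.sum(energies), np.std(energies), tilt,
                     env_range, np.mean(harms)])
```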
- Referring to FIG. 1, the pre-processing step to phonation style detection is to extract the speech signal, including but not limited to extraction from a microphone or an interception of a transmission containing speech. The speech is then analyzed by a speech activity detector (SAD) 110 as shown in FIG. 1. The SAD analyzes a segment of audio data and determines if the signal is speech, silence, or background noise. If the SAD 110 detects speech, the feature extraction section 120, 130, 140, 150, 160 follows. The feature extraction section is used to classify the phonation of an individual's speech. The features include but are not limited to a harmonic measurement, signal energy measurement, voice activity detector, time domain measurements, and frequency domain measurements. Some systems may utilize more features, while others may utilize less.
- The information fusion/decision 170 is an algorithm under software control 180 in which a statistical classifier detects the type of phonation. This statistical classifier can be of any type, such as but not limited to iVectors, Gaussian mixture models, support vector machines, and/or neural networks. One possible realization is sketched below.
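- As one concrete possibility, the fusion/decision step could fit a Gaussian mixture model per phonation class (or, alternatively, a support vector machine) over the feature vectors sketched earlier. The scikit-learn calls below are an assumption about tooling, for illustration only; the disclosure allows any statistical classifier.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def train_gmm_per_class(features_by_class, n_components=4):
    """Fit one GMM per phonation class from a dict of {label: feature matrix}."""
    return {label: GaussianMixture(n_components=n_components).fit(np.asarray(X))
            for label, X in features_by_class.items()}

def gmm_classify(models, feature_vector):
    """Pick the phonation label whose GMM gives the highest log-likelihood."""
    x = np.asarray(feature_vector).reshape(1, -1)
    return max(models, key=lambda label: models[label].score(x))

def train_svm(feature_matrix, labels):
    """Discriminative alternative trained on the same feature vectors."""
    return SVC(kernel="rbf").fit(feature_matrix, labels)
```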
- The information fusion and decision 170 will output the type of phonation, such as, but not limited to, normal phonation, whispered speech phonation, low-level speech phonation, high-level speech phonation, and babble phonation. The output can feed a number of software applications, such as, but not limited to, speech recognition systems, speaker identification systems, language identification systems, and/or video gaming applications. For video gaming and simulated war games, the phonation style becomes part of the game, where how a person speaks is as important as what is being said and how the controller is being used. Additionally, phonation style detection can be used in hearing aids to detect whisper speech and convert the high-frequency, non-phonated information to normally phonated speech in a lower frequency region. Phonation style detection can also be used for voice coaching, to give the subject feedback on his/her pronunciation and style of pronunciation.
- Uses of the present invention include but are not limited to the following:
- PTSD detection;
- Screening/testing: thyroid cancer screening, eHealth over the phone diagnosis, stroke detection, intoxication detection, heart attack detection;
- Hearing aids;
- Pain management monitoring, pain level detection for non-verbal patients (including animals);
- Improved closed caption translation;
- Military—information exploitation, simulated war games;
- Gaming software—vocal control;
- Voice coaching;
- Call center—customer service rep assist, auditory cues, suicide prevention;
- Internet of Things—robotic caregivers, accents for robots, appliance language tone/accent;
- Microphone improvement;
- Apps—voice change tracker, meditation alert, health monitor, mood/calming, whisper transcription, tonal detection and feedback, PIN/code replacement, voice to text with emotion, mental health monitor;
- Security—prioritized listening, voice interception, scanning chatter, crowd monitoring, airport monitoring/TSA, profiling, threat detection, interrogation;
- Having described preferred embodiments of the invention with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various changes and modifications may be effected therein by one skilled in the art without departing from the scope or spirit of the invention as defined in the appended claims.
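The overall pipeline described with reference to FIG. 1 — speech activity detection on short blocks, feature extraction, and an information fusion/decision stage — could be outlined as follows. This is a minimal, illustrative sketch, not the patented implementation: the energy threshold, the block-level cues, and the rule-based decision (modeled loosely on the classification conditions recited in the claims) are all assumptions, and any statistical classifier such as iVectors, Gaussian mixture models, support vector machines, or neural networks could stand in for the hand-written rules.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SegmentMeasures:
    """Per-segment cues assumed for illustration."""
    has_harmonics: bool     # harmonicity above a chosen threshold
    low_energy: bool        # total energy below a chosen threshold
    voicing: bool           # voiced frames present
    clipping: bool          # samples at or near full scale
    mixed_excitation: bool  # overlapping voiced and unvoiced excitation

def is_speech(block, energy_floor=1e-4):
    """Toy speech activity decision on one 10-30 ms block; a real SAD would also
    use pitch, autocorrelation, spectral tilt, and cepstral coefficients,
    possibly coupled with a trained classifier."""
    return float(np.mean(np.square(block))) > energy_floor

def classify_phonation(m: SegmentMeasures) -> str:
    """Rule-based decision stage; the rule ordering is an assumption."""
    if not m.has_harmonics:
        if m.low_energy and m.voicing:
            return "softly spoken speech"
        if m.low_energy and not m.voicing:
            return "whisper speech"
        return "non-voice sound"
    if m.mixed_excitation:
        return "babble"
    if m.clipping:
        return "loudly spoken speech"
    return "normally spoken speech"
```

In this sketch, blocks rejected by is_speech are skipped, and the measures accumulated over the accepted blocks feed classify_phonation; a trained statistical classifier operating on the same features would replace the rules in a deployed system.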
Claims (16)
1. A method for phonation style detection, comprising:
detecting speech activity in a signal;
extracting signal features from said detected speech activity;
characterizing said extracted signal features; and
performing a decision process on said characterized signal features which determines whether said detected speech activity is one of normally spoken speech, loudly spoken speech, softly spoken speech, whisper speech, babble, and non-voice sound.
2. In the method of claim 1, characterizing further comprises characterizing said extracted signal features in terms of harmonic measure, signal energy, mixed-excitation, clipping, and voicing.
3. In the method of claim 2, performing a decision process further comprises
classifying said speech activity as non-voice sounds in the absence of harmonics and low energy signal features.
4. In the method of claim 2, performing a decision process further comprises
classifying said speech activity as softly spoken speech in the absence of harmonics and low energy signal features but in the presence of voicing signal features.
5. In the method of claim 2, performing a decision process further comprises
classifying said speech activity as babble in the presence of harmonics and mixed excitation signal features.
6. In the method of claim 2, performing a decision process further comprises
classifying said speech activity as loudly spoken speech in the presence of harmonics and clipping but in the absence of mixed excitation signal features.
7. In the method of claim 2, performing a decision process further comprises
classifying said speech activity as normally spoken speech in the presence of harmonics but in the absence of clipping and mixed excitation signal features.
8. In the method of claim 2, performing a decision process further comprises
classifying said speech activity as whisper speech in the absence of harmonics and voicing signal features but in the presence of low energy signal features.
9. In the method of claim 1 , speech activity detection is performed on substantially 10 to 30 millisecond blocks of said signal.
10. In the method of claim 1 , speech activity detection further comprises measurement of any one of the following: energy, pitch extraction, autocorrelation, spectral tilt, and cepstral coefficients.
11. In the method of claim 10 , said speech activity detection further comprises coupling said measurements with classifiers and learning algorithms.
12. In the method of claim 11 , said classifiers and learning algorithms are selected from the group comprising support vector machines, neural networks, and Gaussian mixture models.
13. In the method of claim 10 , said measurement of energy further comprises computing discrete signal energy in the time domain according to:
$$E(n)=\sum_{m=0}^{N-1}\left[w(m)\,s(n-m)\right]^{2}$$
where
w(m) is a weighting function;
s is signal amplitude;
n is a current discrete time sample;
m is a current discrete time sample within a window of time; and
E(n) is the computed signal energy over said time window.
14. In the method of claim 13 , said weighting function w(m) is selected from the group consisting of rectangular, Hamming, and triangular window functions.
15. In the method of claim 10 , said measurement of energy further comprises computing signal energy in the frequency domain according to:
$$E=\sum_{f=f_{1}}^{f_{2}}\left|X[f]\right|^{2}$$
where
X[f] is the Fourier transform of said signal;
f1 and f2 are the Fourier transform frequency limits; and
E is the computed energy of said signal.
16. In the method of claim 15 , f1 is 0 and f2 is one-half the Nyquist rate.
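The time-domain energy of claim 13 and the frequency-domain energy of claim 15 can be checked numerically with a short sketch. The signal, window choice, and bin range below are assumptions for illustration; with a rectangular window spanning the whole signal and all DFT bins, Parseval's theorem ties the two quantities together up to the 1/N factor implied by numpy's unnormalized FFT convention.

```python
import numpy as np

def time_domain_energy(s, n, window):
    """E(n) = sum_{m=0}^{N-1} [w(m) * s(n - m)]^2  (per claim 13)."""
    N = len(window)
    seg = np.array([s[n - m] for m in range(N)])   # s(n-m) for m = 0..N-1
    return float(np.sum((window * seg) ** 2))

def frequency_domain_energy(s, f1, f2):
    """E = sum_{f=f1}^{f2} |X[f]|^2 over DFT bins f1..f2  (per claim 15)."""
    X = np.fft.fft(s)
    return float(np.sum(np.abs(X[f1:f2 + 1]) ** 2))

# Illustrative check with assumed values: a rectangular window over the whole
# signal and the full set of DFT bins. Parseval's theorem then gives
# time-domain energy == frequency-domain energy / len(s).
rng = np.random.default_rng(0)
s = rng.standard_normal(256)
E_time = time_domain_energy(s, n=len(s) - 1, window=np.ones(len(s)))
E_freq = frequency_domain_energy(s, f1=0, f2=len(s) - 1)
assert np.isclose(E_time, E_freq / len(s))
```

Restricting the bins to f1 = 0 through one-half the Nyquist rate, as in claim 16, simply limits the sum to the non-negative-frequency portion of the spectrum.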
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/434,164 US20180137880A1 (en) | 2016-11-16 | 2017-02-16 | Phonation Style Detection |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201662422611P | 2016-11-16 | 2016-11-16 | |
US15/434,164 US20180137880A1 (en) | 2016-11-16 | 2017-02-16 | Phonation Style Detection |
Publications (1)
Publication Number | Publication Date |
---|---|
US20180137880A1 true US20180137880A1 (en) | 2018-05-17 |
Family
ID=62107298
Family Applications (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/397,142 Active US10121011B2 (en) | 2016-11-16 | 2017-01-03 | Apparatus, method and article of manufacture for partially resisting hardware trojan induced data leakage in sequential logics |
US15/423,686 Active 2037-03-25 US10091092B2 (en) | 2016-11-16 | 2017-02-03 | Pseudorandom communications routing |
US15/434,164 Abandoned US20180137880A1 (en) | 2016-11-16 | 2017-02-16 | Phonation Style Detection |
Family Applications Before (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/397,142 Active US10121011B2 (en) | 2016-11-16 | 2017-01-03 | Apparatus, method and article of manufacture for partially resisting hardware trojan induced data leakage in sequential logics |
US15/423,686 Active 2037-03-25 US10091092B2 (en) | 2016-11-16 | 2017-02-03 | Pseudorandom communications routing |
Country Status (1)
Country | Link |
---|---|
US (3) | US10121011B2 (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3220304B1 (en) * | 2016-02-22 | 2018-11-07 | Eshard | Method of testing the resistance of a circuit to a side channel analysis |
US20180089426A1 (en) * | 2016-09-29 | 2018-03-29 | Government Of The United States As Represented By The Secretary Of The Air Force | System, method, and apparatus for resisting hardware trojan induced leakage in combinational logics |
CN109474641B (en) * | 2019-01-03 | 2020-05-12 | 清华大学 | Reconfigurable switch forwarding engine resolver capable of destroying hardware trojans |
TWI743692B (en) | 2020-02-27 | 2021-10-21 | 威鋒電子股份有限公司 | Hardware trojan immunity device and operation method thereof |
CN114692227B (en) * | 2022-03-29 | 2023-05-09 | 电子科技大学 | Large-scale chip netlist-level hardware Trojan detection method |
Family Cites Families (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4941143A (en) * | 1987-11-10 | 1990-07-10 | Echelon Systems Corp. | Protocol for network having a plurality of intelligent cells |
US5034882A (en) * | 1987-11-10 | 1991-07-23 | Echelon Corporation | Multiprocessor intelligent cell for a network which provides sensing, bidirectional communications and control |
US5113498A (en) * | 1987-11-10 | 1992-05-12 | Echelon Corporation | Input/output section for an intelligent cell which provides sensing, bidirectional communications and control |
US5784631A (en) * | 1992-06-30 | 1998-07-21 | Discovision Associates | Huffman decoder |
US7840803B2 (en) * | 2002-04-16 | 2010-11-23 | Massachusetts Institute Of Technology | Authentication of integrated circuits |
US8255443B2 (en) * | 2008-06-03 | 2012-08-28 | International Business Machines Corporation | Execution unit with inline pseudorandom number generator |
US20120063597A1 (en) * | 2010-09-15 | 2012-03-15 | Uponus Technologies, Llc. | Apparatus and associated methodology for managing content control keys |
EP2512061A1 (en) * | 2011-04-15 | 2012-10-17 | Hanscan IP B.V. | System for conducting remote biometric operations |
US9218511B2 (en) * | 2011-06-07 | 2015-12-22 | Verisiti, Inc. | Semiconductor device having features to prevent reverse engineering |
US9264892B2 (en) * | 2013-07-03 | 2016-02-16 | Verizon Patent And Licensing Inc. | Method and apparatus for attack resistant mesh networks |
CN104580027B (en) | 2013-10-25 | 2018-03-20 | 新华三技术有限公司 | A kind of OpenFlow message forwarding methods and equipment |
CN104579968B (en) | 2013-10-26 | 2018-03-09 | 华为技术有限公司 | SDN switch obtains accurate flow table item method and SDN switch, controller, system |
US20170118066A1 (en) | 2014-04-30 | 2017-04-27 | Hewlett-Packard Development Company, L.P. | Data plane to forward traffic based on communications from a software defined (sdn) controller during a control plane failure |
CN105099920A (en) | 2014-04-30 | 2015-11-25 | 杭州华三通信技术有限公司 | Method and device for setting SDN flow entry |
US9692689B2 (en) | 2014-08-27 | 2017-06-27 | International Business Machines Corporation | Reporting static flows to a switch controller in a software-defined network (SDN) |
JP5922203B2 (en) * | 2014-08-29 | 2016-05-24 | 株式会社日立製作所 | Semiconductor device |
US9917769B2 (en) | 2014-11-17 | 2018-03-13 | Telefonaktiebolaget Lm Ericsson (Publ) | Method and system for virtualizing flow tables in a software-defined networking (SDN) system |
US9794055B2 (en) * | 2016-03-17 | 2017-10-17 | Intel Corporation | Distribution of forwarded clock |
- 2017
- 2017-01-03 US US15/397,142 patent/US10121011B2/en active Active
- 2017-02-03 US US15/423,686 patent/US10091092B2/en active Active
- 2017-02-16 US US15/434,164 patent/US20180137880A1/en not_active Abandoned
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5862238A (en) * | 1995-09-11 | 1999-01-19 | Starkey Laboratories, Inc. | Hearing aid having input and output gain compression circuits |
US20160293172A1 (en) * | 2012-10-15 | 2016-10-06 | Digimarc Corporation | Multi-mode audio recognition and auxiliary data encoding and decoding |
Non-Patent Citations (4)
Title |
---|
Gaurang Parikh and Philipos C. Loizou, The influence of noise on vowel and consonant cues, Department of Electrical Engineering, University of Texas at Dallas, Richardson, Texas 75083-0688; 20 September 2005. URL: http://ecs.utdallas.edu/loizou/cimplants/jasa_noise_effect_dec05.pdf *
S. Nandhini and A. Shenbagavalli, Voiced/Unvoiced Detection using Short Term Processing, International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS-2014). URL: https://pdfs.semanticscholar.org/c20b/e6aaeaed889444e225e2bd18a6eb44e9cb36.pdf *
Signal Processing Stack Exchange, Energy calculation in frequency domain (asked and answered Jan. 27, 2015). URL: https://dsp.stackexchange.com/questions/20246/energy-calculation-in-frequency-domain *
Stanley J. Wenndt, Edward J. Cupples, and Richard M. Floyd, A Study on the Classification of Whispered and Normally Phonated Speech, 7th International Conference on Spoken Language Processing (ICSLP 2002). URL: https://pdfs.semanticscholar.org/05ec/a131bfc4b7106aa229c261b518894158d4fd.pdf *
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190311711A1 (en) * | 2018-04-10 | 2019-10-10 | Futurewei Technologies, Inc. | Method and device for processing whispered speech |
US10832660B2 (en) * | 2018-04-10 | 2020-11-10 | Futurewei Technologies, Inc. | Method and device for processing whispered speech |
US20190348021A1 (en) * | 2018-05-11 | 2019-11-14 | International Business Machines Corporation | Phonological clustering |
US10943580B2 (en) * | 2018-05-11 | 2021-03-09 | International Business Machines Corporation | Phonological clustering |
CN109067765A (en) * | 2018-08-30 | 2018-12-21 | 四川创客知佳科技有限公司 | Communication management method for Internet of Things security system |
CN109246209A (en) * | 2018-08-30 | 2019-01-18 | 广元量知汇科技有限公司 | Forestry Internet of Things secure communication management method |
CN110111776A (en) * | 2019-06-03 | 2019-08-09 | 清华大学 | Interactive voice based on microphone signal wakes up electronic equipment, method and medium |
US11848019B2 (en) | 2021-06-16 | 2023-12-19 | Hewlett-Packard Development Company, L.P. | Private speech filterings |
Also Published As
Publication number | Publication date |
---|---|
US10121011B2 (en) | 2018-11-06 |
US10091092B2 (en) | 2018-10-02 |
US20180139119A1 (en) | 2018-05-17 |
US20180137290A1 (en) | 2018-05-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20180137880A1 (en) | Phonation Style Detection | |
US11004461B2 (en) | Real-time vocal features extraction for automated emotional or mental state assessment | |
Hansen | Morphological constrained feature enhancement with adaptive cepstral compensation (MCE-ACC) for speech recognition in noise and Lombard effect | |
Ibrahim | Preprocessing technique in automatic speech recognition for human computer interaction: an overview | |
Mak et al. | A study of voice activity detection techniques for NIST speaker recognition evaluations | |
JP4568371B2 (en) | Computerized method and computer program for distinguishing between at least two event classes | |
Yegnanarayana et al. | Epoch-based analysis of speech signals | |
US20050171774A1 (en) | Features and techniques for speaker authentication | |
Chowdhury et al. | Bayesian on-line spectral change point detection: a soft computing approach for on-line ASR | |
EP2083417B1 (en) | Sound processing device and program | |
WO2011046474A2 (en) | Method for identifying a speaker based on random speech phonograms using formant equalization | |
Hansen et al. | Speech enhancement based on generalized minimum mean square error estimators and masking properties of the auditory system | |
Pohjalainen et al. | Shout detection in noise | |
Bhangale et al. | Synthetic speech spoofing detection using MFCC and radial basis function SVM | |
Nathwani et al. | Speech intelligibility improvement in car noise environment by voice transformation | |
Gallardo | Human and automatic speaker recognition over telecommunication channels | |
Garg et al. | A comparative study of noise reduction techniques for automatic speech recognition systems | |
KR20210000802A (en) | Artificial intelligence voice recognition processing method and system | |
Tzudir et al. | Analyzing RMFCC feature for dialect identification in Ao, an under-resourced language | |
JP2797861B2 (en) | Voice detection method and voice detection device | |
Galić et al. | Whispered speech recognition using hidden markov models and support vector machines | |
Bhukya et al. | End point detection using speech-specific knowledge for text-dependent speaker verification | |
Turan | Enhancement of throat microphone recordings using gaussian mixture model probabilistic estimator | |
Joseph et al. | Indian accent detection using dynamic time warping | |
Alam et al. | Neural response based phoneme classification under noisy condition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |