US20180137880A1 - Phonation Style Detection - Google Patents

Phonation Style Detection Download PDF

Info

Publication number
US20180137880A1
Authority
US
United States
Prior art keywords
speech
phonation
signal
energy
speech activity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US15/434,164
Inventor
Stanley J. Wenndt
Darren M. Haddad
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
US Air Force
Original Assignee
US Air Force
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by US Air Force filed Critical US Air Force
Priority to US15/434,164 priority Critical patent/US20180137880A1/en
Publication of US20180137880A1 publication Critical patent/US20180137880A1/en
Abandoned legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • G10L25/84Detection of presence or absence of voice signals for discriminating voice from noise
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31User authentication
    • G06F21/32User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/70Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer
    • G06F21/71Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer to assure secure computing or processing of information
    • G06F21/72Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer to assure secure computing or processing of information in cryptographic circuits
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/70Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer
    • G06F21/78Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer to assure secure storage of data
    • G06F21/79Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer to assure secure storage of data in semiconductor storage media, e.g. directly-addressable memories
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00Routing or path finding of packets in data switching networks
    • H04L45/02Topology update or discovery
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00Routing or path finding of packets in data switching networks
    • H04L45/22Alternate routing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/16Implementing security features at a particular protocol layer
    • H04L63/162Implementing security features at a particular protocol layer at the data link layer
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/002Countermeasures against attacks on cryptographic mechanisms
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/90Pitch determination of speech signals
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L2209/00Additional information or applications relating to cryptographic mechanisms or cryptographic arrangements for secret or secure communication H04L9/00
    • H04L2209/08Randomization, e.g. dummy operations or using noise
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L2209/00Additional information or applications relating to cryptographic mechanisms or cryptographic arrangements for secret or secure communication H04L9/00
    • H04L2209/12Details relating to cryptographic hardware or logic circuitry
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00Routing or path finding of packets in data switching networks
    • H04L45/54Organization of routing tables

Definitions

  • Phonation is the rapid, periodic opening and closing of the glottis through separation and apposition of the vocal folds that, accompanied by breath under lung pressure, constitutes a source of vocal sound.
  • Phonation style refers to different speaking styles which may include normal phonation, whispered speech phonation, low-level speech phonation, high-level speech phonation, and babble phonation. Babble phonation occurs when there is more than one speaker talking at the same time.
  • the current state of the art has considered noise-degraded speech applications where the goal is to extract the target speech or suppress the interfering speech, but it lacks a decision-making process for addressing different phonation styles.
  • the speech recognition algorithm tries to recognize whatever data it has been given. Using a probabilistic approach, the speech recognition algorithm tries to decipher the most likely sequence of spoken words.
  • the speech recognition algorithm can be successful even in noisy environments due to a priori knowledge about the potential sequence. For example, if the spoken words are, ‘Find the nearest liquor store’, the speech recognizer only needs to recognize ‘liquor’ and then it can guess what the rest of the words are or, at least, get close to the lexical content.
  • detecting babble speech, where more than one speaker is speaking, allows the speech to be withheld from a speech recognizer, since the output would be unreliable.
  • each phonation class may have specific features just for that phonation class.
  • a decision tree is used to take the next appropriate step.
  • if speech recognition is the desired next step and the pre-processor indicated whispered speech, then the next step would be to send the data to a speech recognizer that is trained on whispered speech.
  • the invention provides a method for detecting phonation style in dynamic communication environments and making software control decisions based on phonation styles enabling an audio message to be classified based on the phonation style such as, but not limited to: normal phonation, whispered phonation, softly spoken speech phonation, high-level phonation, babble phonation, and non-voice sounds.
  • the purpose of the invention is to introduce the phonation style as a way to control computer software.
  • a method for phonation style detection comprises detecting speech activity in a signal; extracting signal features from the detected speech activity; characterizing the extracted signal features; and performing a decision process on the characterized signal features which determines whether the detected speech activity is normally spoken speech, loudly spoken speech, softly spoken speech, whisper speech, babble, or non-voice sound.
  • FIG. 1 depicts the present invention's process for feature extraction and phonation style classification.
  • FIG. 2 depicts the present invention's decision process for detection and phonation classification based on extracted features.
  • FIG. 3 depicts a measured time domain and spectrogram for normally-phonated speech.
  • FIG. 4 depicts a measured time domain and spectrogram for low-level speech.
  • FIG. 5 depicts a measured time domain and spectrogram for high-level speech.
  • FIG. 6 depicts a measured time domain and spectrogram for whispered speech.
  • FIG. 7 depicts a measured time domain and spectrogram for babble speech.
  • the invention described herein allows an audio stream to be analyzed and classified based on the phonation style and then, depending on the application, a software control process is able to make appropriate control decisions.
  • the goal of spoken language is to communicate. Communication occurs when the intended recipient of the spoken message receives the intended message from the speaker. While communication can occur based on body language, facial expressions, written words, and the spoken message, this invention addresses phonation style detection by using only the spoken message.
  • FIG. 1 displays the invention's feature extraction and classifier approach to phonation style detection, while FIG. 2 provides the decision tree process through which the extracted features are analyzed.
  • the invention allows an audio stream to be classified based on the phonation style and then, as in FIG. 2 , depending on the application, a software control process is able to make appropriate control decisions.
  • the first step in classifying the signal is to decide whether or not there is speech activity 110 .
  • FIG. 2 shows how the invention delineates the phonation style once speech activity has been found.
  • the signal is processed in small blocks of data that are typically 10-30 milliseconds in duration.
  • Features such as energy, pitch extraction, autocorrelation, spectral tilt, and/or cepstral coefficients coupled with classifiers and learning algorithms such as support vector machines, neural networks, and Gaussian mixture models are typical approaches to detect speech activity. If speech activity is detected 110, and signal features are extracted 120, 130, 140, 150, 160 and characterized through a fusion and decision process 170 under software control 180, then a multi-step, multi-feature approach is used to determine the phonation style as depicted in FIG. 2.
  • the first decision point is to measure if there is harmonic information 210 in the signal.
  • a phoneme or a unit of a voice sound has a fundamental frequency associated with it, which is determined by how many times the vocal folds open and close in one second (given in units of Hz).
  • the harmonics are produced when the fundamental frequency resonates within a cavity, in the case of voice that cavity is the vocal tract. Harmonics are an integer multiple of the fundamental frequency.
  • the fundamental frequency is the first harmonic, the next octave up is the second harmonic and so on.
  • There are many methods to determine the harmonics of a voice sample; one method is found in [2].
  • an energy measure 220 is used to decide if the signal is either non-voice sounds 230 , softly spoken speech 250 , or whisper 260 .
  • w(m) is a weighting function such as a rectangular, Hamming, or triangular window.
  • the length of the window tends to be approximately 10-30 milliseconds due to the time-varying nature of speech signals. Additionally, the window length should encompass at least one pitch period, but not too many pitch periods.
  • the frequency domain energy calculation would be the same as the time domain energy calculation.
  • the bandwidth may be a reduced frequency range to avoid unreliable or noisy regions.
  • the energy can also be computed using autocorrelation values. The important concept is not how the energy is calculated, but how the energy levels are used. Higher energy coupled with harmonics would be indicative of normally spoken speech. Lower energy would be indicative of softly spoken or whispered speech.
  • the phonation decision would be labeled Non-Voice sounds 230 . If the signal is non-voice sounds, the next step may be to avoid these regions of the signal or to employ an additional step to classify the non-voice signal.
  • voicing occurs as the vocal folds open and close due to air pressure building up behind the vocal folds and then being released [5].
  • The rate of this vibration of the vocal folds opening and closing is referred to as the fundamental frequency (or, in the time domain, the pitch period) of the speech signal.
  • the bursts of air released as the vocal folds open and close then become an excitation source for the vocal cavity.
  • Vowels are examples of sustained voicing where the pitch period is quasi-periodic. For males, the vocal folds open and close at about 110 cycles per second. For females, it is about 250 cycles per second. Unvoiced sounds occur in speech where the vocal folds are not vibrating.
  • For unvoiced speech, such as a fricative /s/ or /f/, the vocal folds are not vibrating and a constriction in the vocal tract gives the speech a noise-like, turbulent quality.
  • Various combinations of these two main voicing states (voiced and unvoiced) allow other voicing states to be reached, such as voiced fricatives (/z/) or whispered speech where no vocal fold vibration occurs, even in the vowels.
  • If there is voicing 240, albeit at low energy levels 220, then the signal would be labeled as Softly Spoken speech 250. Decision processes for softly spoken speech may be to amplify the signal in order to make it more audible. If there is no voicing 240 with low energy levels 220, then the signal would be labeled as Whisper speech 260.
  • Detecting whisper speech may have many applications for hearing impaired people.
  • An application may be to amplify the low energy sounds or to convert the whispered speech to normally phonated speech [6].
  • the energy in whispered speech is at higher frequencies compared to normal phonation.
  • the combination of lower energy concentrated at higher frequencies makes whispered speech very troublesome for people with hearing loss. Detecting and converting whispered speech to normal speech could be very beneficial for the hearing impaired.
  • a mixed excitation measure 270 could be used to decide if the signal is either babble 280 , loudly spoken speech 300 , or normally phonated speech 310 .
  • Mixed excitations are detected 270 when the pitch is changing rapidly, such as at a transition between phonemes, or when two or more speakers are talking at the same time. For single-talker mixed excitations, these regions will be short, with a duration of about 10-20 milliseconds. For multi-talker mixed excitation, these regions will be longer and occur when one speaker's speech is corrupted by another speaker's speech.
  • U.S. Pat. No. 7,177,808 addresses how to improve speaker identification when multiple speakers are present by finding the “usable” (single-talker excitation) speech and processing only the usable speech.
  • there are many ways to estimate the usable speech, using techniques such as kurtosis, the linear predictive residual, autocorrelation, and the wavelet transform, to name a few [7], [8], [9].
  • the phonation decision would be to label the signal as Babble (multiple speakers) sounds 280 [10]. Once again, what matters is not how mixed excitations are estimated, but knowing whether one talker or multiple talkers are present. Applications for Babble speech may be to avoid that region, label it as unreliable, or try to separate the speakers. If there is no mixed excitation detected 270, then one talker is present and the next step is to look for clipping 290.
  • Clipping is a form of waveform distortion that occurs when the speaker speaks loudly enough to overdrive the audio amplifier; the voltage or current that represents the speech exceeds the amplifier's maximum capability. This typically occurs when the speaker is shouting. If too much clipping 290 occurs, the signal is labeled as Loudly Spoken 300. Applications for detecting Loudly Spoken speech may be to attenuate the signal, detect emotions, or mitigate the effects of clipping [11]. If there is little or no clipping, the signal is labeled as Normal Spoken 310 speech. Typical applications for Normal Spoken speech include speech recognition, speaker identification, and language identification.
  • the speech is characterized by sustained energy and vowels.
  • the graph clearly shows the onsets/offsets of speech and the energy has a larger standard deviation due to the distinct classes of silence, unvoiced speech, and voiced speech.
  • for voiced speech, there is strong harmonicity, which can be measured by a pitch estimator.
  • Features for normally phonated speech may include total energy, standard deviation of the energy, spectral tilt, envelope shape and harmonicity measures.
  • the energy levels are lower for the voiced and unvoiced regions. There still is a glottal pulse, but the various regions of silence, voiced, and unvoiced are not as distinct, especially if there is some background noise present. The dynamic range of the energy levels is reduced. The voiced speech regions may be shorter and the unvoiced regions may be longer.
  • Features for low-level speech phonation may include the same features as normally phonated speech (total energy, standard deviation of the energy, spectral tilt, envelope shape and harmonicity measures), but the typical values for these measurements will be different.
  • the energy levels are higher for the voiced and unvoiced regions.
  • the glottal pulse is very strong due to the strong airflow causing the vocal folds to abduct and adduct.
  • the voiced speech regions may be longer and the unvoiced regions may be shorter.
  • Features for high-level speech phonation may include the same features as normally phonated speech (total energy, standard deviation of the energy, spectral tilt, envelope shape and harmonicity measures), but the typical values for these measurements will be different. Clipping may be an additional feature to detect high-level speech phonation.
  • the speech is characterized by lower volume levels (sound pressure) and the lack of a glottal excitation.
  • Whisper speech occurs when there are no vocal fold vibrations and the speech is characterized by a noise-like or hissing quality. There will be few, if any, voiced speech regions.
  • the same features of total energy, standard deviation of the energy, spectral tilt, envelope shape and harmonicity measures can be used to detect whispered-speech phonation. However, there will be less energy in the lower frequency regions compared to normally phonated speech.
  • babble-speech phonation occurs when there is more than one speaker talking at the same time. This leads to fewer silence regions and mixed excitation where unvoiced and voiced speech overlap. There will be fewer unvoiced regions. The same features apply, but the reduced amount of silence and unvoiced regions provides an extra clue about the phonation style.
  • other phonation styles may exist, such as loud whispering for dramatic purposes, but the use of the same feature set, along with a unique feature for loud whispering, would still allow for successful detection of the new phonation style.
  • the phonation style detection is not geared towards any one speech processing application but provides a decision point for how to proceed.
  • the pre-processing step to phonation style detection is to extract the speech signal, including but not limited to extraction from a microphone or an interception of a transmission containing speech.
  • the speech is then analyzed by a speech activity detector (SAD) 110 as shown in FIG. 1 .
  • the SAD analyzes a segment of audio data and determines if the signal is speech, silence, or background noise. If the SAD 110 detects speech, the feature extraction section 120 , 130 , 140 , 150 , 160 follows.
  • the feature extraction section is used to classify the phonation of an individual's speech.
  • the features include but are not limited to a harmonic measurement, signal energy measurement, voice activity detector, time domain measurements, and frequency domain measurements. Some systems may utilize more features, while others may utilize fewer.
  • the information fusion/decision 170 is an algorithm under software control 180 that can be performed by any type of statistical classifier to detect the type of phonation.
  • This statistical classifier can be any type of classifier, such as, but not limited to, i-vectors, Gaussian mixture models, support vector machines, and/or neural networks.
  • the information fusion and decision 170 will output the type of phonation, such as, but not limited to, normal phonation, whispered speech phonation, low-level speech phonation, high-level speech phonation, and babble phonation.
  • the output can feed a number of software applications, such as, but not limited to, speech recognition systems, speaker identification systems, language identification systems, and/or video gaming applications.
  • the phonation style becomes part of the game, where how a person speaks is as important as what is being said and how the controller is being used.
  • phonation style detection can be used in hearing aids to detect whisper speech and convert high-frequency, non-phonated information to normally phonated speech in a lower frequency region.
  • Phonation style detection can also be used for voice coaching to give the subject feedback as to his/her pronunciation and style of pronunciation.
  • Uses of the present invention include but are not limited to the following:

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Signal Processing (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Mobile Radio Communication Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Storage Device Security (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The invention provides a method for detecting phonation style in dynamic communication environments and making software control decisions based on phonation styles enabling an audio message to be classified based on the phonation style such as, but not limited to: normal phonation, whispered phonation, softly spoken speech phonation, high-level phonation, babble phonation, and non-voice sounds. The purpose of the invention is to introduce the phonation style as a way to control computer software.

Description

    STATEMENT OF GOVERNMENT INTEREST
  • The invention described herein may be manufactured and used by or for the Government for governmental purposes without the payment of any royalty thereon.
  • BACKGROUND OF THE INVENTION
  • Phonation is the rapid, periodic opening and closing of the glottis through separation and apposition of the vocal folds that, accompanied by breath under lung pressure, constitutes a source of vocal sound. Technology exists that detects sound created through phonation and attempts to understand and decode the sounds. Examples include Siri, automated phone menus, and Shazam. Limitations of the current technologies mean that they only work well when the speaker is using a normal speaking or phonation style (not loud, babble, whisper, or pitch variations), and they assume that the speaker wants to be heard and understood.
  • Phonation style refers to different speaking styles, which may include normal phonation, whispered speech phonation, low-level speech phonation, high-level speech phonation, and babble phonation. Babble phonation occurs when there is more than one speaker talking at the same time. The current state of the art has considered noise-degraded speech applications where the goal is to extract the target speech or suppress the interfering speech, but it lacks a decision-making process for addressing different phonation styles. For most speech recognition processes, the speech recognition algorithm tries to recognize whatever data it has been given. Using a probabilistic approach, the speech recognition algorithm tries to decipher the most likely sequence of spoken words. For constrained environments where the vocabulary is limited, the speech recognition algorithm can be successful even in noisy environments due to a priori knowledge about the potential sequence. For example, if the spoken words are, ‘Find the nearest liquor store’, the speech recognizer only needs to recognize ‘liquor’ and then it can guess what the rest of the words are or, at least, get close to the lexical content. The detection of babble speech (where more than one speaker is speaking) allows the speech to be withheld from a speech recognizer, since the output would be unreliable.
  • Current methods also experience issues in situations of degraded speech, which can occur in almost any communication setting. Typically, speech degradation is assumed to be due to environmental noise or communication channel artifacts. However, speech degradation can also occur due to changes in phonation style. A speech processing algorithm that is trained using normally phonated speech but is given whispered or high-volume phonation style speech will quickly degrade and create nonsensical outputs. Instead of assuming a phonation style, a pre-processing technique is used to classify the speech as being normally phonated, whispered, low-level volume, high-level volume, or babble speech. Features are extracted and analyzed so as to make a decision. Many of the same features, such as spectral tilt, energy, and envelope shape, are standard for each phonation class. However, each phonation class may have specific features just for that phonation class. Based on the outcome of the pre-processing, a decision tree is used to take the next appropriate step. When speech recognition is the desired next step and the pre-processor indicates whispered speech, the data would be sent to a speech recognizer that is trained on whispered speech.
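  • For illustration only, the routing step just described might be sketched as follows; the style labels, model names, and the classify_phonation callable are hypothetical placeholders, not elements of the claimed method:

```python
# Minimal sketch of the routing decision described above (hypothetical names throughout).

RECOGNIZERS = {
    "normal":    "asr_model_normal",     # recognizer trained on normally phonated speech
    "whisper":   "asr_model_whisper",    # recognizer trained on whispered speech
    "low_level": "asr_model_low_level",  # recognizer trained on softly spoken speech
}

def route_segment(segment, classify_phonation):
    """Classify the phonation style of an audio segment and choose the next step."""
    style = classify_phonation(segment)   # e.g. "normal", "whisper", "babble", ...
    if style == "babble":
        return None                        # recognition output would be unreliable; skip it
    return RECOGNIZERS.get(style, RECOGNIZERS["normal"])
```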
  • In an unconstrained, dynamically changing environment, speech recognizers have not succeeded in being able to accurately recognize the spoken dialogue. This process is complicated by multiple speakers, noisy environments, and unstructured lexical information. The current state of the art lacks an approach for uniquely classifying the various phonation styles; it also does not address how to make appropriate follow-on decisions. The current technology also assumes that there is only one speaker and that the speaker desires to be understood. When there are multiple speakers or the speaker wishes to obfuscate his/her communication, the output of speech recognizers quickly degrades into nonsensical lexical information.
  • OBJECTS AND SUMMARY OF THE INVENTION
  • It is therefore an object of the invention to classify speech into several possible phonation styles.
  • It is a further object of the invention to extract from speech signals characteristics which lead to the classification of several phonation styles.
  • It is yet a further object of the present invention to make a determination as to which of several phonation styles is present based on the detection of the presence or absence of extracted speech characteristics.
  • Briefly stated, the invention provides a method for detecting phonation style in dynamic communication environments and making software control decisions based on phonation styles enabling an audio message to be classified based on the phonation style such as, but not limited to: normal phonation, whispered phonation, softly spoken speech phonation, high-level phonation, babble phonation, and non-voice sounds. The purpose of the invention is to introduce the phonation style as a way to control computer software.
  • In an embodiment of the invention, a method for phonation style detection, comprises detecting speech activity in a signal; extracting signal features from the detected speech activity; characterizing the extracted signal features; and performing a decision process on the characterized signal features which determines whether the detected speech activity is normally spoken speech, loudly spoken speech, softly spoken speech, whisper speech, babble, or non-voice sound.
  • The above, and other objects, features and advantages of the invention will become apparent from the following description read in conjunction with the accompanying drawings, in which like reference numerals designate the same elements.
  • REFERENCES
    • [1] Stanley J. Wenndt, Edward J. Cupples; “Method and apparatus for detecting illicit activity by classifying whispered speech and normally phonated speech according to the relative energy content of formants and fricatives”; U.S. Pat. No. 7,577,564; Issued: Aug. 18, 2009
    • [2] Darren M. Haddad, Andrew J. Noga; “Generalized Harmonicity Indicator”; U.S. Pat. No. 7,613,579; Issued: Nov. 3, 2009
    • [3] Deller, Jr., J. R., Proakis, J. G., Hansen, J. H. L. (1993), “Discrete-Time Processing of Speech Signals,” New York, N.Y.: Macmillan Publishing Company, pp. 234-240.
    • [4] Deller, Jr., J. R., Proakis, J. G., Hansen, J. H. L. (1993), “Discrete-Time Processing of Speech Signals,” New York, N.Y.: Macmillan Publishing Company, pp. 110-115.
    • [5] Peter J. Watson, Angela H. Ciccia, Gary Weismer; “The relation of lung volume initiation to selected acoustic properties of speech”; Journal Acoustical Society of America 113 (5), May 2003; pages 2812-2819.
    • [6] Zhi Tao; Xue-Dan Tan; Tao Han; Ji-Hua Gu; Yi-Shen Xu; He-Ming Zhao (2010), “Reconstruction of Normal Speech from Whispered Speech Based on RBF Neural Network,” 2010 Third International Symposium on Intelligent Information Technology and Security Informatics.
    • [7] Krishnamachari, K. R.; Yantorno, R. E.; Lovekin, J. M.; Benincasa, D. S.; Wenndt, S. J., “Use of local kurtosis measure for spotting usable speech segments in co-channel speech,”, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings.
    • [8] Kizhanatham, A. R.; Yantorno, R. E.; Smolenski, B. Y., “Peak difference of autocorrelation of wavelet transform (PDAWT) algorithm based usable speech measure,” 2003 7th World Multiconference on Systemics, Cybernetics and Informatics Proceedings
    • [9] Yantorno, R. E.; Smolenski, B. Y.; Chandra, N.; “Usable speech measures and their fusion,” Proceedings of the 2003 International Symposium on Circuits and Systems.
    • [10] Krishnamurthy, N.; Hansen, J. H. L.; “Speech babble: Analysis and modeling for speech systems”; IEEE International Conference on Acoustics, Speech and Signal Processing, 2008, ICASSP 2008; pages 4505-4508
    • [11] Hayakawa, Makoto; Fukumori, Takahiro; Nakayama, Masato; Nishiura, Takanobu, “Suppression of clipping noise in observed speech based on spectral compensation with Gaussian mixture models and reference of clean speech,” Proceedings of Meetings on Acoustics-ICA 2013.
    BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 depicts the present invention's process for feature extraction and phonation style classification.
  • FIG. 2 depicts the present invention's decision process for detection and phonation classification based on extracted features.
  • FIG. 3 depicts a measured time domain and spectrogram for normally-phonated speech.
  • FIG. 4 depicts a measured time domain and spectrogram for low-level speech.
  • FIG. 5 depicts a measured time domain and spectrogram for high-level speech.
  • FIG. 6 depicts a measured time domain and spectrogram for whispered speech.
  • FIG. 7 depicts a measured time domain and spectrogram for babble speech.
  • DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
  • The invention described herein allows an audio stream to be analyzed and classified based on the phonation style and then, depending on the application, a software control process is able to make appropriate control decisions. The goal of spoken language is to communicate. Communication occurs when the intended recipient of the spoken message receives the intended message from the speaker. While communication can occur based on body language, facial expressions, written words, and the spoken message, this invention addresses phonation style detection by using only the spoken message.
  • For typical speech applications, normal phonation is assumed and the audio applications attempt to process all the incoming audio. Some audio applications try to detect aberrations to this assumption, for example, if the energy level drops significantly. In this case, the algorithm may label it as whispered speech. This invention is unique in that no assumptions are made as to the type of phonation style. Instead, a multi-feature set is used to classify the phonation style.
  • People communicate in different phonation styles depending on the communication setting and lexical intent. If a person wishes to conceal the spoken message, then whispering may be employed. Whisper speech occurs when there is no vocal fold vibration and the speech is characterized by a noise-like or hissing quality. For low-level speech, vocal vibrations occur, but the sound level is reduced compared to normally phonated speech. If a person wishes to emphasize a point or needs to speak so as to be heard above a noisy environment, then a high-level phonation style may be employed. Additionally, babble phonation may exist where more than one speaker is talking simultaneously. Other types of phonation are possible depending on the environment. For example, loud whispering may be used for dramatic purposes or emphasis of an opinion. This invention has applicability to any speech processing application.
  • FIG. 1 displays the invention's feature extraction and classifier approach to phonation style detection, while FIG. 2 provides the decision tree process through which the extracted features are analyzed. As in FIG. 1, the invention allows an audio stream to be classified based on the phonation style and then, as in FIG. 2, depending on the application, a software control process is able to make appropriate control decisions.
  • For typical speech applications, normal phonation is assumed and the audio applications attempt to process all the incoming audio. Some audio applications try to detect aberrations to this assumption, for example, if the energy level drops significantly. In this case, the algorithm may label it as whispered speech. Earlier approaches, such as [1], only assumed two phonation states: normally phonated speech or whispered speech. This invention is unique in that no assumptions are made as to the type of phonation style. Instead, a multi-feature set is used to classify the phonation style (see FIG. 2).
  • Referring to FIG. 1, the first step in classifying the signal is to decide whether or not there is speech activity 110. FIG. 2 shows how the invention delineates the phonation style once speech activity has been found.
  • Referring back to FIG. 1, the signal is processed in small blocks of data that are typically 10-30 milliseconds in duration. Features such as energy, pitch extraction, autocorrelation, spectral tilt, and/or cepstral coefficients coupled with classifiers and learning algorithms such as support vector machines, neural networks, and Gaussian mixture models are typical approaches to detect speech activity. If speech activity is detected 110, and signal features are extracted 120, 130, 140, 150, 160 and characterized through a fusion and decision process 170 under software control 180, then a multi-step, multi-feature approach is used to determine the phonation style as depicted in FIG. 2.
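  • As an illustration of this block processing, a minimal sketch is shown below; the frame length, hop size, and energy threshold are assumed values for the example, not parameters specified by the invention:

```python
import numpy as np

def frame_signal(x, fs, frame_ms=25, hop_ms=10):
    """Split signal x (sampled at fs Hz) into short, overlapping analysis blocks."""
    frame_len = int(fs * frame_ms / 1000)
    hop_len = int(fs * hop_ms / 1000)
    if len(x) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(x) - frame_len) // hop_len
    return np.stack([x[i * hop_len:i * hop_len + frame_len] for i in range(n_frames)])

def simple_speech_activity(frames, threshold_db=-40.0):
    """Crude per-frame speech/no-speech decision from short-time energy (illustration only)."""
    energy = np.sum(np.asarray(frames, dtype=float) ** 2, axis=1)
    energy_db = 10.0 * np.log10(energy + 1e-12)
    # A frame is called "speech" if it is within threshold_db of the loudest frame.
    return energy_db > (energy_db.max() + threshold_db)
```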
  • Referring again to FIG. 2, the first decision point is to measure if there is harmonic information 210 in the signal. A phoneme, or a unit of a voice sound, has a fundamental frequency associated with it, which is determined by how many times the vocal folds open and close in one second (given in units of Hz). The harmonics are produced when the fundamental frequency resonates within a cavity; in the case of voice, that cavity is the vocal tract. Harmonics are integer multiples of the fundamental frequency. The fundamental frequency is the first harmonic, the next octave up is the second harmonic, and so on. There are many methods to determine the harmonics of a voice sample; one method is found in [2].
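  • Reference [2] describes one harmonicity indicator; the sketch below uses a simpler, commonly used alternative based on the normalized autocorrelation peak in a plausible pitch range. The 50-400 Hz search band is an assumption for illustration, not a value from the invention:

```python
import numpy as np

def harmonicity(frame, fs, f_lo=50.0, f_hi=400.0):
    """Normalized autocorrelation peak in the plausible pitch-lag range.

    Values near 1 indicate a strongly periodic (harmonic) frame; values near 0
    indicate a noise-like frame with little harmonic structure.
    """
    frame = np.asarray(frame, dtype=float) - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0.0:
        return 0.0
    ac = ac / ac[0]                               # lag-0 autocorrelation normalized to 1
    lag_lo = int(fs / f_hi)                       # shortest lag = highest pitch considered
    lag_hi = min(int(fs / f_lo), len(ac) - 1)     # longest lag = lowest pitch considered
    if lag_lo >= lag_hi:
        return 0.0                                # frame too short for this pitch range
    return float(np.max(ac[lag_lo:lag_hi + 1]))
```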
  • If insufficient harmonics are found in the decision block 210, then an energy measure 220 is used to decide if the signal is either non-voice sounds 230, softly spoken speech 250, or whisper 260. There are many ways to compute the energy of a signal. The easiest way to compute energy is in the time domain by summing up the squared values of the signal: E = Σ_{n=−∞}^{+∞} s²(n) [3]. Since speech is time-varying and may change very rapidly, a windowing function is applied: E(n) = Σ_{m=0}^{N−1} [w(m) s(n−m)]²
  • where w(m) is a weighting function such as a rectangular, Hamming, or triangular window. The length of the window tends to be approximately 10-30 milliseconds due to the time-varying nature of speech signals. Additionally, the window length should encompass at least one pitch period, but not too many pitch periods. There are many variations of computing the energy, such as using the absolute value instead of the squared value, the log of the energy, or a restricted frequency range. In the frequency domain, E = Σ_{f=f1}^{f2} |X[f]|² [4], where X[f] is the Fourier transform of the signal. The bandwidth can be the full bandwidth between f1 = 0 and f2 = fs/2, where fs is the sampling rate (fs/2 being the Nyquist frequency). In this case, the frequency domain energy calculation would be the same as the time domain energy calculation. Or, the bandwidth may be a reduced frequency range to avoid unreliable or noisy regions. The energy can also be computed using autocorrelation values. The important concept is not how the energy is calculated, but how the energy levels are used. Higher energy coupled with harmonics would be indicative of normally spoken speech. Lower energy would be indicative of softly spoken or whispered speech.
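  • A short sketch of the two energy computations above follows; the window choice and band edges are free parameters, and the specific defaults shown are assumptions for the example:

```python
import numpy as np

def time_domain_energy(frame, window="hamming"):
    """Windowed short-time energy: E(n) = sum_m [w(m) s(n-m)]^2 for one analysis frame."""
    w = {"rectangular": np.ones(len(frame)),
         "hamming": np.hamming(len(frame)),
         "triangular": np.bartlett(len(frame))}[window]
    return float(np.sum((w * np.asarray(frame, dtype=float)) ** 2))

def band_energy(frame, fs, f1=0.0, f2=None):
    """Frequency-domain energy over the band [f1, f2] in Hz.

    With f1 = 0 and f2 = fs/2 (the full band), Parseval's relation makes this
    equal to the unwindowed time-domain energy of the frame.
    """
    frame = np.asarray(frame, dtype=float)
    if f2 is None:
        f2 = fs / 2.0
    spec = np.fft.fft(frame)
    freqs = np.abs(np.fft.fftfreq(len(frame), d=1.0 / fs))  # |f|, so both spectral halves count
    band = (freqs >= f1) & (freqs <= f2)
    return float(np.sum(np.abs(spec[band]) ** 2) / len(frame))
```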
  • If the signal fails the harmonics test 210, but has higher energy 220, the phonation decision would be labeled Non-Voice sounds 230. If the signal is non-voice sounds, the next step may be to avoid these regions of the signal or to employ an additional step to classify the non-voice signal.
  • If the signal fails the harmonics test 210, but has lower energy 220, the next step would be to test for voicing 240. Voicing occurs as the vocal folds open and close due to air pressure building up behind the vocal folds and then being released [5]. The rate of this vibration of the vocal folds opening and closing is referred to as the fundamental frequency (or, in the time domain, the pitch period) of the speech signal. The bursts of air released as the vocal folds open and close then become an excitation source for the vocal cavity. Vowels are examples of sustained voicing where the pitch period is quasi-periodic. For males, the vocal folds open and close at about 110 cycles per second. For females, it is about 250 cycles per second. Unvoiced sounds occur in speech where the vocal folds are not vibrating. For unvoiced speech, such as a fricative /s/ or /f/, the vocal folds are not vibrating and a constriction in the vocal tract gives the speech a noise-like, turbulent quality. Various combinations of these two main voicing states (voiced and unvoiced) allow other voicing states to be reached, such as voiced fricatives (/z/) or whispered speech where no vocal fold vibration occurs, even in the vowels.
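  • One simple voicing cue among many combines the zero-crossing rate with the balance of low- versus high-frequency energy, since voiced frames tend to have few zero crossings and energy concentrated at low frequencies. The thresholds in this sketch are illustrative assumptions, not values from the invention:

```python
import numpy as np

def looks_voiced(frame, fs, zcr_max=0.15, low_band_hz=1000.0, low_ratio_min=0.6):
    """Heuristic voicing test: few zero crossings plus low-frequency energy dominance."""
    frame = np.asarray(frame, dtype=float) - np.mean(frame)
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0       # approx. fraction of sign changes
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    low_ratio = spec[freqs <= low_band_hz].sum() / (spec.sum() + 1e-12)
    return bool(zcr < zcr_max and low_ratio > low_ratio_min)
```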
  • If there is voicing 240, albeit at low energy levels 220, then the signal would be labeled as Softly Spoken speech 250. Decision processes for softly spoken speech may be to amplify the signal in order to make it more audible. If there is no voicing 240 with low energy levels 220, then the signal would be labeled as Whisper speech 260.
  • Detecting whisper speech may have many applications for hearing impaired people. An application may be to amplify the low energy sounds or to convert the whispered speech to normally phonated speech [6]. In addition to having less energy overall, whispered speech has its energy concentrated at higher frequencies compared to normal phonation. The combination of lower energy at higher frequencies makes whispered speech very troublesome for people with hearing loss. Detecting and converting whispered speech to normal speech could be very beneficial for the hearing impaired.
  • If harmonics are found in the decision block 210, then a mixed excitation measure 270 could be used to decide if the signal is either babble 280, loudly spoken speech 300, or normally phonated speech 310. Mixed excitations are detected 270 when the pitch is changing rapidly, such as at a transition between phonemes, or when two or more speakers are talking at the same time. For single-talker mixed excitations, these regions will be short, with a duration of about 10-20 milliseconds. For multi-talker mixed excitation, these regions will be longer and occur when one speaker's speech is corrupted by another speaker's speech. U.S. Pat. No. 7,177,808 addresses how to improve speaker identification when multiple speakers are present by finding the “usable” (single-talker excitation) speech and processing only the usable speech. As with the energy measurement, there are many ways to estimate the usable speech, using techniques such as kurtosis, the linear predictive residual, autocorrelation, and the wavelet transform, to name a few [7], [8], [9].
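  • As one example of such a measure, reference [7] uses a local kurtosis statistic: single-talker speech tends to have a more peaked (higher-kurtosis) amplitude distribution than overlapped speech. A minimal sketch of that idea follows; the decision threshold is an illustrative assumption, not a value taken from the invention or from [7]:

```python
import numpy as np

def excess_kurtosis(frame):
    """Excess kurtosis of the frame's amplitude distribution (0 for a Gaussian)."""
    x = np.asarray(frame, dtype=float)
    x = x - np.mean(x)
    var = np.mean(x ** 2)
    if var == 0.0:
        return 0.0
    return float(np.mean(x ** 4) / var ** 2 - 3.0)

def looks_single_talker(frame, kurtosis_min=2.0):
    """Heuristic: a clearly super-Gaussian frame is treated as single-talker (usable) speech."""
    return excess_kurtosis(frame) > kurtosis_min
```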
  • If harmonics are detected 210 in the signal, but the signal has mixed excitations 270, the phonation decision would be to label the signal as Babble (multiple speakers) sounds 280 [10]. Once again, what matters is not how mixed excitations are estimated, but knowing whether one talker or multiple talkers are present. Applications for Babble speech may be to avoid that region, label it as unreliable, or try to separate the speakers. If there is no mixed excitation detected 270, then one talker is present and the next step is to look for clipping 290.
  • Clipping is a form of waveform distortion that occurs when the speaker speaks loudly enough to overdrive the audio amplifier; the voltage or current that represents the speech exceeds the amplifier's maximum capability. This typically occurs when the speaker is shouting. If too much clipping 290 occurs, the signal is labeled as Loudly Spoken 300. Applications for detecting Loudly Spoken speech may be to attenuate the signal, detect emotions, or mitigate the effects of clipping [11]. If there is little or no clipping, the signal is labeled as Normal Spoken 310 speech. Typical applications for Normal Spoken speech include speech recognition, speaker identification, and language identification.
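  • Putting the decision points of FIG. 2 together, a compact sketch of the tree might look like the following; the flat-top clipping test and all thresholds are illustrative assumptions, and the individual feature tests are supplied by the caller rather than prescribed here:

```python
import numpy as np

def clipping_fraction(frame, full_scale=1.0, ceiling=0.99):
    """Fraction of samples at or near full scale, a sign of a flat-topped (clipped) waveform.

    Assumes the signal is normalized so that the converter/amplifier limit is full_scale.
    """
    return float(np.mean(np.abs(np.asarray(frame, dtype=float)) >= ceiling * full_scale))

def classify_phonation(frame, harmonic, energy_db, voiced, mixed_excitation):
    """Decision tree of FIG. 2 (labels 210-310); feature tests are passed in as booleans/values."""
    LOW_ENERGY_DB = -30.0          # illustrative threshold only
    MAX_CLIP_FRACTION = 0.05       # illustrative threshold only
    if not harmonic:                                       # decision block 210
        if energy_db > LOW_ENERGY_DB:                      # decision block 220
            return "non-voice"                             # 230
        return "softly spoken" if voiced else "whisper"    # 240 -> 250 / 260
    if mixed_excitation:                                   # decision block 270
        return "babble"                                    # 280
    if clipping_fraction(frame) > MAX_CLIP_FRACTION:       # decision block 290
        return "loudly spoken"                             # 300
    return "normal"                                        # 310
```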
  • Referring to FIG. 3, for normally-phonated speech the speech is characterized by sustained energy and vowels. The graph clearly shows the onsets/offsets of speech and the energy has a larger standard deviation due to the distinct classes of silence, unvoiced speech, and voiced speech. For the voiced speech, there is strong harmonicity which can be measured by a pitch estimator. Features for normally phonated speech may include total energy, standard deviation of the energy, spectral tilt, envelope shape and harmonicity measures.
  • Referring to FIG. 4, for low-level speech phonation the energy levels are lower for the voiced and unvoiced regions. There still is a glottal pulse, but the various regions of silence, voiced, and unvoiced are not as distinct, especially if there is some background noise present. The dynamic range of the energy levels is reduced. The voiced speech regions may be shorter and the unvoiced regions may be longer. Features for low-level speech phonation may include the same features as normally phonated speech (total energy, standard deviation of the energy, spectral tilt, envelope shape and harmonicity measures), but the typical values for these measurements will be different.
  • Referring to FIG. 5, for high-level speech phonation the energy levels are higher for the voiced and unvoiced regions. The glottal pulse is very strong due to the strong airflow causing the vocal folds to abduct and adduct. The voiced speech regions may be longer and the unvoiced regions may be shorter. Features for high-level speech phonation may include the same features as normally phonated speech (total energy, standard deviation of the energy, spectral tilt, envelope shape and harmonicity measures), but the typical values for these measurements will be different. Clipping may be an additional feature to detect high-level speech phonation.
  • Referring to FIG. 6, for whispered-speech phonation the speech is characterized by lower volume levels (sound pressure) and the lack of a glottal excitation. Whisper speech occurs when there are no vocal fold vibrations and the speech is characterized by a noise-like or hissing quality. There will be few, if any, voiced speech regions. The same features of total energy, standard deviation of the energy, spectral tilt, envelope shape and harmonicity measures can be used to detect whispered-speech phonation. However, there will be less energy in the lower frequency regions compared to normally phonated speech.
  • Referring to FIG. 7, babble-speech phonation occurs when there is more than one speaker talking at the same time. This leads to fewer silence regions and mixed excitation where unvoiced and voiced speech overlap. There will be fewer unvoiced regions. The same features apply, but the reduced amount of silence and unvoiced regions provides an extra clue about the phonation style.
  • Other phonation styles may exist, such as loud whispering for dramatic purposes, but the use of the same feature set, along with a unique feature for loud whispering, would still allow for successful detection of the new phonation style. The phonation style detection is not geared towards any one speech processing application but provides a decision point for how to proceed.
  • Referring to FIG. 1, the pre-processing step to phonation style detection is to extract the speech signal, including but not limited to extraction from a microphone or an interception of a transmission containing speech. The speech is then analyzed by a speech activity detector (SAD) 110 as shown in FIG. 1. The SAD analyzes a segment of audio data and determines if the signal is speech, silence, or background noise. If the SAD 110 detects speech, the feature extraction section 120, 130, 140, 150, 160 follows. The feature extraction section is used to classify the phonation of an individual's speech. The features include but are not limited to a harmonic measurement, signal energy measurement, voice activity detector, time domain measurements, and frequency domain measurements. Some systems may utilize more features, while others may utilize fewer. Once the features are extracted from the speech, the information from the features is fused 170.
  • The information fusion/decision 170 is an algorithm under software control 180 that can be performed by any type of statistical classifier to detect the type of phonation. This statistical classifier can be of any type, such as, but not limited to, i-vectors, Gaussian mixture models, support vector machines, and/or neural networks.
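  • As a hedged illustration of this fusion/decision step, per-frame measurements could be pooled into a fixed-length vector and fed to any standard classifier. The sketch below uses a support vector machine from scikit-learn; the pooled feature set is a simplification of the richer features described above, and the labels and API usage are assumptions for the example:

```python
import numpy as np
from sklearn.svm import SVC

def pooled_features(frames):
    """Pool simple per-frame measurements into one fixed-length vector per utterance."""
    frames = np.asarray(frames, dtype=float)
    log_e = np.log(np.sum(frames ** 2, axis=1) + 1e-12)                    # per-frame log energy
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)), axis=1) / 2.0  # per-frame zero-crossing rate
    return np.array([log_e.mean(), log_e.std(), zcr.mean(), zcr.std()])

def train_phonation_classifier(segments, labels):
    """segments: list of framed utterances; labels: styles such as 'normal', 'whisper', 'babble'."""
    X = np.array([pooled_features(frames) for frames in segments])
    clf = SVC(kernel="rbf", probability=True)   # any statistical classifier could stand in here
    return clf.fit(X, labels)
```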
  • The information fusion and decision 170 will output the type of phonation, such as, but not limited to, normal phonation, whispered speech phonation, low-level speech phonation, high-level speech phonation, and babble phonation. The output can feed a number of software applications, such as, but not limited to, speech recognition systems, speaker identification systems, language identification systems, and/or video gaming applications. For video gaming and simulated war games, the phonation style becomes part of the game, where how a person speaks is as important as what is being said and how the controller is being used. Additionally, phonation style detection can be used in hearing aids to detect whisper speech and convert high-frequency, non-phonated information to normally phonated speech in a lower frequency region. Phonation style detection can also be used for voice coaching to give the subject feedback as to his/her pronunciation and style of pronunciation.
  • Uses of the present invention include but are not limited to the following:
      • PTSD detection;
      • Screening/testing: thyroid cancer screening, eHealth over the phone diagnosis, stroke detection, intoxication detection, heart attack detection;
      • Hearing aids;
      • Pain management monitoring, pain level detection for non-verbal patients (including animals);
      • Improved closed caption translation;
      • Military—information exploitation, simulated war games;
      • Gaming software—vocal control;
      • Voice coaching;
      • Call center—customer service rep assist, auditory cues, suicide prevention;
      • Internet of Things—robotic caregivers, accents for robots, appliance language tone/accent;
      • Microphone improvement;
      • Apps—voice change tracker, meditation alert, health monitor, mood/calming, whisper transcription, tonal detection and feedback, PIN/code replacement, voice to text with emotion, mental health monitor;
      • Security—prioritized listening, voice interception, scanning chatter, crowd monitoring, airport monitoring/TSA, profiling, threat detection, interrogation;
  • Having described preferred embodiments of the invention with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various changes and modifications may be effected therein by one skilled in the art without departing from the scope or spirit of the invention as defined in the appended claims.

Claims (16)

What is claimed is:
1. A method for phonation style detection, comprising:
detecting speech activity in a signal;
extracting signal features from said detected speech activity;
characterizing said extracted signal features; and
performing a decision process on said characterized signal features which determines whether said detected speech activity is one of normally spoken speech, loudly spoken speech, softly spoken speech, whisper speech, babble, and non-voice sound.
2. In the method of claim 1, characterizing further comprises characterizing said extracted signal features in terms of harmonic measure, signal energy, mixed-excitation, clipping, and voicing.
3. In the method of claim 2, performing a decision process further comprises
classifying said speech activity as non-voice sounds in the absence of harmonics and low energy signal features.
4. In the method of claim 2, performing a decision process further comprises
classifying said speech activity as softly spoken speech in the absence of harmonics and low energy signal features but in the presence of voicing signal features.
5. In the method of claim 2, performing a decision process further comprises
classifying said speech activity as babble in the presence of harmonics and mixed excitation signal features.
6. In the method of claim 2, performing a decision process further comprises
classifying said speech activity as loudly spoken speech in the presence of harmonics and clipping but in the absence of mixed excitation signal features.
7. In the method of claim 2, performing a decision process further comprises
classifying said speech activity as normally spoken speech in the presence of harmonics but in the absence of clipping and mixed excitation signal features.
8. In the method of claim 2, performing a decision process further comprises
classifying said speech activity as whisper speech in the absence of harmonics and voicing signal features but in the presence of low energy signal features.
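
The classification rules of claims 3 through 8 can be viewed as a short decision cascade over boolean feature indicators. The Python sketch below is one illustrative reading of that cascade, not the patented implementation; the indicator names, the test ordering, and the simplified handling of the low-energy condition in claims 3 and 4 are assumptions made for clarity, and the claims themselves govern the precise conditions.

```python
def classify_phonation(harmonics, mixed_excitation, clipping, voicing, low_energy):
    """Illustrative decision cascade; one possible reading of claims 3-8."""
    if harmonics:
        if mixed_excitation:
            return "babble"                 # claim 5: harmonics + mixed excitation
        if clipping:
            return "loudly spoken speech"   # claim 6: harmonics + clipping, no mixed excitation
        return "normally spoken speech"     # claim 7: harmonics, no clipping, no mixed excitation
    if low_energy and not voicing:
        return "whisper speech"             # claim 8: no harmonics, no voicing, low energy present
    if voicing:
        return "softly spoken speech"       # claim 4 (simplified): no harmonics, voicing present
    return "non-voice sound"                # claim 3 (simplified): no harmonics, no voicing
```
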
9. In the method of claim 1, speech activity detection is performed on substantially 10 to 30 millisecond blocks of said signal.
10. In the method of claim 1, speech activity detection further comprises measurement of any one of the following: energy, pitch extraction, autocorrelation, spectral tilt, and cepstral coefficients.
11. In the method of claim 10, said speech activity detection further comprises coupling said measurements with classifiers and learning algorithms.
12. In the method of claim 11, said classifiers and learning algorithms are selected from the group comprising support vector machines, neural networks, and Gaussian mixture models.
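
Claims 10 through 12 couple per-frame measurements (energy, pitch extraction, autocorrelation, spectral tilt, cepstral coefficients) with a trained classifier. As a minimal sketch of that idea, the fragment below fits a two-component Gaussian mixture model to frame-level features so that one component tends to capture speech frames and the other background; the specific features (log energy and zero-crossing rate), the frame length, and the way the speech component is selected are assumptions for illustration only, not the method of the claims.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def frame_features(x, frame_len):
    """Per-frame log energy and zero-crossing rate (illustrative features only)."""
    frames = x[: len(x) // frame_len * frame_len].reshape(-1, frame_len)
    log_energy = np.log(np.sum(frames ** 2, axis=1) + 1e-12)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return np.column_stack([log_energy, zcr])

def speech_activity_mask(x, frame_len=320):   # 320 samples = 20 ms at 16 kHz, within claim 9's range
    """Label each frame speech/non-speech with a 2-component GMM (sketch)."""
    feats = frame_features(x, frame_len)
    gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0).fit(feats)
    labels = gmm.predict(feats)
    speech_label = int(np.argmax(gmm.means_[:, 0]))  # component with the higher mean log energy
    return labels == speech_label
```
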
13. In the method of claim 10, said measurement of energy further comprises computing discrete signal energy in the time domain according to:

E(n) = Σ_{m=0}^{N−1} [w(m) s(n−m)]²
where
w(m) is a weighting function;
s is signal amplitude;
n is a current discrete time sample;
m is the index of a sample within said N-sample time window; and
E(n) is the computed signal energy over said time window.
14. In the method of claim 13, said weighting function w(m) is selected from the group consisting of rectangular, hamming and triangular window functions.
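
The windowed time-domain energy of claims 13 and 14 maps directly onto a few lines of NumPy. In the sketch below, the frame length N and the mapping of the claimed window choices onto NumPy's rectangular, Hamming, and Bartlett (triangular) windows are assumptions of this illustration.

```python
import numpy as np

def short_time_energy(s, n, N=256, window="hamming"):
    """E(n) = sum_{m=0}^{N-1} [w(m) * s(n - m)]^2  (claim 13), valid for n >= N - 1."""
    w = {"rectangular": np.ones(N),
         "hamming": np.hamming(N),
         "triangular": np.bartlett(N)}[window]   # window choices of claim 14
    frame = s[n - N + 1 : n + 1][::-1]           # frame[m] = s(n - m), m = 0..N-1
    return float(np.sum((w * frame) ** 2))
```
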
15. In the method of claim 10, said measurement of energy further comprises computing signal energy in the frequency domain according to:

E = Σ_{f=f1}^{f2} |X[f]|²
where
X[f] is the Fourier transform of said signal;
f1 and f2 are the Fourier transform frequency limits; and
E is the computed energy of said signal.
16. In the method of claim 15, f1 is 0 and f2 is one-half the Nyquist rate.
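
Similarly, the frequency-domain energy of claims 15 and 16 can be sketched with an FFT. In the fragment below the summation band defaults to 0 Hz up to half the sampling rate, which is one reading of the "one-half the Nyquist rate" limit in claim 16; the function and parameter names are illustrative rather than taken from the specification.

```python
import numpy as np

def band_energy(frame, sample_rate, f1=0.0, f2=None):
    """E = sum over f in [f1, f2] of |X[f]|^2  (claim 15)."""
    X = np.fft.rfft(frame)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    if f2 is None:
        f2 = sample_rate / 2.0      # one reading of claim 16's upper frequency limit
    band = (freqs >= f1) & (freqs <= f2)
    return float(np.sum(np.abs(X[band]) ** 2))
```
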
US15/434,164 2016-11-16 2017-02-16 Phonation Style Detection Abandoned US20180137880A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/434,164 US20180137880A1 (en) 2016-11-16 2017-02-16 Phonation Style Detection

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201662422611P 2016-11-16 2016-11-16
US15/434,164 US20180137880A1 (en) 2016-11-16 2017-02-16 Phonation Style Detection

Publications (1)

Publication Number Publication Date
US20180137880A1 true US20180137880A1 (en) 2018-05-17

Family

ID=62107298

Family Applications (3)

Application Number Title Priority Date Filing Date
US15/397,142 Active US10121011B2 (en) 2016-11-16 2017-01-03 Apparatus, method and article of manufacture for partially resisting hardware trojan induced data leakage in sequential logics
US15/423,686 Active 2037-03-25 US10091092B2 (en) 2016-11-16 2017-02-03 Pseudorandom communications routing
US15/434,164 Abandoned US20180137880A1 (en) 2016-11-16 2017-02-16 Phonation Style Detection

Family Applications Before (2)

Application Number Title Priority Date Filing Date
US15/397,142 Active US10121011B2 (en) 2016-11-16 2017-01-03 Apparatus, method and article of manufacture for partially resisting hardware trojan induced data leakage in sequential logics
US15/423,686 Active 2037-03-25 US10091092B2 (en) 2016-11-16 2017-02-03 Pseudorandom communications routing

Country Status (1)

Country Link
US (3) US10121011B2 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109067765A (en) * 2018-08-30 2018-12-21 四川创客知佳科技有限公司 Communication management method for Internet of Things security system
CN109246209A (en) * 2018-08-30 2019-01-18 广元量知汇科技有限公司 Forestry Internet of Things secure communication management method
CN110111776A (en) * 2019-06-03 2019-08-09 清华大学 Interactive voice based on microphone signal wakes up electronic equipment, method and medium
US20190311711A1 (en) * 2018-04-10 2019-10-10 Futurewei Technologies, Inc. Method and device for processing whispered speech
US20190348021A1 (en) * 2018-05-11 2019-11-14 International Business Machines Corporation Phonological clustering
US11848019B2 (en) 2021-06-16 2023-12-19 Hewlett-Packard Development Company, L.P. Private speech filterings

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3220304B1 (en) * 2016-02-22 2018-11-07 Eshard Method of testing the resistance of a circuit to a side channel analysis
US20180089426A1 (en) * 2016-09-29 2018-03-29 Government Of The United States As Represented By The Secretary Of The Air Force System, method, and apparatus for resisting hardware trojan induced leakage in combinational logics
CN109474641B (en) * 2019-01-03 2020-05-12 清华大学 Reconfigurable switch forwarding engine resolver capable of destroying hardware trojans
TWI743692B (en) 2020-02-27 2021-10-21 威鋒電子股份有限公司 Hardware trojan immunity device and operation method thereof
CN114692227B (en) * 2022-03-29 2023-05-09 电子科技大学 Large-scale chip netlist-level hardware Trojan detection method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5862238A (en) * 1995-09-11 1999-01-19 Starkey Laboratories, Inc. Hearing aid having input and output gain compression circuits
US20160293172A1 (en) * 2012-10-15 2016-10-06 Digimarc Corporation Multi-mode audio recognition and auxiliary data encoding and decoding

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4941143A (en) * 1987-11-10 1990-07-10 Echelon Systems Corp. Protocol for network having a plurality of intelligent cells
US5034882A (en) * 1987-11-10 1991-07-23 Echelon Corporation Multiprocessor intelligent cell for a network which provides sensing, bidirectional communications and control
US5113498A (en) * 1987-11-10 1992-05-12 Echelon Corporation Input/output section for an intelligent cell which provides sensing, bidirectional communications and control
US5784631A (en) * 1992-06-30 1998-07-21 Discovision Associates Huffman decoder
US7840803B2 (en) * 2002-04-16 2010-11-23 Massachusetts Institute Of Technology Authentication of integrated circuits
US8255443B2 (en) * 2008-06-03 2012-08-28 International Business Machines Corporation Execution unit with inline pseudorandom number generator
US20120063597A1 (en) * 2010-09-15 2012-03-15 Uponus Technologies, Llc. Apparatus and associated methodology for managing content control keys
EP2512061A1 (en) * 2011-04-15 2012-10-17 Hanscan IP B.V. System for conducting remote biometric operations
US9218511B2 (en) * 2011-06-07 2015-12-22 Verisiti, Inc. Semiconductor device having features to prevent reverse engineering
US9264892B2 (en) * 2013-07-03 2016-02-16 Verizon Patent And Licensing Inc. Method and apparatus for attack resistant mesh networks
CN104580027B (en) 2013-10-25 2018-03-20 新华三技术有限公司 A kind of OpenFlow message forwarding methods and equipment
CN104579968B (en) 2013-10-26 2018-03-09 华为技术有限公司 SDN switch obtains accurate flow table item method and SDN switch, controller, system
US20170118066A1 (en) 2014-04-30 2017-04-27 Hewlett-Packard Development Company, L.P. Data plane to forward traffic based on communications from a software defined (sdn) controller during a control plane failure
CN105099920A (en) 2014-04-30 2015-11-25 杭州华三通信技术有限公司 Method and device for setting SDN flow entry
US9692689B2 (en) 2014-08-27 2017-06-27 International Business Machines Corporation Reporting static flows to a switch controller in a software-defined network (SDN)
JP5922203B2 (en) * 2014-08-29 2016-05-24 株式会社日立製作所 Semiconductor device
US9917769B2 (en) 2014-11-17 2018-03-13 Telefonaktiebolaget Lm Ericsson (Publ) Method and system for virtualizing flow tables in a software-defined networking (SDN) system
US9794055B2 (en) * 2016-03-17 2017-10-17 Intel Corporation Distribution of forwarded clock

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5862238A (en) * 1995-09-11 1999-01-19 Starkey Laboratories, Inc. Hearing aid having input and output gain compression circuits
US20160293172A1 (en) * 2012-10-15 2016-10-06 Digimarc Corporation Multi-mode audio recognition and auxiliary data encoding and decoding

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Gaurang Parikh and Philipos C. Loizou, "The influence of noise on vowel and consonant cues," Department of Electrical Engineering, University of Texas at Dallas, Richardson, Texas 75083-0688; 20 September 2005. URL: http://ecs.utdallas.edu/loizou/cimplants/jasa_noise_effect_dec05.pdf *
S. Nandhini and A. Shenbagavalli, "Voiced/Unvoiced Detection using Short Term Processing," International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS-2014). URL: https://pdfs.semanticscholar.org/c20b/e6aaeaed889444e225e2bd18a6eb44e9cb36.pdf *
Signal Processing Stack Exchange, "Energy calculation in frequency domain," asked and answered 27 January 2015. URL: https://dsp.stackexchange.com/questions/20246/energy-calculation-in-frequency-domain *
Stanley J. Wenndt, Edward J. Cupples and Richard M. Floyd, "A Study on the Classification of Whispered and Normally Phonated Speech," 7th International Conference on Spoken Language Processing (ICSLP 2002). URL: https://pdfs.semanticscholar.org/05ec/a131bfc4b7106aa229c261b518894158d4fd.pdf *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190311711A1 (en) * 2018-04-10 2019-10-10 Futurewei Technologies, Inc. Method and device for processing whispered speech
US10832660B2 (en) * 2018-04-10 2020-11-10 Futurewei Technologies, Inc. Method and device for processing whispered speech
US20190348021A1 (en) * 2018-05-11 2019-11-14 International Business Machines Corporation Phonological clustering
US10943580B2 (en) * 2018-05-11 2021-03-09 International Business Machines Corporation Phonological clustering
CN109067765A (en) * 2018-08-30 2018-12-21 四川创客知佳科技有限公司 Communication management method for Internet of Things security system
CN109246209A (en) * 2018-08-30 2019-01-18 广元量知汇科技有限公司 Forestry Internet of Things secure communication management method
CN110111776A (en) * 2019-06-03 2019-08-09 清华大学 Interactive voice based on microphone signal wakes up electronic equipment, method and medium
US11848019B2 (en) 2021-06-16 2023-12-19 Hewlett-Packard Development Company, L.P. Private speech filterings

Also Published As

Publication number Publication date
US10121011B2 (en) 2018-11-06
US10091092B2 (en) 2018-10-02
US20180139119A1 (en) 2018-05-17
US20180137290A1 (en) 2018-05-17

Similar Documents

Publication Publication Date Title
US20180137880A1 (en) Phonation Style Detection
US11004461B2 (en) Real-time vocal features extraction for automated emotional or mental state assessment
Hansen Morphological constrained feature enhancement with adaptive cepstral compensation (MCE-ACC) for speech recognition in noise and Lombard effect
Ibrahim Preprocessing technique in automatic speech recognition for human computer interaction: an overview
Mak et al. A study of voice activity detection techniques for NIST speaker recognition evaluations
JP4568371B2 (en) Computerized method and computer program for distinguishing between at least two event classes
Yegnanarayana et al. Epoch-based analysis of speech signals
US20050171774A1 (en) Features and techniques for speaker authentication
Chowdhury et al. Bayesian on-line spectral change point detection: a soft computing approach for on-line ASR
EP2083417B1 (en) Sound processing device and program
WO2011046474A2 (en) Method for identifying a speaker based on random speech phonograms using formant equalization
Hansen et al. Speech enhancement based on generalized minimum mean square error estimators and masking properties of the auditory system
Pohjalainen et al. Shout detection in noise
Bhangale et al. Synthetic speech spoofing detection using MFCC and radial basis function SVM
Nathwani et al. Speech intelligibility improvement in car noise environment by voice transformation
Gallardo Human and automatic speaker recognition over telecommunication channels
Garg et al. A comparative study of noise reduction techniques for automatic speech recognition systems
KR20210000802A (en) Artificial intelligence voice recognition processing method and system
Tzudir et al. Analyzing RMFCC feature for dialect identification in Ao, an under-resourced language
JP2797861B2 (en) Voice detection method and voice detection device
Galić et al. Whispered speech recognition using hidden markov models and support vector machines
Bhukya et al. End point detection using speech-specific knowledge for text-dependent speaker verification
Turan Enhancement of throat microphone recordings using gaussian mixture model probabilistic estimator
Joseph et al. Indian accent detection using dynamic time warping
Alam et al. Neural response based phoneme classification under noisy condition

Legal Events

Date Code Title Description
STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION