GB2499781A - Acoustic information used to determine a user's mouth state which leads to operation of a voice activity detector - Google Patents


Info

Publication number
GB2499781A
Authority
GB
United Kingdom
Prior art keywords
mouth
signal
frequency
proximity
ultrasonic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB201202662A
Other versions
GB201202662D0 (en)
Inventor
Ian Vince Mcloughlin
Faraneh Ahmadi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to GB201202662A priority Critical patent/GB2499781A/en
Publication of GB201202662D0 publication Critical patent/GB201202662D0/en
Publication of GB2499781A publication Critical patent/GB2499781A/en
Withdrawn legal-status Critical Current

Classifications

    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01S - RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S7/00 - Details of systems according to groups G01S13/00, G01S15/00, G01S17/00
    • G01S7/52 - Details of systems according to groups G01S13/00, G01S15/00, G01S17/00 of systems according to group G01S15/00
    • G01S7/539 - Details of systems according to groups G01S13/00, G01S15/00, G01S17/00 of systems according to group G01S15/00 using analysis of echo signal for target characterisation; Target signature; Target cross-section
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/75 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 for modelling vocal tract parameters
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G10L25/84 - Detection of presence or absence of voice signals for discriminating voice from noise

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

An apparatus and a method for using low-frequency ultrasonic information to improve speech communications, recognition or other processing tasks through determination of mouth state, openness, orientation, proximity, shape and so on. A low-frequency ultrasonic signal, which can be a chirp pulse train, is generated within a device located in the proximity of the mouth and transmitted towards the user's face, and the acoustic signal reflected from the human face is received back at this or another cooperating device. The received signals are analysed in both time-domain and frequency-domain representations to reveal specific and useful information pertaining to the user's mouth state. The system comprises a voice activity detector (VAD), and the mouth-state information is used to ensure that it operates only when speech is present and not when only background noise is present.

Description

Method and apparatus for mouth state determination using acoustic information
BACKGROUND
Devices that capture human speech generally use microphones to pick up the sound, but they also pick up interfering noise (such as extraneous acoustic background noise and electrical noise) at the same time. This interfering noise is mixed in with the speech and corrupts, or reduces the quality of, the recorded signal.
In general terms, the further away the microphone is from the mouth, the greater the proportion of background noise that is picked up. Although some types of background noise can be largely removed using post-processing, many common types of noise cannot be effectively removed. Also, different levels of background noise exist in different usage situations. For example, a library environment may exhibit very low levels of background noise, whereas a busy railway platform may exhibit high levels. Most speech systems are required to be capable of operating in both classes of environment.
Systems such as mobile telephones commonly include a voice activity detector (VAD), which triggers whenever a spoken voice is detected. Such systems will often only process and transmit sound once the VAD has been triggered. At other times they will remain idle (and thus consume less power). This can be true for mobile phones, video conferencing systems, speech recognition systems and voice recorders. The VAD will often have a "hang time" of around a second, meaning that it will remain turned on for this length of time even after it has detected that speech has ceased.
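For illustration only (this sketch is not part of the patent), a conventional energy-based VAD with the hang-time behaviour described above might look like the following; the frame rate, energy threshold and one-second hang time are assumed, representative values.

```python
import numpy as np

def energy_vad_with_hang_time(frames, energy_threshold=1e-3,
                              frame_rate=100, hang_time_s=1.0):
    """Toy energy-based VAD: a frame is flagged as 'speech' when its energy
    exceeds a threshold, and the flag is held high for a hang time afterwards."""
    hang_frames = int(hang_time_s * frame_rate)  # e.g. 1 s at 100 frames/s
    countdown = 0
    decisions = []
    for frame in frames:
        if np.mean(np.square(frame)) > energy_threshold:
            countdown = hang_frames              # (re)start the hang timer
        decisions.append(countdown > 0)
        countdown = max(0, countdown - 1)
    return decisions
```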
In normal use in a mobile phone scenario, if the user of such a system is involved in a conversation, he may typically speak for only 40% of the time. The use of a VAD switch means that the system can save 60% of the energy that would otherwise be spent on running complex coding algorithms. Likewise, the system can save up to 60% of the data that must be transmitted over the wireless connection.
In noisy environments, background noise will often trigger the VAD in such systems, even when the user is not speaking. Three negative consequences of this are that (i) the system assumes that speech is present when in fact it is not, and can consume a significant amount of energy attempting to encode or process speech that is not really there, (ii) in a communications system, the transmission channel (which would normally turn on to transmit useful data when speech is present) will be actively transmitting for a greater proportion of time, and will spend much of that time transmitting nothing more than noise, and (iii) non-speech noise signals will be mistaken for speech, and will therefore be processed, transmitted and eventually heard (as corrupting noise) at the other end of the communications system. This reduces the quality and impairs the intelligibility of the speech communications.
STATEMENT OF INVENTION
The present invention is able to overcome these issues in a number of ways. Firstly, it can determine the mouth state of a user (e.g. open, closed or in between). This is accomplished by examining the signals from one audio transducer, or two audio transducers (such as a microphone and a loudspeaker), in either a passive or an active configuration. In the active configuration, a transducer (e.g. a loudspeaker) generates an acoustic signal. This signal propagates, through the air, by skin contact or by body transmission, to the face and head of the user. In turn, an acoustic return signal is picked up a short time later by a transducer (e.g. a microphone). In a passive configuration, signals that are already present or generated elsewhere are used as the source.
The information gleaned from the return signal is not just a static indicator of whether the mouth is open or closed; it includes the dynamic statistics of mouth opening and closing, reveals face shape, and can differentiate degrees of mouth opening. The dynamic information reveals prosody, syllabic rate, word rate and sentence rate during speech. All of these are items of information useful to applications such as speech processing, and particularly speech recognition.
The invention, in its active configuration, transmits an acoustic signal such as a chirp pulse train (a linear swept-frequency cosine) from a source such as a loudspeaker towards the face of the user, and receives back an altered signal (due to reflection, conduction and other mechanisms). The signals used can be audible or inaudible (such as ultrasonic), and in the preferred embodiment the inventors use a chirp pulse train within a frequency range of approximately 14 to 21 kHz, which is slightly above the threshold of hearing for most people (this is termed low-frequency ultrasonics).
ADVANTAGES
Firstly, the system allows the operation of existing speech processing systems to be improved. For example, (i) systems need only operate when speech is present (as with the VAD, but without being triggered accidentally by background noises), (ii) systems need only transmit sound when speech is present and will not find themselves largely transmitting background noise, (iii) systems can ignore background noises - even very speech-like noises that would always trigger a VAD - that occur when the user's mouth is closed, and (iv) speech recognition and/or processing systems can determine syllabic, prosodic, word and sentence rates, and adapt their energy use, data transmission use and CPU scheduling patterns accordingly. This will reduce energy consumption and increase the recognition accuracy of received speech. It will also allow such systems to adapt to new users more quickly.
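As a minimal, hypothetical illustration of point (iii) above, a mouth-state flag derived from the ultrasonic analysis could simply gate an existing VAD decision; the function and its inputs are assumptions made for this sketch rather than an interface defined by the patent.

```python
def speech_active(vad_triggered: bool, mouth_open_recently: bool) -> bool:
    """Report speech activity only when the conventional VAD triggers AND the
    ultrasonic mouth-state analysis indicates the mouth is, or has recently
    been, open; speech-like background noise with a closed mouth is ignored."""
    return vad_triggered and mouth_open_recently
```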
Using a near-audible ultrasonic chirp (such as the one starting at 14kHz mentioned above), a user will not hear any sound, and yet normal transducers can be used, like the microphones and loudspeakers already built into most modern smart phones. This means that the present invention can be used in a smart phone without requiring any special additional hardware to be installed.
The signal processing necessary to decode and analyse the received acoustic signal can range from very low complexity techniques with reasonable accuracy that determine only an open or closed state, up to far more advanced techniques that yield degree of mouth opening, as well as the dynamic statistics mentioned above which are of great advantage to computer speech recognition and associated systems (e.g. automatic speech recognition, pass phrase recognition or validation, speaker recognition or validation, language detection or validation, emotional state detection and so on). The revealed face information can also be used for security purposes.
INTRODUCTION TO DRAWINGS
An example of the invention will now be described by referring to the accompanying drawings.
• Figure 1 shows the invention installed within a standard mobile phone, operated by a user.
• Figure 2 illustrates an acoustic signal propagating from the unit containing the invention, impinging upon the face of the user, and propagating back to the unit containing the invention, where it is captured and analysed.
• Figure 3 shows the computational analytical hardware connected to transducers (in this case a microphone and a loudspeaker).
• Figure 4 provides detail of the analytical system used within the computational hardware.
• Figure 5 shows how the output of the analytic system serves as an input to existing and common speech processing algorithms.
• Figure 6 shows the chirp signal as generated by the invention, showing normalised amplitude (y-axis) plotted against normalised time (x-axis).
• Figure 7 plots the received signal in the time domain after being reflected back from the mouth of the user (as amplitude against time), showing periods of closed mouth, followed by open mouth, and then closed mouth again. This plot spans a total of 15 received chirps.
• Figure 8 shows detection analysis results for mouth closed, open and then closed.
• Figure 9 shows a time-frequency analysis of the closed/open/closed reflected chirp signal.
DETAILED DESCRIPTION
In practice, the invention may exist within a mobile phone 101, where the presence of the invention may not even be obvious to the user 103, unless he notices the longer battery life and better quality speech transmission that the invention can lead to. In use, as the user locates the mobile phone in a natural orientation for speaking and listening, an acoustic signal 102 produced by the mobile phone impinges upon the user's face, lip and mouth area, is reflected, and is then received back by the device. A second embodiment can also be implemented as an external headset or other device used for vocal communications, which could also be attached to a mobile phone, to a sound recorder or to some other speech processing unit.
Hardware and software within the mobile phone 101 would pick up the received signal to determine the state of the user's face. The most important characteristic of this state for the present application is whether the mouth is open or closed. Whilst the present apparatus is designed to detect either an open or a closed mouth, identical hardware and signal processing using a different decision-making process could also be used to provide an estimate of the degree of mouth openness, and indeed also to gauge the proximity and orientation of the mobile phone with respect to the user's mouth.
The front-end device 101 preferably creates and causes an ultrasonic signal 102 to impinge upon the face of the user 103. The reflection received back at 101 has different signal characteristics depending upon the shape of the face 103 and the proximity of 101 to 103.
As with most current digital audio and speech systems, one or more units of computational hardware 110 drive a loudspeaker 124 through a digital-to-analogue converter 120 and conditioner 123. Similarly, received acoustic signals are captured with a microphone 119, conditioned 118 and then converted to a form suitable for the computational hardware using an analogue-to-digital converter 115.
The computational hardware block 110, as well as creating the signal to be output, also analyses the received information. Those skilled in the art would recognise that 119 and 124 could be the same physical transducer device, and that there is no fixed requirement for 110 to handle both receive and transmit; these signals could equally well be handled by separate hardware, but are combined within the present embodiment of this invention for reasons of cost and efficiency.
It should also be recognised that 110, 115, 118, 119, 120, 123 and 124 are not specialised hardware - they are common to the vast majority of modern audio equipment that is capable of both recording and playback of sound. One requirement is that the digital elements of the system, the analogue signal paths, and the transducers 119, 124 are all capable of handling the transmitted and received signals. For example, if the chirp frequency lies between 14 kHz and 21 kHz then the hardware should be capable of handling signals of up to 21 kHz. For the digital part, the well-known Nyquist criterion states that the sample rate should be at least 42 kHz. In practice, most digital audio systems are able to operate with sample rates of at least 44.1 kHz.
Ideally, the system should operate with inaudible signals so as not to impede normal operation. This implies either ultrasonic or infrasonic signals, although low-frequency ultrasonics are used in the preferred embodiment. These lie just above the threshold of human hearing (i.e. above 14 kHz for most people, or 20 kHz for 'golden-eared' audiophiles). We place no upper limit on this frequency; however, for this invention to be technologically attractive for use in smart phones, it is likely that the optimal signal range would need to be similar to that used in the demonstration system (approximately 14 kHz to 21 kHz): this can be handled by most existing speakers and microphones, does not require extra regulatory approval (which higher ultrasound frequencies might require), and is easy to both generate and process.
In fact, a plurality of possible signals could be transmitted from the device. We have found that many alternatives can be made to work, but preferably signals that spread their energy across the frequency band of interest. These include pulses, steps, white noise and chirps. Also, we prefer to actively generate this ultrasonic signal; however, the use and detection of passive signals is also possible.
Best results have been demonstrated by transmitting a linear chirp 300 that slides in frequency in equal steps from 14 kHz up to 21 kHz at a repetition rate of perhaps 0.3 Hz to 5 Hz (a reduced repetition rate does not affect the operation of the analysis, only how frequently the mouth open/mouth closed determination is made and the degree of computational power required to process it).
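For illustration, such a transmit signal 300 could be synthesised with standard tools as sketched below; the 48 kHz sample rate, 50 ms chirp length and 2 Hz repetition rate are assumed values within the ranges quoted above, not figures fixed by the patent.

```python
import numpy as np
from scipy.signal import chirp

def make_chirp_train(fs=48000, f0=14000.0, f1=21000.0,
                     chirp_len_s=0.05, rep_rate_hz=2.0, n_pulses=10):
    """Build a pulse train of linear chirps sweeping f0 -> f1.
    Each period contains one chirp of chirp_len_s seconds followed by silence."""
    period_s = 1.0 / rep_rate_hz
    t = np.arange(int(chirp_len_s * fs)) / fs
    pulse = chirp(t, f0=f0, t1=chirp_len_s, f1=f1, method='linear')
    silence = np.zeros(int((period_s - chirp_len_s) * fs))
    return np.tile(np.concatenate([pulse, silence]), n_pulses)
```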
The invention can be operated in real-time (that is, the analysis of mouth open/mouth closed is made immediately after each chirp has been received), or the signals may be recorded, stored, and processed in retrospect. In this case, the same determination is made, chirp-by-chirp at the time of analysis.
In general use, a signal generator block 171, operating inside the computation hardware unit 110, would produce the acoustic signal 300, called the excitation signal, in conjunction with the acoustic transmission transducer 124, nominally housed in a handset or headset 134 located as close to the mouth of the user 103 as possible.
As already described, the acoustic signal impinges upon the face of the user 103. The acoustic transducer 119 receives a signal 400 which may contain recognisable periodic chirp signals, but ones which differ from the originally transmitted chirps 300.
The analysis of the received signal 400 is key to this invention. One preferred method of analysis is to begin by comparing each transmitted chirp 300, as output from the signal generator 171, with the received chirp 400, as captured by the input transducer 134 and associated hardware. The time shift between the two signals yields information pertaining to the distance between the transducers 134 and the user's face 103.
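As a sketch of how that time shift might be measured in practice (not a method prescribed by the patent), the received signal can be cross-correlated with the known transmitted chirp and the lag of the correlation peak converted to a round-trip distance; the 343 m/s speed of sound below is an assumed illustrative value.

```python
import numpy as np
from scipy.signal import correlate

def estimate_round_trip_distance(tx_chirp, rx_signal, fs=48000, c=343.0):
    """Estimate transducer-to-face distance from the lag between the
    transmitted chirp and its reflection (speed of sound c in m/s)."""
    corr = correlate(rx_signal, tx_chirp, mode='full')
    lag = np.argmax(np.abs(corr)) - (len(tx_chirp) - 1)  # delay in samples
    delay_s = max(lag, 0) / fs
    return 0.5 * c * delay_s          # halve: out-and-back path
```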
The amplitude envelope of the received chirp 400 also reveals the resonant frequencies of the spatial resonant chamber formed between the loudspeaker, face/mouth and the microphone (i.e. between 134 and 103). Very clearly, the resonant pattern in the received signal 400 changes between the two conditions of mouth open 402 and mouth closed 401.
The simplest explanation is that the human vocal tract is a highly resonant cavity, so opening the mouth provides the transmitted ultrasonic signal 300 with a frequency selective resonance chamber (comprising the mouth plus vocal tract, and taking into account the resonances between the transducers and face).
As a result, the envelope of the received signal 400 in the open-mouth state 402 shows significant peaks and troughs indicating the resonances within the mouth/vocal tract/face system (similar to the occurrence of formants in audible speech). This is contrary to the closed-mouth state 401, in which the received signal is generally simply a reflection of the chirp from the facial skin, hence it is still chirp-like in shape.
In a dynamic context, mouth open and mouth closed conditions are determined through the difference between these chirp responses (i.e. the change in chirp response as the mouth is opened and closed). In a static context, the determination can be made by comparing amplitudes at various frequency positions. It is also possible, as anyone skilled in the art would know, that pattern matching and/or parametric determination can be used either in the frequency domain or in the time domain to interpret these signals.
A preferred analysis method uses a double approach algorithm. The received 14-21 kHz chirp signal 400, sampled at 96 kHz or 48 kHz, is first demodulated to baseband to cover the frequency span of 0-7 kHz, and then re-sampled at 32 kHz. It is next segmented into overlapping segments; a segment length of around one second works well. For each segment, the beginning of the chirp can be detected using autocorrelation 172 between the known generated chirp 300 and the segment being analysed 175.
The resulting autocorrelation or comparison output has clear peaks at the beginning of the chirp pulses. Since the linear source-filter theory applies to the ultrasonic excitations of the vocal tract (VT), the VT acts as the filter for the chirp signal (the source, 300). Consequently, the envelope of the received reflected chirp can be considered analogous to the frequency spectrum of the vocal tract. This envelope, 416, is extracted as the basis for detecting the status of the mouth.
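A minimal sketch of this front end - demodulation of the 14-21 kHz band to 0-7 kHz, resampling to 32 kHz, segmentation into overlapping one-second pieces, chirp-onset detection by correlation, and extraction of the envelope 416 - is given below. The mix-and-low-pass demodulation, the Butterworth filter and the Hilbert-transform envelope are assumptions about one plausible implementation; the patent does not prescribe these particular methods.

```python
import numpy as np
from scipy.signal import butter, filtfilt, resample_poly, correlate, hilbert

def demodulate_to_baseband(x, fs=48000, f_lo=14000.0, bw=7000.0):
    """Mix the 14-21 kHz chirp band down to 0-7 kHz and low-pass filter
    (one plausible demodulation method, assumed for illustration)."""
    t = np.arange(len(x)) / fs
    mixed = x * np.cos(2 * np.pi * f_lo * t)      # shift the band down by 14 kHz
    b, a = butter(6, bw / (fs / 2), btype='low')  # keep roughly 0-7 kHz
    return filtfilt(b, a, mixed)

def find_chirp_onsets(baseband, ref_chirp_bb, fs_in=48000, fs_out=32000,
                      seg_len_s=1.0, hop_s=0.5):
    """Resample to 32 kHz, segment into overlapping ~1 s pieces and locate the
    chirp start in each segment by correlation with the known baseband chirp."""
    y = resample_poly(baseband, fs_out, fs_in)
    seg_len, hop = int(seg_len_s * fs_out), int(hop_s * fs_out)
    onsets = []
    for start in range(0, len(y) - seg_len + 1, hop):
        corr = correlate(y[start:start + seg_len], ref_chirp_bb, mode='valid')
        onsets.append(start + int(np.argmax(np.abs(corr))))
    return y, onsets

def chirp_envelope(chirp_segment):
    """Amplitude envelope (416) of one received chirp, here via the analytic
    signal (Hilbert transform) -- an assumed, not prescribed, method."""
    return np.abs(hilbert(chirp_segment))
```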
A preferred method is to apply twin detecting approaches to determine the mouth status. The first considers the increase in the number of peaks in the received frequency spectrum 400, 410, when the mouth is open. Since in the open state 402 VT resonances appear in the response, the number of envelope 416 peaks dramatically increases, and this can be counted for decision making, 417. The second approach assumes the resonances of the vocal tract to have distinct peaks and troughs. Considering Xp to denote the peaks, Xv to denote the valleys, and Up and Uv to denote the mean values of each, a simple metric, C, can be derived:
C = E[(Xp - Up)²] + E[(Xv - Uv)²]
The metric C thus indicates the variance of the peaks and valleys in the power spectrum and shows a clear increase when the mouth opens, which can be used as an indication of the mouth state. Applying a threshold to C yields an indication 414 of the mouth state. Other detection approaches are usable, including zero-crossing rate, kurtosis determination and differential energy.
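The twin detection approaches could then be applied to that envelope as sketched below: counting envelope peaks, and computing the peak/valley variance metric C defined above. The peak-prominence setting and the decision thresholds are assumed, illustrative values; the patent states only that a threshold is applied to C.

```python
import numpy as np
from scipy.signal import find_peaks

def mouth_state_metrics(envelope):
    """Return (number of envelope peaks, C) where
    C = E[(Xp - Up)^2] + E[(Xv - Uv)^2] over envelope peaks Xp and valleys Xv."""
    prominence = 0.05 * np.max(envelope)          # assumed peak-picking setting
    peaks, _ = find_peaks(envelope, prominence=prominence)
    valleys, _ = find_peaks(-envelope, prominence=prominence)
    xp, xv = envelope[peaks], envelope[valleys]
    if len(xp) == 0 or len(xv) == 0:
        return len(peaks), 0.0
    c = np.mean((xp - xp.mean()) ** 2) + np.mean((xv - xv.mean()) ** 2)
    return len(peaks), c

def mouth_is_open(envelope, peak_count_threshold=10, c_threshold=0.01):
    """Illustrative decision 414/417: declare the mouth open if either the
    peak count or the variance metric C exceeds its (assumed) threshold."""
    n_peaks, c = mouth_state_metrics(envelope)
    return n_peaks > peak_count_threshold or c > c_threshold
```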
It should also be clear to those skilled in the art that frequency domain analysis can equally be used. In this case the received reflected chirp signal 400, 410, 420 can be analysed either continually or on a frame-by-frame basis by being converted to the frequency domain 173. The resulting analysis 176 could include time-frequency analysis or any one of a plethora of similar techniques. The resulting signal 420, similar to 410, shows a smooth spectrum when the mouth is closed, but as soon as it opens, 421, the spectrum becomes significantly more peaky. Once the mouth closes, 422, the spectrum becomes smooth again.
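As one assumed way of quantifying the 'peakiness' described here, spectral flatness within the chirp band could be computed frame by frame from a time-frequency analysis; the STFT parameters and the use of flatness as the measure are illustrative choices, not taken from the patent.

```python
import numpy as np
from scipy.signal import stft

def band_spectral_flatness(received, fs=48000, band=(14000.0, 21000.0),
                           nperseg=1024):
    """Per-frame spectral flatness inside the chirp band: near 1 for a smooth
    (closed-mouth) spectrum 410/420, much lower for a peaky (open-mouth) one 421."""
    f, t, Z = stft(received, fs=fs, nperseg=nperseg)
    in_band = (f >= band[0]) & (f <= band[1])
    power = np.abs(Z[in_band, :]) ** 2 + 1e-12    # avoid log(0)
    flatness = np.exp(np.mean(np.log(power), axis=0)) / np.mean(power, axis=0)
    return t, flatness                            # low flatness suggests mouth open
```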

Claims (16)

1. A system that uses acoustic information to detect the proximity, shape, orientation and features of a nearby human face.
2. A system according to claim 1 in which the acoustic information comprises ultrasonic sounds.
3. A system according to claim 2 in which the ultrasonic sounds lie just above the threshold of human hearing, referred to here as "low-frequency ultrasonic signals".
4. A system according to claim 1 that generates an acoustic signal internally, and outputs this from a transducer such as a loudspeaker.
5. A system according to claim 1 that receives acoustic information from a transducer such as a microphone.
6. A system according to claim 4 in which the acoustic signal is generated according to a predefined or adjustable specification.
7. A system according to claims 3 and 6 in which the received ultrasonic signal is stored or processed periodically within a portable or mobile device or headset.
8. A system according to claims 4 and 5 in which the generated signal is a swept-frequency signal.
9. A system according to claim 3 in which swept-frequency low-frequency ultrasonic signals, reflected from a human face, are captured and analysed to reveal the proximity or features of that face.
10. A system according to claim 9 that determines the degree of mouth opening or mouth shape by analysing the captured acoustic signal.
11. A method that uses low-frequency ultrasonic reflection from the human face to detect the mouth state and/or proximity.
12. A system that uses standard audio hardware to generate near-audible low-frequency ultrasonic swept-frequency signals and receive and analyse the reflected version of these same signals.
13. A system that excites the human mouth, nasal tract and vocal tract from the proximity of the mouth using ultrasonic excitation.
14. A system according to claim 13 that obtains resonances at feature sizes comparable to those of audible speech by making use of low-frequency ultrasonic excitation.
15. A system that uses a co-located excitation generator such as loudspeaker or buzzer with a detector or transducer such as a microphone, located in the proximity of the human mouth, for the purpose of determining mouth shape.
16. A system according to claim 12 that measures the presence of peaks in the time domain received signal to detect resonances.
GB201202662A 2012-02-16 2012-02-16 Acoustic information used to determine a user's mouth state which leads to operation of a voice activity detector Withdrawn GB2499781A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB201202662A GB2499781A (en) 2012-02-16 2012-02-16 Acoustic information used to determine a user's mouth state which leads to operation of a voice activity detector

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB201202662A GB2499781A (en) 2012-02-16 2012-02-16 Acoustic information used to determine a user's mouth state which leads to operation of a voice activity detector

Publications (2)

Publication Number Publication Date
GB201202662D0 GB201202662D0 (en) 2012-04-04
GB2499781A true GB2499781A (en) 2013-09-04

Family

ID=45939716

Family Applications (1)

Application Number Title Priority Date Filing Date
GB201202662A Withdrawn GB2499781A (en) 2012-02-16 2012-02-16 Acoustic information used to determine a user's mouth state which leads to operation of a voice activity detector

Country Status (1)

Country Link
GB (1) GB2499781A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5706394A (en) * 1993-11-30 1998-01-06 At&T Telecommunications speech signal improvement by reduction of residual noise
US6483532B1 (en) * 1998-07-13 2002-11-19 Netergy Microelectronics, Inc. Video-assisted audio signal processing system and method
US20030128848A1 (en) * 2001-07-12 2003-07-10 Burnett Gregory C. Method and apparatus for removing noise from electronic signals
EP1443498A1 (en) * 2003-01-24 2004-08-04 Sony Ericsson Mobile Communications AB Noise reduction and audio-visual speech activity detection
WO2004077090A1 (en) * 2003-02-25 2004-09-10 Oticon A/S Method for detection of own voice activity in a communication device
WO2010048635A1 (en) * 2008-10-24 2010-04-29 Aliphcom, Inc. Acoustic voice activity detection (avad) for electronic systems

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI474317B (en) * 2012-07-06 2015-02-21 Realtek Semiconductor Corp Signal processing apparatus and signal processing method
US8972252B2 (en) 2012-07-06 2015-03-03 Realtek Semiconductor Corp. Signal processing apparatus having voice activity detection unit and related signal processing methods
CN107076840A (en) * 2014-10-02 2017-08-18 美商楼氏电子有限公司 Acoustic equipment with double MEMS devices
US11042616B2 (en) 2017-06-27 2021-06-22 Cirrus Logic, Inc. Detection of replay attack
US10770076B2 (en) 2017-06-28 2020-09-08 Cirrus Logic, Inc. Magnetic detection of replay attack
US11704397B2 (en) 2017-06-28 2023-07-18 Cirrus Logic, Inc. Detection of replay attack
US11164588B2 (en) 2017-06-28 2021-11-02 Cirrus Logic, Inc. Magnetic detection of replay attack
US10853464B2 (en) 2017-06-28 2020-12-01 Cirrus Logic, Inc. Detection of replay attack
US11714888B2 (en) 2017-07-07 2023-08-01 Cirrus Logic Inc. Methods, apparatus and systems for biometric processes
US11042618B2 (en) 2017-07-07 2021-06-22 Cirrus Logic, Inc. Methods, apparatus and systems for biometric processes
US11755701B2 (en) 2017-07-07 2023-09-12 Cirrus Logic Inc. Methods, apparatus and systems for authentication
US11829461B2 (en) 2017-07-07 2023-11-28 Cirrus Logic Inc. Methods, apparatus and systems for audio playback
US10984083B2 (en) 2017-07-07 2021-04-20 Cirrus Logic, Inc. Authentication of user using ear biometric data
US11042617B2 (en) 2017-07-07 2021-06-22 Cirrus Logic, Inc. Methods, apparatus and systems for biometric processes
US10847165B2 (en) 2017-10-13 2020-11-24 Cirrus Logic, Inc. Detection of liveness
US10839808B2 (en) 2017-10-13 2020-11-17 Cirrus Logic, Inc. Detection of replay attack
US11023755B2 (en) 2017-10-13 2021-06-01 Cirrus Logic, Inc. Detection of liveness
US11017252B2 (en) 2017-10-13 2021-05-25 Cirrus Logic, Inc. Detection of liveness
US10832702B2 (en) 2017-10-13 2020-11-10 Cirrus Logic, Inc. Robustness of speech processing system against ultrasound and dolphin attacks
US11705135B2 (en) 2017-10-13 2023-07-18 Cirrus Logic, Inc. Detection of liveness
US11270707B2 (en) 2017-10-13 2022-03-08 Cirrus Logic, Inc. Analysing speech signals
US11276409B2 (en) 2017-11-14 2022-03-15 Cirrus Logic, Inc. Detection of replay attack
US11051117B2 (en) 2017-11-14 2021-06-29 Cirrus Logic, Inc. Detection of loudspeaker playback
US10616701B2 (en) 2017-11-14 2020-04-07 Cirrus Logic, Inc. Detection of loudspeaker playback
US11264037B2 (en) 2018-01-23 2022-03-01 Cirrus Logic, Inc. Speaker identification
US11475899B2 (en) 2018-01-23 2022-10-18 Cirrus Logic, Inc. Speaker identification
US11735189B2 (en) 2018-01-23 2023-08-22 Cirrus Logic, Inc. Speaker identification
US11694695B2 (en) 2018-01-23 2023-07-04 Cirrus Logic, Inc. Speaker identification
US10529356B2 (en) 2018-05-15 2020-01-07 Cirrus Logic, Inc. Detecting unwanted audio signal components by comparing signals processed with differing linearity
US10692490B2 (en) 2018-07-31 2020-06-23 Cirrus Logic, Inc. Detection of replay attack
US11631402B2 (en) 2018-07-31 2023-04-18 Cirrus Logic, Inc. Detection of replay attack
US10915614B2 (en) 2018-08-31 2021-02-09 Cirrus Logic, Inc. Biometric authentication
US11748462B2 (en) 2018-08-31 2023-09-05 Cirrus Logic Inc. Biometric authentication
US11037574B2 (en) 2018-09-05 2021-06-15 Cirrus Logic, Inc. Speaker recognition and speaker change detection
EP3902283A4 (en) * 2018-12-19 2022-01-12 NEC Corporation Information processing device, wearable apparatus, information processing method, and storage medium
CN113455017A (en) * 2018-12-19 2021-09-28 日本电气株式会社 Information processing device, wearable device, information processing method, and storage medium
US11895455B2 (en) 2018-12-19 2024-02-06 Nec Corporation Information processing device, wearable device, information processing method, and storage medium

Also Published As

Publication number Publication date
GB201202662D0 (en) 2012-04-04

Similar Documents

Publication Publication Date Title
GB2499781A (en) Acoustic information used to determine a user's mouth state which leads to operation of a voice activity detector
US9165567B2 (en) Systems, methods, and apparatus for speech feature detection
EP2633519B1 (en) Method and apparatus for voice activity detection
US8620672B2 (en) Systems, methods, apparatus, and computer-readable media for phase-based processing of multichannel signal
US9305567B2 (en) Systems and methods for audio signal processing
US8321214B2 (en) Systems, methods, and apparatus for multichannel signal amplitude balancing
US10218327B2 (en) Dynamic enhancement of audio (DAE) in headset systems
EP2599329B1 (en) System, method, apparatus, and computer-readable medium for multi-microphone location-selective processing
KR101444100B1 (en) Noise cancelling method and apparatus from the mixed sound
US8284947B2 (en) Reverberation estimation and suppression system
EP2770750B1 (en) Detecting and switching between noise reduction modes in multi-microphone mobile devices
US9959886B2 (en) Spectral comb voice activity detection
US11290802B1 (en) Voice detection using hearable devices
McLoughlin The use of low-frequency ultrasound for voice activity detection
Haderlein et al. Speech recognition with μ-law companded features on reverberated signals
Moir et al. Knowing the wheat from the weeds in noisy speech

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)