GB2563868A - Sound responsive device and method - Google Patents


Info

Publication number
GB2563868A
GB2563868A
Authority
GB
United Kingdom
Prior art keywords
sound
audio signal
real
world
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB1710286.4A
Other versions
GB2563868B (en)
GB201710286D0 (en)
Inventor
Moorhead Paul
Current Assignee
Kraydel Ltd
Original Assignee
Kraydel Ltd
Priority date
Filing date
Publication date
Application filed by Kraydel Ltd filed Critical Kraydel Ltd
Priority to GB1710286.4A priority Critical patent/GB2563868B/en
Publication of GB201710286D0 publication Critical patent/GB201710286D0/en
Priority to PCT/EP2018/067333 priority patent/WO2019002417A1/en
Publication of GB2563868A publication Critical patent/GB2563868A/en
Application granted granted Critical
Publication of GB2563868B publication Critical patent/GB2563868B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/26 Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

A sound-operated device is controlled, e.g. by voice recognition, and can distinguish between real-world sounds and recorded sounds reproduced via a loudspeaker. The device includes at least one microphone which may be used to determine the origin of the sound and check if this is a known location of a loudspeaker. The device may also recognise characteristics of the sound indicating that it has been subjected to audio recording, audio broadcast and/or audio reproduction processes, e.g. evidence of missing frequencies due to compression, rendering or encoding. A change in bit rate may also distinguish reproduced sounds. The device can therefore discriminate between live speech and recorded speech broadcast through a radio or television, and may respond only when the speech is determined to be real live speech rather than pre-recorded. Reference templates and machine learning may be used to train the device.

Description

Sound Responsive Device and Method
Field of the Invention
The present invention relates to sound responsive devices. In particular, the invention relates to electronic devices for responding to real-world sounds.
Background to the Invention
Electronic devices that understand and respond to spoken commands are becoming common, but issues are frequently encountered where the devices mistake audio from a TV or radio for sound from a live source.
There are also devices that attempt to classify noises in the home and respond appropriately including, for example, recognising gunfire, breaking glass, shouts etc., or even identifying coughs, sneezes, doorbells or telephones. Again, these devices may undesirably treat similar sounds from a TV program as being “real”.
Summary of the Invention

A first aspect of the invention provides a method of operating a sound responsive device comprising at least one microphone for receiving sounds from an environment, the method comprising: detecting a sound at said at least one microphone; producing a corresponding audio signal from the or each microphone; performing audio signal processing on the corresponding audio signal from the or each microphone; determining from said audio signal processing if said sound is a real-world sound; and performing at least one action in response to detection of said sound only if said sound is determined to be a real-world sound.

A second aspect of the invention provides a sound responsive device comprising at least one microphone for receiving sounds from an environment, the device further comprising audio signal processing means configured to perform audio signal processing on audio signals produced by the or each microphone to determine if said sound is a real-world sound, the device being configured to perform at least one action in response to detection of said sound only if said sound is determined to be a real-world sound.
Preferably determining if said sound is a real-world sound comprises determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic corresponding to one or more audio signal processing process.
Said determining typically comprises determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio recording, audio broadcast and/or audio reproduction.
Said determining may involve determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio encoding.
Said determining may involve determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio compression.
Said determining may involve determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more compression artefact.
Said determining may involve determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio rendering by an electronic amplifier and/or loudspeaker.
Said audio signal processing may comprise frequency analysis, and said determining involves determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more frequency characteristic corresponding to one or more audio signal processing process. Said one or more frequency characteristic may comprise a spectral distribution of said audio signal. Said one or more frequency characteristic may comprise a spectral distribution of said audio signal in one or more frequency band that is common to both real-world and non real-world sounds. Said one or more frequency characteristic may comprise an absence of frequency components of said audio signal in one or more frequency bands. Said one or more frequency characteristic may comprise an absence of frequency components of said audio signal in a low frequency band, for example below 500Hz. Said one or more frequency characteristic may comprise an absence of frequency components of said audio signal in a high frequency band, for example above 10kHz.
Said one or more characteristic may comprise one or more bit rate characteristic. Said one or more bit rate characteristic may comprise a change in bit rate. Said one or more bit rate characteristic may comprise use of different bit rates for different frequency bands of the audio signal. Said one or more bit rate characteristic may comprise use of a relatively low bit rate for a relatively low frequency band, for example below 500Hz. Said one or more bit rate characteristic may comprise use of a relatively low bit rate for a relatively high frequency band, for example above 10kHz. Said one or more bit rate characteristic may comprise a change in bit rate, in particular a reduction of the bit rate, after a high intensity signal event.
Said one or more characteristic may comprise a noise floor level. Said one or more characteristic may comprise the noise floor level being above a threshold level.
Said determining may comprise determining if said sound was rendered by a loudspeaker.
Said determining from said audio signal processing if said sound is a real-world sound may involve comparing the audio signal from the or each microphone against at least one reference template, and determining that said sound is not a real-world sound if said audio signal matches said at least one reference template. The or each template may comprise a transfer function template, for example a transfer function template corresponding to any one of an audio recording process, an audio broadcast process and/or an audio reproduction process.
Said one or more characteristics may be derived empirically from training data.
Said training data may comprise data representing pairs of non-processed and corresponding processed sound samples.
Said one or more characteristics may be derived from said training data by machine-learning.
Said determining if said sound is a real-world sound may comprise determining that said sound is not a real-world sound if it emanated from any one of at least one designated location in said environment.
Preferred embodiments employ either one or both of the following approaches to overcome the problem outlined above:
1) recognition by spatial localisation of sound sources;
2) recognising characteristics of sound that indicate that the sound has been subjected to one or more processes, in particular processes associated with audio recording, audio broadcast and/or audio reproduction.
Preferred embodiments of the invention are capable of distinguishing between real-world sounds and pre-recorded or broadcast sound.
Further advantageous features of the invention will be apparent to those ordinarily skilled in the art upon review of the following description of a specific embodiment and with reference to the accompanying drawings.
Brief Description of the Drawings
An embodiment of the invention is now described by way of example and with reference to the accompanying drawings in which:
Figure 1 is a schematic diagram of a room in which a sound responsive device embodying one aspect of the invention is installed;
Figure 2 is a block diagram of the sound responsive device of Figure 1; and
Figure 3 is a flow diagram illustrating a preferred operation of the device of Figure 1.
Detailed Description of the Drawings
Referring now to Figure 1 of the drawings, there is shown a sound responsive device 10 embodying one aspect of the invention. The device 10 is shown installed in a room 12. In the illustrated example the room 12 is a typical living room but this is not limiting to the invention. At least one, but more typically a plurality of loudspeakers 14 are provided in the room 12. The loudspeakers 14 may be part of, or connected to (via wired or wireless connection), one or more electronic device (e.g. a television, radio, audio player, media player, computer, smart speaker) that is capable of providing audio signals to the loudspeakers 14 for rendering to listeners (not shown) in the room. In Figure 1, a television 16 is shown as an example of such an electronic device. Each of the loudspeakers 14 shown in Figure 1 may for example be connected to the TV 16. More generally, the room may contain one or more electronic device connected to, or including, one or more loudspeakers 14. Ideally, the loudspeakers 14 occupy a fixed position in the room 12, or at least a position that does not change frequently. In typical embodiments, the loudspeakers 14 are not part of the sound responsive device 10, although the sound responsive device 10 may have one or more loudspeakers (not shown) of its own. Advantageously the sound responsive device 10 is connectable (by wired or wireless connection) to one or more of the loudspeakers 14.
The sound responsive device 10 may comprise any electronic apparatus or system (not illustrated) that supports speech and/or sound recognition as part of its overall functionality. For example the system/apparatus may comprise a smart speaker, or a voice-controlled TV, audio player, media player or computing device, or a monitoring system that detects sounds in its environment and responds accordingly (e.g. issues an alarm or operates itself or some other equipment accordingly, or takes any other responsive action(s)). The nature of the action(s) taken by the device 10 in response to detecting a sound depends on the overall functionality of the device 10 and may also depend on the type of the detected sound. Accordingly, the device 10 is typically configured to perform classification of received sounds. This may be achieved using any conventional speech recognition and/or sound recognition techniques. The device 10 may be configured to take one or more action only in response to sounds that it recognises as being of a known type as determined by the classification process. The device 10 may be configured to monitor the status of its environment depending on the detected recognised sounds (without necessarily taking action, or taking action depending on the determined status).
The device 10 typically includes a controller 11 for controlling the overall operation of the device 10. The controller 11 may comprise any suitably configured or programmed processor(s), for example a microprocessor, microcontroller or multi-core processor. Typically the controller 11 causes the device 10 to take whichever action(s) are required in response to detection of recognised sounds. The controller 11 may also perform the sound classification or control the operation of a sound classification module as is convenient. Typically the device 10 is implemented using a multi-core processor running a plurality of processes, one of which may be designated as the controller and the others performing the other tasks described herein as required. Each process may be performed in software, hardware or a combination of the two, as is convenient. One or more hardware digital signal processors may be provided to perform one or more of the processes as is convenient and applicable.
Advantageously, the device 10 is capable of distinguishing between real-world sounds and non real-world sounds. In this context a real-world sound is a sound that is created, usually spontaneously, and in real time, by a person, object or event in the environment in which the device 10 is located (which in this example comprises the room 12). As such, real-world sounds typically comprise sounds that have not been processed by any audio signal processing technique and/or that are not pre-recorded. Real-world sounds may also be said to comprise sounds that have not been rendered by a loudspeaker. Examples include live human and animal utterances, including live speech and other noises, crashes, bangs, alarms, bells and so on. In the present context therefore real-world sounds may be referred to as non-processed sounds, or sounds not emanating from a loudspeaker.
Non real-world sounds are typically sounds that have been processed by one or more audio signal processing technique, and may comprise pre-recorded or broadcast sounds. Non real-world sounds are usually rendered by a loudspeaker. Examples include sounds emanating from a TV, radio, audio or media player and so on. Non real-world sounds may be referred to as processed sounds or sounds emanating from a loudspeaker.
Advantageously, the device 10 is capable of distinguishing between real-world sounds and non real-world sounds even if the sounds are of the same type, e.g. distinguishing between live speech or other sounds (e.g. coughs, sneezes or shouts) emanating from a person in the environment and recorded speech or other sounds (e.g. coughs, sneezes or shouts) emanating from a TV or media player.
In preferred embodiments the device 10 is configured to employ either one or both of the following methods to achieve the above aim:
1) recognition of sounds by spatial localisation of sound sources;
2) recognising characteristics of sound that indicate that the sound has been subjected to one or more processes, in particular processes associated with audio recording, audio broadcast and/or audio reproduction, e.g. encoding, decoding, compression, decompression and/or rendering (or reproduction) via electronic amplifier and/or loudspeaker.
Either or both of the above techniques may be used by the device 10 to determine if a detected sound is a real-world sound or a non-real-world sound. In preferred embodiments, the device 10 is configured to respond only to sounds that it has determined to be real-world sounds.
Figure 2 is a block diagram of a typical embodiment of the sound responsive device 10. The device 10 comprises at least one microphone 18. Typical embodiments include two or more (4 or more is preferred) microphones 18 to facilitate determining the location of sound sources. The device 10 comprises an audio signal processor 20 for receiving and processing audio signals produced by the microphones 18 in response to detecting sounds in the room 12 or other environment. The audio signal processor 20 may take any convenient conventional form, being implemented in hardware, software or a combination of hardware and software. Accordingly, the audio signal processor 20 may be implemented by one or more suitably configured ASIC, FPGA or other integrated circuit, and/or a computing device with suitably programmed microprocessor(s). In preferred embodiments the audio signal processor 20 may be configured to perform any one or more of the following audio signal processing functions: frequency spectrum analysis; compression artefact detection; and/or location analysis. The audio signal processor 20 includes components or other means for performing the relevant audio signal processing functions, as indicated in the example of Figure 2 as 22, 24 and 26. Optionally, the audio signal processor 20 may be configured to perform classification of detected sounds using any conventional sound and/or speech recognition techniques.
Location analysis involves identifying one or more locations in the environment corresponding to the source of detected sounds, i.e. spatial localisation of sound sources within the environment. In the present example, this involves determining the location of the loudspeakers 14.
In preferred embodiments where the device 10 has two or more microphones 18, any one or more of several known techniques may be used to locate the source of a sound in space with accuracy, for example: using differential arrival times (phase difference) at each microphone; and/or using the difference in volume level at each microphone (optionally amplified by the use of highly directionally sensitive microphones).
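The differential-arrival-time approach mentioned above can be sketched as a cross-correlation between the signals of a pair of microphones. The following minimal Python sketch assumes time-aligned sample streams from two microphones; the function name and the two-microphone setup are illustrative, not taken from the disclosure:

```python
import numpy as np

def estimate_tdoa(sig_a, sig_b, sample_rate):
    """Estimate the time difference of arrival (in seconds) of a sound at
    two microphones: find the lag that maximises the cross-correlation of
    the two signals. A positive value means the sound reached
    microphone A later than microphone B."""
    corr = np.correlate(sig_a, sig_b, mode="full")
    lag = int(np.argmax(corr)) - (len(sig_b) - 1)  # peak lag in samples
    return lag / sample_rate
```

With the speed of sound at roughly 343 m/s, each microphone pair's time difference constrains the source to a hyperbola; combining the estimates from three or more microphones yields a position estimate.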
The preferred device 10 is operable in a training mode in which it learns the location of one or more non-real-world sound source in its environment. In the present example this involves determining the location of the loudspeakers 14. In the training mode, the device 10 detects sounds using the microphones 18 (or at least two of them) and performs location analysis on the output signals of the microphones 18 to determine the location of one or more loudspeaker or other sound source.
Preferably, in the training mode each loudspeaker 14 or other sound source is operated individually (i.e. one at a time) to produce sound for detection by the device 10. Alternatively, two or more loudspeakers 14 or other sound sources may be operated simultaneously in the training mode (for example where two or more loudspeakers 14 are driven by the same TV or other electronic device). In the training mode, the loudspeakers 14 or other sound source may be operated to produce sounds that they would produce during normal operation, or may be operated to produce one or more test sounds. In preferred embodiments, the device 10 is connectable (by wired or wireless connection as is convenient) to one or more of the sound producing devices (e.g. TV, radio, media player or other device having or being connected to one or more loudspeaker 14) in the environment in order to cause them to generate the sounds during the training mode. Advantageously, the device 10 uses test sounds for this purpose and may store test signals for sending to the sound producing devices for this purpose. For example the test signals may include full 5.1 or 7.1 sound signals to deal with environments with cinema-like loudspeaker installations.
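The training-mode procedure of operating one loudspeaker at a time and recording where its sound appears to come from can be sketched as follows; `learn_speaker_positions`, `locate_sound` and `play_test_sound` are hypothetical names standing in for the device's localisation and speaker-driving machinery:

```python
def learn_speaker_positions(speakers, locate_sound, play_test_sound):
    """Training-mode sketch: drive each loudspeaker in turn with a test
    sound and record the position reported by the microphone array.
    locate_sound and play_test_sound are hypothetical callables standing
    in for the device's localisation and speaker-driving machinery."""
    positions = {}
    for speaker_id in speakers:
        play_test_sound(speaker_id)             # operate one speaker at a time
        positions[speaker_id] = locate_sound()  # localise via the microphones
    return positions
```

The returned mapping of loudspeaker to position can then be consulted in the listening mode to reject sounds emanating from known loudspeaker locations.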
The preferred device 10 is also operable in a listening mode in which it detects real-world sounds in the environment and may take one or more actions in response to detecting a real-world sound. The nature of the actions may depend on a wider functionality of the device 10, or of a system or apparatus of which the device 10 is part. The actions may comprise generating one or more output, for example an audio and/or visual output, and/or one or more output signal for operating one or more other device to which the device 10 is connected or of which it is part. For example the device 10 may be connected to (or be integrated with) a TV or other electronic device and may operate the TV/electronic device depending on one or more detected sounds. The device 10 may be configured to take different actions depending on what sounds are detected. The device 10 itself may be provided with one or more output device (e.g. a loudspeaker, lamp, video screen, klaxon, buzzer or other alarm device or telecommunications device), which it may operate depending on what sounds are detected.
Advantageously, the device 10, upon determining that a detected sound is not a real-world sound, can ignore the detected sound, e.g. take no action in response to the detected sound. Optionally, the device 10 may be configured to take one or more actions in response to detecting non-real-world sounds. Typically such actions are different from those taken in response to detected real-world sounds. In embodiments where the device 10 is configured to classify detected sounds according to multiple sound types (e.g. speech, bangs, doorbells, telephone rings and so on), the device 10 may be configured to take different action (including no action) for real-world sounds and non-real-world sounds even if the sounds are of the same type.
Limitations to the sound source localisation technique include: localising portable devices such as radios or wireless speakers which may be moved regularly; incorrectly ignoring sounds from a person positioned close to one of the locations the device 10 has determined should be ignored; and locating sound sources that are close to the device 10 (e.g. speakers built into a TV set on which the device 10 is located). Such limitations can be mitigated by determining whether or not a detected sound has one or more characteristic indicating that it has been subjected to one or more processes, in particular processes associated with audio recording, audio broadcast and/or audio reproduction, e.g. encoding, decoding, compression and/or rendering via electronic amplifier and/or loudspeaker, rather than being a non-processed, or raw, real-world sound. This analysis can be achieved by performing audio signal processing of the output signals produced by at least one of the microphones 18 when a sound is detected. Analysis of detected sounds to differentiate between processed and non-processed sounds (and therefore between non-real-world and real-world sounds) can be performed in addition to, or instead of, the spatial localisation of sounds described above.
For example, sound broadcast via TV or radio, sounds produced from a CD, DVD or Blu-ray disc, or streamed media sounds have been subjected to one or more audio processes, including any one or more of the following:
A. Encoding
Almost all recorded and/or broadcast sound (barring analogue vinyl records and magnetic tape played directly through an amplifier) has gone through an encoding process. While this can involve high sampling rates and very high-fidelity capture of the original analogue waveform, it will in almost all cases have been subject to a process of band-pass filtering in which sounds at a frequency above or below “normal” hearing ranges have been removed (usually from 20Hz to 20kHz). So, although sound encoded at the sampling rate of a CD or higher is often referred to as “lossless”, in practice not all of the original information is present and inaudible frequencies and harmonics will be missing.
B. Compression
For broadcast, recording and/or reproduction, audio signals will usually have undergone some form of audio compression, e.g. dynamic range compression. There are lossless forms of compression which can be restored to the full original encoding, but in practice audio signals tend to go through a lossy compression process using a codec (coder-decoder) which removes some of the audio information. For example, codecs commonly use a psychoacoustic technique that relies on knowledge of how humans perceive sound. Psychoacoustic codecs compress the sound by removing parts of the sound that humans do not pay attention to, and/or devoting fewer bits of the data stream to capturing parts of the signal which are less important to the human experience than the others. So, for example, a codec might:
1) divide the audio signal into multiple frequency bands and devote fewer bits of the compressed encoding to the highest or lowest frequency bands, where the human ear/brain is less discerning, and more to the range in which normal speech occurs;
2) devote fewer bits to the sound immediately after a loud noise, during which time it is known that the brain is paying less attention;
3) devote fewer bits to frequency ranges with less acoustic energy in the signal, since louder sounds are known to mask quieter sounds in human perception; and/or
4) further remove the highest and lowest frequency sounds, i.e. be more aggressive in removing those frequencies which few people can hear, especially as they get older.
Not all audio codecs make use of psycho-acoustics to an appreciable degree, e.g. the popular Aptx (trade mark) codec provided by Qualcomm. In such cases other techniques such as “dithering” are used to mask the audibly unpleasant artefacts of the compression process, and that in turn raises the noise floor of the signal, which can be detected as an artefact in the audio signal. Hence, compression of an audio signal can lead to the presence of detectable artefacts in the signal that are not necessarily the result of psycho-acoustic compression techniques.
C. Reproduction
When decoding an encoded signal back to renderable sound, an amplifier generates a varying voltage/current to operate a loudspeaker. In practice, both the amplifier’s electrical characteristics and the loudspeaker’s mechanical characteristics leave an imprint on the sound being produced, often referred to as the “transfer function”. In most cases loudspeakers associated with a TV have a limited frequency response, and yet more of the high and low frequencies will be lost.
In a typical broadcast chain, the audio signal is likely to undergo encoding, compression, decoding and decompression at least once, and often more than once, as it passes through the various network links from initial recording, to studio, to transmitter. Different codecs may be used at different stages, so the end result may bear traces of more than one kind of processing.
As a result of any one or more of the above (and/or other) processes, processed sounds commonly have one or more characteristics that non-processed real-world sounds do not have, and vice versa. For example, non-processed real-world sounds tend to include audio signal components at higher and/or lower frequencies than processed sounds such as those emanating from a television or audio system. Also, non-processed real-world sounds tend to have less inherent background noise than processed signals. Further, non-processed real-world sounds tend to have a more natural spread of frequency components than processed sounds. Hence the spectral distribution (which may be referred to as spectral power distribution) of the or each audio signal representing a detected sound can provide an indication of whether the sound is a real-world sound or not.
Even within frequency band(s) that are common to both processed and non-processed sounds, the frequency distribution, i.e. the distribution of the frequency components of the audio signal, and other characteristics of processed sound are detectably different from those of real-world non-processed sounds. Some of these characteristics are complex, e.g. changes in bitrates of encoding (e.g. lower bit rate after a loud noise, or for very high or low frequencies), and introduce identifiable artefacts into the processed audio signals. A processed audio signal may include detectable artefacts arising from any one or more of the processes described above.
Accordingly, any sound (and more particularly any corresponding audio signal representing the sound) detected by the device 10 may be analysed in respect of any one or more signal characteristics in order to identify it as a processed sound or a non-processed sound. The relevant characteristics include, but are not limited to:
i. the frequency content of the audio signal, in particular the presence or absence of signal components in one or more frequency bands, especially a high frequency band (e.g. above 20kHz or above 500kHz) and/or a low frequency band (e.g. below 20Hz or below 50Hz);
ii. the spectral distribution of the audio signal, especially within one or more frequency bands, e.g. between 20Hz and 500kHz, or between 500Hz and 2kHz (or another frequency range deemed to correspond with the human voice);
iii. the bitrate of the audio signal, including the absolute bitrate and/or changes in bitrate. For example this may involve detecting different bitrates being used for different frequency components (in particular relatively low bit rates being used for high (e.g. >15kHz) and/or low (e.g. <500Hz) frequency bands), and/or relatively low bitrates being used after a signal event such as a loud noise (which may be referred to as a high intensity signal event);
iv. the noise floor level of the audio signal, in particular a relatively high noise floor level (e.g. above a threshold value that can be determined from reference data).
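The first of these characteristics, the absence of high-frequency content after band-limited encoding, can be illustrated with a simple spectral-energy check; the cutoff frequency and threshold below are illustrative assumptions, not values taken from the disclosure:

```python
import numpy as np

def band_energy_ratio(signal, sample_rate, lo_hz, hi_hz):
    """Fraction of the signal's spectral energy lying between lo_hz and hi_hz."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    band = (freqs >= lo_hz) & (freqs < hi_hz)
    total = spectrum.sum()
    return spectrum[band].sum() / total if total > 0 else 0.0

def looks_processed(signal, sample_rate, hf_cutoff=15000.0, threshold=1e-4):
    """Heuristic: flag a sound as processed (non-real-world) when there is
    effectively no energy above hf_cutoff, as expected after band-limited
    encoding or rendering through a limited loudspeaker. The cutoff and
    threshold values are illustrative assumptions."""
    hf = band_energy_ratio(signal, sample_rate, hf_cutoff, sample_rate / 2.0)
    return bool(hf < threshold)
```

In practice such a check would be combined with the other characteristics listed above (spectral distribution, bitrate behaviour, noise floor) rather than relied on alone.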
To make efficient use of computational resources, it is preferred to trigger the sound analysis once a minimum sound level and/or duration has been reached. For example, a rolling window of sound (e.g. of up to a few seconds) may be captured continuously from each microphone 18; once the trigger condition(s) has been met, a sound segment of defined duration, commencing with the trigger sound, may be put into a queue for analysis. Any convenient discard strategy (e.g. early, random or other discard) may be employed if the queue grows beyond acceptable limits.
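The rolling-window capture with a triggered, bounded analysis queue might be sketched as follows; the window length, trigger threshold and queue limit are invented for illustration:

```python
import collections
import random

class TriggeredCapture:
    """Rolling pre-trigger window feeding a bounded analysis queue,
    with random discard when the queue overflows (illustrative sketch)."""

    def __init__(self, window=8, threshold=0.5, max_queue=4):
        self.window = collections.deque(maxlen=window)  # rolling sound buffer
        self.threshold = threshold
        self.queue = collections.deque()
        self.max_queue = max_queue

    def push(self, level):
        self.window.append(level)
        if level >= self.threshold:
            # Queue a segment that includes the pre-trigger context.
            self.queue.append(list(self.window))
            # Random discard if the queue grows beyond acceptable limits.
            while len(self.queue) > self.max_queue:
                self.queue.remove(random.choice(self.queue))
```

In a real device the queue entries would be multi-second audio segments rather than single level values, but the buffering and discard logic is the same.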
Figure 3 shows a preferred operation of the device 10 in the listening mode. In step 301 the device 10 captures a sample of detected sound from the output of one or more microphone 18 in response to the trigger condition(s) being met.
In step 302 the device 10 performs location analysis on the detected sound as described above. This may involve determining the location of the sound’s source using the phase difference between corresponding signals captured from at least two microphones 18 and/or the sound intensity difference between corresponding signals captured from at least two microphones 18, and may depend on the directional sensitivity of the or each relevant microphone 18. Sounds that are determined as having emanated from the location of a known loudspeaker 14 (as determined during the training mode) can be rejected, i.e. ignored, or marked as being suspected of being a non-real-world sound.
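The phase-difference part of step 302 amounts to estimating the inter-microphone time delay. A minimal pure-Python sketch using brute-force cross-correlation (the function names and the two-microphone setup are assumptions for illustration):

```python
def best_lag(a, b, max_lag):
    """Lag of `b` relative to `a` (in samples) that maximises the
    cross-correlation; a positive lag means the sound reached
    microphone `a` first."""
    def corr(lag):
        return sum(a[i] * b[i + lag]
                   for i in range(len(a)) if 0 <= i + lag < len(b))
    return max(range(-max_lag, max_lag + 1), key=corr)

def path_difference_m(lag_samples, fs, speed_of_sound=343.0):
    """Convert an inter-microphone lag to a path-length difference in metres."""
    return lag_samples / fs * speed_of_sound
```

Comparing the measured lag against the lags recorded for the loudspeaker positions during the training mode lets the device decide whether the sound plausibly emanated from a known loudspeaker.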
In step 303 the device 10 performs transfer function, or frequency spectrum, analysis of the detected sound to identify one or more frequency characteristics that are indicative of it being either a real-world sound or a non-real-world sound. Typically this involves determining that the sound is a processed, or non-real-world, sound if it lacks high and/or low frequency components that are commonly removed by audio encoding and/or by rendering via an amplifier and/or loudspeaker. Sounds that are determined as having been processed can be rejected, i.e. ignored, or marked as being suspected of being a non-real-world sound.
Alternatively or in addition, the transfer function analysis may involve comparing the sound sample (conveniently a transfer function representing the sound sample) against one or more transfer function template associated with audio recording, audio broadcast and/or audio reproduction. Any audio playback system will have a transfer response h(t) and a corresponding frequency domain response H(s). Playing the audio source signal sig(t) through the system will convolve sig(t) with h(t), or, in a frequency domain representation, multiply SIG(s) (the frequency domain representation of sig(t)) by H(s). For a given transient signal that has sufficient bandwidth across the region of H(s) where there is maximal variability, it is possible to recover an estimate of the multiplicative envelope of H(s) through parameter fitting, producing an estimate with some measure of certainty that the source signal was altered by reproduction through a rebroadcast system. The fitting can use any number of standard parametric techniques. The transfer functions for broadcast compression and Blu-ray encoding can, for example, be used as templates. Such templates are best suited to transients such as gunshots, breaking glass or TV screams; they will be less effective for narrower-band sounds such as vehicle noise or human speech that does not have significant variability (inflection or emotion).
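In its simplest form, the parametric fitting described above can be reduced to a least-squares gain fit of observed band magnitudes against a stored template envelope |H|, with the normalised residual serving as the "measure of certainty". The template values below are invented purely for illustration:

```python
def template_match(observed, template):
    """Least-squares fit of a single gain `g` so that observed ~= g * template.
    Returns (g, normalised residual); a small residual suggests the sound
    passed through a system with this transfer-function envelope."""
    num = sum(o * t for o, t in zip(observed, template))
    den = sum(t * t for t in template)
    g = num / den
    resid = sum((o - g * t) ** 2 for o, t in zip(observed, template))
    return g, resid / sum(o * o for o in observed)
```

A practical implementation would fit several parameters (e.g. cutoff frequency, roll-off slope) per template rather than a single gain, but the residual-based accept/reject decision is the same.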
In step 304 the device 10 looks for artefacts in the detected sound (i.e. in the corresponding frequency spectrum and/or waveform of the corresponding audio signal) which indicate that the sound has been subjected to audio compression, e.g. psycho-acoustic compression or another compression technique. This may involve identifying relatively low bit rate encoding in high and/or low frequency bands, and/or a reduction in encoding quality after a loud noise, and/or a noise floor level that can be associated with compression. Sounds that are determined as having been subjected to compression can be rejected, i.e. ignored, or marked as being suspected of being a non-real-world sound.
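The noise-floor part of step 304 can be sketched with a simple low-percentile estimator over short frames; the percentile choice and threshold handling are illustrative assumptions, not the patent's method:

```python
import math

def noise_floor(frames):
    """Estimate the noise floor as the ~10th-percentile frame RMS
    (a simple illustrative estimator)."""
    rms = sorted(math.sqrt(sum(x * x for x in f) / len(f)) for f in frames)
    return rms[len(rms) // 10]

def suspect_compression(frames, threshold):
    """Flag the sample if its noise floor exceeds a reference threshold,
    e.g. one determined from reference data for the environment."""
    return noise_floor(frames) > threshold
```

Taking a low percentile rather than the minimum makes the estimate robust to a single anomalously quiet frame, while still ignoring the loud frames that carry the sound event itself.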
Sounds can be deemed to be processed or non-real-world sounds (and therefore ignored or rejected) upon being identified as such by any one of steps 302, 303 or 304, or alternatively upon being identified as such by any two or more of steps 302, 303 and 304. Any determinations made by the audio signal processor 20 in this regard may be communicated to the controller 11 which may make the decision on whether or not to ignore the detected sound and/or to determine which actions are to be taken in response to the detected sound.
It is noted that the sequence of steps 302, 303 and 304 in Figure 3 is illustrative; in alternative embodiments these (and/or other) steps may be performed in different orders, merged, and/or operated in parallel, depending on the requirements of the application and the capabilities of the device 10.
In preferred embodiments the device 10 combines the techniques of location analysis, spectrum analysis and artefact detection. In alternative embodiments any one of these techniques may be used on its own, or in combination with any other of the techniques. For example, spectrum analysis and artefact detection in particular may each be sufficient on its own to achieve an effective level of specificity for a given use-case.
It is noted that there are at least two approaches to the implementation of spectrum analysis and artefact detection:
1) development of one or more specific algorithm to detect the or each relevant signal characteristic, for example based on analysis of reference data; and
2) training using machine-learning techniques. This may, for example, involve training the device 10 with training data comprising pairs of sound samples: an original sound generated live (or a very high-fidelity or artefact-free recording), and the same sound after typical encoding, compression and/or reproduction.
The second approach does not generate an algorithm as such, and it may not be apparent how the system achieves its resulting levels of effective differentiation. The machine-learning approach may also collapse steps in the processing: in other words, it may not be necessary to separately look for spectrum differences and compression artefacts; a trained system may simply learn the difference between processed and non-processed sounds using whatever characteristics it finds to be most capable of allowing the distinction to be made. The machine-learning approach may involve providing the device 10 with reference real-world and non-real-world sounds, the device 10 being configured through machine-learning to develop its own criteria empirically for distinguishing between them. These criteria may involve elements of location, spectral distribution and artefacts, and may differ for different types of input sound.
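A toy version of the machine-learning route can be sketched as a nearest-centroid classifier trained on feature vectors extracted from the paired samples; the two-element features (e.g. a high-band energy ratio and a noise-floor measure) and the class labels are purely illustrative:

```python
def _centroid(rows):
    """Mean feature vector of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(len(rows[0]))]

def train(real_features, processed_features):
    """Learn one centroid per class from paired training examples."""
    return _centroid(real_features), _centroid(processed_features)

def classify(model, feature):
    """Label a feature vector by its nearest class centroid."""
    c_real, c_proc = model
    d_real = sum((a - b) ** 2 for a, b in zip(feature, c_real))
    d_proc = sum((a - b) ** 2 for a, b in zip(feature, c_proc))
    return "processed" if d_proc < d_real else "real-world"
```

A deployed system would more plausibly use a neural network over raw spectra, but the training-on-pairs idea, and the fact that the learned decision criteria are not explicitly programmed, is the same.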
An example of a practical application of the device 10 is now described for illustration purposes. In this example it is assumed that the device 10 is installed in the home of a vulnerable person to monitor their health and safety. For maximum visibility of the room 12 being monitored, the device 10 is positioned on top of the TV set 16.
The device 10 is intended to monitor for coughs, sneezes, cries for help, sounds of danger and other noises, but in a normal home the TV is likely to be active for several hours a day and to generate many similar artificial sound events.
The device 10 has a plurality of microphones 18 and audio signal processing circuitry 20 configured to perform the following:
• separate processing of the audio signal from each microphone 18;
• capture of audio input samples in response to detection of a trigger signal, e.g. when sound exceeding a trigger intensity and/or duration is detected;
• measurement of the phase shift between corresponding sound samples from each (or at least two) microphone 18;
• audio signal analysis of each sample, which may involve transfer function analysis and/or artefact detection.
During the training mode the device 10 determines the position of the loudspeakers 14 within the room 12, preferably by playing test signals through the television (e.g. via HDMI or other connection) and detecting the corresponding sounds rendered by the loudspeakers 14 using the microphones 18. At a minimum it is preferred that alternate left and right channel test signals are used, but more preferably test signals for 2.1, 5.1 and 7.1 sound set-ups are used, selecting channels and frequencies as appropriate.
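The training mode might be sketched as follows, using the inter-microphone energy balance of each test channel as a crude location signature; `play_and_record` is a hypothetical callback (it plays a test tone on one channel and returns the two microphone recordings), and the tolerance value is an assumption:

```python
def learn_speaker_signatures(play_and_record, channels):
    """Training mode: for each speaker channel, play a test signal and store
    the left/right microphone energy balance as that speaker's signature."""
    sigs = {}
    for ch in channels:
        mic_a, mic_b = play_and_record(ch)
        ea = sum(x * x for x in mic_a)
        eb = sum(x * x for x in mic_b)
        sigs[ch] = ea / (ea + eb)  # 0..1 balance between the two microphones
    return sigs

def matches_known_speaker(sigs, mic_a, mic_b, tol=0.05):
    """Listening mode: does a new sound's balance match any learned speaker?"""
    ea = sum(x * x for x in mic_a)
    eb = sum(x * x for x in mic_b)
    bal = ea / (ea + eb)
    return any(abs(bal - s) < tol for s in sigs.values())
```

A real device would favour phase-difference (time-delay) signatures over energy balance, as described earlier, but the learn-then-match structure of the two modes is the same.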
During the listening mode, the device 10 can perform location analysis and reject sounds from the designated speaker locations. Alternatively, sounds from those locations can simply be marked as “suspect” and further processed before a final decision is made, for example based on weighted probabilities from each phase of analysis.
The invention is not limited to the embodiment(s) described herein but can be amended or modified without departing from the scope of the present invention.

Claims (30)

CLAIMS:
1. A method of operating a sound responsive device comprising at least one microphone for receiving sounds from an environment, the method comprising: detecting a sound at said at least one microphone; producing a corresponding audio signal from the or each microphone; performing audio signal processing on the corresponding audio signal from the or each microphone; determining from said audio signal processing if said sound is a real-world sound; performing at least one action in response to detection of said sound only if said sound is determined to be a real-world sound.
2. The method of claim 1, wherein said determining if said sound is a real-world sound comprises determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic corresponding to one or more audio signal processing process.
3. The method of claim 2, wherein said determining comprises determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio recording, audio broadcast and/or audio reproduction.
4. The method of claim 2 or 3 wherein said determining comprises determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio encoding.
5. The method of any one of claims 2 to 4 wherein said determining comprises determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio compression.
6. The method of claim 5, wherein said determining comprises determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more compression artefact.
7. The method of any one of claims 2 to 6 wherein said determining comprises determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio rendering by an electronic amplifier and/or loudspeaker.
8. The method of any one of claims 2 to 7, wherein said audio signal processing comprises frequency analysis, and said determining comprises determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more frequency characteristic corresponding to one or more audio signal processing process.
9. The method of claim 8 wherein said one or more frequency characteristic comprises a spectral distribution of said audio signal.
10. The method of claim 9 wherein said one or more frequency characteristic comprises a spectral distribution of said audio signal in one or more frequency band that is common to both real-world and non real-world sounds.
11. The method of claim 10 wherein said one or more frequency band comprises the frequency band from 20Hz to 500Hz, or from 500Hz to 50kHz.
12. The method of any one of claims 8 to 11 wherein said one or more frequency characteristic comprises an absence of frequency components of said audio signal in one or more frequency bands.
13. The method of claim 12 wherein said one or more frequency characteristic comprises an absence of frequency components of said audio signal in a low frequency band, for example below 500Hz.
14. The method of claim 12 or 13 wherein said one or more frequency characteristic comprises an absence of frequency components of said audio signal in a high frequency band, for example above 10kHz.
15. The method of any one of claims 2 to 14 wherein said one or more characteristic comprises one or more bit rate characteristic.
16. The method of claim 15 wherein said one or more bit rate characteristic comprises a change in bit rate.
17. The method of claim 15 or 16 wherein said one or more bit rate characteristic comprises use of different bit rates for different frequency bands of the audio signal.
18. The method of claim 17 wherein said one or more bit rate characteristic comprises use of a relatively low bit rate for a relatively low frequency band, for example below 500Hz.
19. The method of claim 17 or 18 wherein said one or more bit rate characteristic comprises use of a relatively low bit rate for a relatively high frequency band, for example above 10kHz.
20. The method of claim 15 wherein said one or more bit rate characteristic comprises a change in bit rate, in particular a reduction of the bit rate, after a high intensity signal event.
21. The method of any one of claims 2 to 20, wherein said one or more characteristic comprises noise floor level.
22. The method of claim 21, wherein said one or more characteristic comprises the noise floor level being above a threshold level.
23. The method of any preceding claim wherein said determining comprises determining if said sound was rendered by a loudspeaker.
24. The method of any preceding claim wherein said determining from said audio signal processing if said sound is a real-world sound comprises comparing the audio signal from the or each microphone against at least one reference template, and determining that said sound is not a real world sound if said audio signal matches said at least one reference template.
25. The method of claim 24 wherein the or each template is a transfer function template, for example a transfer function template corresponding to any one of an audio recording process, an audio broadcast process and/or an audio reproduction process.
26. The method of any one of claims 2 to 25 wherein said one or more characteristics are derived empirically from training data.
27. The method of claim 26 wherein said training data comprises data representing pairs of non-processed and corresponding processed sound samples.
28. The method of claim 26 or 27 wherein said one or more characteristics are derived from said training data by machine-learning.
29. The method of any preceding claim wherein said determining if said sound is a real-world sound comprises determining that said sound is not a real-world sound if it emanated from any one of at least one designated location in said environment.
30. A sound responsive device comprising at least one microphone for receiving sounds from an environment, the device further comprising audio signal processing means configured to perform audio signal processing on audio signals produced by the or each microphone to determine if said sound is a real-world sound, the device being configured to perform at least one action in response to detection of said sound only if said sound is determined to be a real-world sound.
GB1710286.4A 2017-06-28 2017-06-28 Sound responsive device and method Active GB2563868B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB1710286.4A GB2563868B (en) 2017-06-28 2017-06-28 Sound responsive device and method
PCT/EP2018/067333 WO2019002417A1 (en) 2017-06-28 2018-06-27 Sound responsive device and method


Publications (3)

Publication Number Publication Date
GB201710286D0 GB201710286D0 (en) 2017-08-09
GB2563868A true GB2563868A (en) 2019-01-02
GB2563868B GB2563868B (en) 2020-02-19

Family

ID=59523583




Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1998034216A2 (en) * 1997-01-31 1998-08-06 T-Netix, Inc. System and method for detecting a recorded voice
JP2005250233A (en) * 2004-03-05 2005-09-15 Sanyo Electric Co Ltd Robot device


