WO2019002417A1 - Sound responsive device and method - Google Patents

Sound responsive device and method

Info

Publication number
WO2019002417A1
WO2019002417A1 (PCT/EP2018/067333)
Authority
WO
WIPO (PCT)
Prior art keywords
sound; audio signal; real-world; determining
Prior art date
Application number
PCT/EP2018/067333
Other languages
English (en)
Inventor
Paul MOORHEAD
Original Assignee
Kraydel Limited
Priority date
Filing date
Publication date
Application filed by Kraydel Limited filed Critical Kraydel Limited
Publication of WO2019002417A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L15/00: Speech recognition
    • G10L15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L17/00: Speaker identification or verification
    • G10L17/26: Recognition of special voice characteristics, e.g. for use in lie detectors; recognition of animal voices

Definitions

  • the present invention relates to sound responsive devices.
  • the invention relates to electronic devices for responding to real-world sounds.
  • a first aspect of the invention provides a method of operating a sound responsive device comprising at least one microphone for receiving sounds from an environment, the method comprising: detecting a sound at said at least one microphone;
  • a second aspect of the invention provides a sound responsive device comprising at least one microphone for receiving sounds from an environment, the device further comprising audio signal processing means configured to perform audio signal processing on audio signals produced by the or each microphone to determine if said sound is a real-world sound, the device being configured to perform at least one action in response to detection of said sound only if said sound is determined to be a real-world sound.
  • determining if said sound is a real-world sound comprises determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic corresponding to one or more audio signal processing process.
  • Said determining typically comprises determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio recording, audio broadcast and/or audio reproduction.
  • Said determining may involve determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio encoding. Said determining may involve determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio compression.
  • Said determining may involve determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more compression artefact.
  • Said determining may involve determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio rendering by an electronic amplifier and/or loudspeaker.
  • Said audio signal processing may comprise frequency analysis, and said determining may involve determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more frequency characteristic corresponding to one or more audio signal processing process.
  • Said one or more frequency characteristic may comprise a spectral distribution of said audio signal.
  • Said one or more frequency characteristic may comprise a spectral distribution of said audio signal in one or more frequency band that is common to both real-world and non real-world sounds.
  • Said one or more frequency characteristic may comprise an absence of frequency components of said audio signal in one or more frequency bands.
  • Said one or more frequency characteristic may comprise an absence of frequency components of said audio signal in a low frequency band, for example below 500Hz.
  • Said one or more frequency characteristic may comprise an absence of frequency components of said audio signal in a high frequency band, for example above 10kHz.
  • Said one or more characteristic may comprise one or more bit rate characteristic.
  • Said one or more bit rate characteristic may comprise a change in bit rate.
  • Said one or more bit rate characteristic may comprise use of different bit rates for different frequency bands of the audio signal.
  • Said one or more bit rate characteristic may comprise use of a relatively low bit rate for a relatively low frequency band, for example below 500Hz.
  • Said one or more bit rate characteristic may comprise use of a relatively low bit rate for a relatively high frequency band, for example above 10kHz.
  • Said one or more bit rate characteristic may comprise a change in bit rate, in particular a reduction of the bit rate, after a high intensity signal event.
  • Said one or more characteristic may comprise a noise floor level.
  • Said one or more characteristic may comprise the noise floor level being above a threshold level.
  • Said determining may comprise determining if said sound was rendered by a loudspeaker.
  • Said determining from said audio signal processing if said sound is a real-world sound may involve comparing the audio signal from the or each microphone against at least one reference template, and determining that said sound is not a real-world sound if said audio signal matches said at least one reference template.
  • the or each template may comprise a transfer function template, for example a transfer function template corresponding to any one of an audio recording process, an audio broadcast process and/or an audio reproduction process.
  • Said one or more characteristics may be derived empirically from training data.
  • Said training data may comprise data representing pairs of non-processed and corresponding processed sound samples.
  • Said one or more characteristics may be derived from said training data by machine-learning.
  • Said determining if said sound is a real-world sound may comprise determining that said sound is not a real-world sound if it emanated from any one of at least one designated location in said environment.
  • Preferred embodiments employ either one or both of the following approaches to overcome the problem outlined above: 1) recognition by spatial localisation of sound sources; 2) recognition of characteristics indicating that a sound has been subjected to audio signal processing.
  • Figure 1 is a schematic diagram of a room in which a sound responsive device embodying one aspect of the invention is installed;
  • Figure 2 is a block diagram of the sound responsive device of Figure 1 ;
  • Figure 3 is a flow diagram illustrating a preferred operation of the device of Figure 1.

Detailed Description of the Drawings
  • the device 10 is shown installed in a room 12.
  • the room 12 is a typical living room but this is not limiting to the invention.
  • At least one, but more typically a plurality of loudspeakers 14 are provided in the room 12.
  • the loudspeakers 14 may be part of, or connected to (via wired or wireless connection), one or more electronic device (e.g. a television, radio, audio player, media player, computer, smart speaker) that is capable of providing audio signals to the loudspeakers 14 for rendering to listeners (not shown) in the room.
  • a television 16 is shown as an example of such an electronic device.
  • Each of the loudspeakers 14 shown in Figure 1 may for example be connected to the TV 16.
  • the room may contain one or more electronic device connected to, or including, one or more loudspeakers 14.
  • the loudspeakers 14 occupy a fixed position in the room 12, or at least a position that does not change frequently.
  • the loudspeakers 14 are not part of the sound responsive device 10, although the sound responsive device 10 may have one or more loudspeakers (not shown) of its own.
  • the sound responsive device 10 is connectable (by wired or wireless connection) to one or more of the loudspeakers 14.
  • the sound responsive device 10 may comprise any electronic apparatus or system (not illustrated) that supports speech and/or sound recognition as part of its overall functionality.
  • the system/apparatus may comprise a smart speaker, or a voice-controlled TV, audio player, media player or computing device, or a monitoring system that detects sounds in its environment and responds accordingly (e.g. issues an alarm or operates itself or some other equipment accordingly, or takes any other responsive action(s)).
  • the nature of the action(s) taken by the device 10 in response to detecting a sound depends on the overall functionality of the device 10 and may also depend on the type of the detected sound. Accordingly, the device 10 is typically configured to perform classification of received sounds. This may be achieved using any conventional speech recognition and/or sound recognition techniques.
  • the device 10 may be configured to take one or more action only in response to sounds that it recognises as being of a known type as determined by the classification process.
  • the device 10 may be configured to monitor the status of its environment depending on the detected recognised sounds (without necessarily taking action, or taking action depending on the determined status).
  • the device 10 typically includes a controller 11 for controlling the overall operation of the device 10.
  • the controller 11 may comprise any suitably configured or programmed processor(s), for example a microprocessor, microcontroller or multi-core processor. Typically the controller 11 causes the device 10 to take whichever action(s) are required in response to detection of recognised sounds.
  • the controller 11 may also perform the sound classification or control the operation of a sound classification module as is convenient.
  • the device 10 is implemented using a multi-core processor running a plurality of processes, one of which may be designated as the controller and the others performing the other tasks described herein as required. Each process may be performed in software, hardware or a combination of the two as is convenient. One or more hardware digital signal processors may be provided to perform one or more of the processes as is convenient and applicable.
  • the device 10 is capable of distinguishing between real-world sounds and non real- world sounds.
  • a real-world sound is a sound that is created, usually spontaneously, in the environment (which in this example comprises the room 12) in which the device 10 is located by a person, object or event in real time.
  • real-world sounds typically comprise sounds that have not been processed by any audio signal processing technique and/or that are not pre-recorded.
  • Real-world sounds may also be said to comprise sounds that have not been rendered by a loudspeaker. Examples include live human and animal utterances, including live speech and other noises, crashes, bangs, alarms, bells and so on. In the present context therefore real-world sounds may be referred to as non-processed sounds, or sounds not emanating from a loudspeaker.
  • Non real-world sounds are typically sounds that have been processed by one or more audio signal processing technique, and may comprise pre-recorded or broadcast sounds.
  • Non real-world sounds are usually rendered by a loudspeaker. Examples include sounds emanating from a TV, radio, audio or media player and so on.
  • Non real-world sounds may be referred to as processed sounds or sounds emanating from a loudspeaker.
  • the device 10 is capable of distinguishing between real-world sounds and non real- world sounds even if the sounds are of the same type, e.g. distinguishing between live speech or other sounds (e.g. coughs, sneezes or shouts) emanating from a person in the environment and recorded speech or other sounds (e.g. coughs, sneezes or shouts) emanating from a TV or media player.
  • the device 10 is configured to employ either one or both of the following methods to achieve the above aim: 1) recognition of sounds by spatial localisation of sound sources; 2) recognition of characteristics indicating that a sound has been subjected to audio signal processing.
  • either or both methods may be used by the device 10 to determine if a detected sound is a real-world sound or a non-real-world sound.
  • the device 10 is configured to respond only to sounds that it has determined to be real-world sounds.
  • FIG. 2 is a block diagram of a typical embodiment of the sound responsive device 10.
  • the device 10 comprises at least one microphone 18. Typical embodiments include two or more (4 or more is preferred) microphones 18 to facilitate determining the location of sound sources.
  • the device 10 comprises an audio signal processor 20 for receiving and processing audio signals produced by the microphones 18 in response to detecting sounds in the room 12 or other environment.
  • the audio signal processor 20 may take any convenient conventional form, being implemented in hardware, software or a combination of hardware and software. Accordingly, the audio signal processor 20 may be implemented by one or more suitably configured ASIC, FPGA or other integrated circuit, and/or a computing device with suitably programmed microprocessor(s).
  • the audio signal processor 20 may be configured to perform any one or more of the following audio signal processing functions: frequency spectrum analysis; compression artefact detection; and/or location analysis.
  • the audio signal processor 20 includes components or other means for performing the relevant audio signal processing functions, as indicated in the example of Figure 2 as 22, 24 and 26.
  • the audio signal processor 20 may be configured to perform classification of detected sounds using any conventional sound and/or speech recognition techniques.
  • Location analysis involves identifying one or more locations in the environment corresponding to the source of detected sounds, i.e. spatial localisation of sound sources within the environment. In the present example, this involves determining the location of the loudspeakers 14.
  • any one or more of several known techniques may be used to locate the source of a sound in space with accuracy, for example: using differential arrival times (phase difference) at each microphone; and/or using the difference in volume level at each microphone (optionally amplified by the use of highly directionally sensitive microphones).
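The differential-arrival-time technique mentioned above can be sketched as follows. This is an illustrative NumPy sketch, not code from the patent; the function and constant names are our own, and a practical device would add windowing, sub-sample interpolation and multi-microphone triangulation on top of this.

```python
# Hypothetical sketch: estimating the time difference of arrival (TDOA) of
# a sound at two microphones from the peak of their cross-correlation.
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C

def tdoa(sig_a, sig_b, sample_rate):
    """Seconds by which sig_a lags sig_b, from the cross-correlation peak."""
    corr = np.correlate(sig_a, sig_b, mode="full")
    lag = np.argmax(corr) - (len(sig_b) - 1)  # lag in samples
    return lag / sample_rate

# Synthetic check: the same pulse, delayed by 5 samples on microphone B.
rate = 16_000
pulse = np.zeros(256)
pulse[100:110] = 1.0
mic_a = pulse
mic_b = np.roll(pulse, 5)                 # sound reaches B 5 samples later
delay = tdoa(mic_b, mic_a, rate)          # 5 / 16000 seconds
path_difference = delay * SPEED_OF_SOUND  # extra distance travelled to B
```

With two or more such pairwise delays (or the per-microphone level differences the text also mentions), the source position can be triangulated and compared against the learned loudspeaker locations.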
  • the preferred device 10 is operable in a training mode in which it learns the location of one or more non-real-world sound source in its environment. In the present example this involves determining the location of the loudspeakers 14.
  • the device 10 detects sounds using the microphones 18 (or at least two of them) and performs location analysis on the output signals of the microphones 18 to determine the location of one or more loudspeaker or other sound source.
  • each loudspeaker 14 or other sound source is operated individually (i.e. one at a time) to produce sound for detection by the device 10.
  • two or more loudspeakers 14 or other sound sources may be operated simultaneously in the training mode (for example where two or more loudspeakers 14 are driven by the same TV or other electronic device).
  • the loudspeakers 14 or other sound source may be operated to produce sounds that they would produce during normal operation, or may be operated to produce one or more test sounds.
  • the device 10 is connectable (by wired or wireless connection as is convenient) to one or more of the sound producing devices (e.g. TV, radio, media player or other device having or being connected to one or more loudspeaker 14) in the environment in order to cause them to generate the sounds during the training mode.
  • the device 10 uses test sounds for this purpose and may store test signals for sending to the sound producing devices for this purpose.
  • test signals may include full 5.1 or 7.1 sound signals to deal with environments with cinema-like loudspeaker installations.
  • the preferred device 10 is also operable in a listening mode in which it detects real-world sounds in the environment and may take one or more actions in response to detecting a real-world sound.
  • the nature of the actions may depend on a wider functionality of the device 10, or of a system or apparatus of which the device 10 is part.
  • the actions may comprise generating one or more output, for example an audio and/or visual output, and/or one or more output signal for operating one or more other device to which the device 10 is connected or of which it is part.
  • the device 10 may be connected to (or be integrated with) a TV or other electronic device and may operate the TV/electronic device depending on one or more detected sounds.
  • the device 10 may be configured to take different actions depending on what sounds are detected.
  • the device 10 itself may be provided with one or more output device (e.g. a loudspeaker, lamp, video screen, klaxon, buzzer or other alarm device or telecommunications device), which it may operate depending on what sounds are detected.
  • the device 10 upon determining that a detected sound is not a real-world sound, can ignore the detected sound, e.g. take no action in response to the detected sound.
  • the device 10 may be configured to take one or more actions in response to detecting non-real-world sounds. Typically such actions are different from those taken in response to detected real-world sounds.
  • the device 10 may be configured to take different action (including no action) for real-world sounds and non-real-world sounds even if the sounds are of the same type.
  • Limitations to the sound source localisation technique include: localising portable devices such as radios or wireless speakers which may be moved regularly; incorrectly ignoring sounds from a person positioned close to one of the locations the device 10 has determined should be ignored; and locating sound sources that are close to the device 10 (e.g. speakers built into a TV set on which the device 10 is located).
  • Such limitations can be mitigated by determining whether or not a detected sound has one or more characteristic indicating that it has been subjected to one or more processes, in particular processes associated with audio recording, audio broadcast and/or audio reproduction, e.g. encoding, decoding, compression and/or rendering via electronic amplifier and/or loudspeaker, rather than being a non-processed, or raw, real-world sound.
  • This analysis can be achieved by performing audio signal processing of the output signals produced by at least one of the microphones 18 when a sound is detected. Analysis of detected sounds to differentiate between processed and non-processed sounds (and therefore between non-real-world and real-world sounds) can be performed in addition to, or instead of, the spatial localisation of sounds described above.
  • sounds broadcast via TV or radio, sounds produced from a CD, DVD or Blu-ray disc, and streamed media sounds have been subjected to one or more audio processes, including any one or more of the following:
  • audio signals will usually have undergone some form of audio compression, e.g. dynamic range compression.
  • the audio signal will usually have been encoded by a codec (coder-decoder).
  • codecs commonly use a psychoacoustic technique that relies on knowledge of how humans perceive sound. Psychoacoustic codecs compress the sound by removing parts of the sound that humans do not pay attention to, and/or devoting fewer bits of the data stream to capturing parts of the signal which are less important to the human experience than the others. So, for example, a codec might devote fewer bits to very high or very low frequencies, or reduce the bit rate immediately after a loud sound.
  • When decoding an encoded signal back to renderable sound, an amplifier generates a varying voltage/current to operate a loudspeaker.
  • both the amplifier's electrical characteristics, and the loudspeaker's mechanical characteristics leave an imprint on the sound being produced - often referred to as the "transfer function".
  • loudspeakers associated with a TV typically have a limited frequency response, so yet more of the high and low frequencies will be lost.
  • the audio signal is likely to undergo encoding, compression, decoding and decompression at least once and often more than once as it passes through the various network links from initial recording, to studio to transmitter.
  • Different codecs may be used at different stages so the end result may bear traces of more than one kind of processing.
  • processed sounds commonly have one or more characteristics that non-processed real-world sounds do not have, and vice versa.
  • non-processed real-world sounds tend to include audio signal components at higher and/or lower frequencies than processed sounds such as those emanating from a television or audio system.
  • non-processed real-world sounds tend to have less inherent background noise than processed signals.
  • non-processed real-world sounds tend to have a more natural spread of frequency components than processed sounds.
  • the spectral distribution (which may be referred to as spectral power distribution) of the or each audio signal representing a detected sound can provide an indication of whether the sound is a real world sound or not.
  • the frequency distribution, i.e. the distribution of the frequency components of the audio signal, and other characteristics of processed sound are detectably different from those of real-world non-processed sounds. Some of these characteristics are complex, e.g. changes in the bit rate of encoding (e.g. a lower bit rate after a loud noise, or for very high or low frequencies), and introduce identifiable artefacts into the processed audio signals.
  • a processed audio signal may include detectable artefacts arising from any one or more of the processes described above. Accordingly, any sound (and more particularly any corresponding audio signal representing the sound) detected by the device 10 may be analysed in respect of any one or more signal characteristic.
  • the relevant characteristics include, but are not limited to:
  • the frequency content of the audio signal, in particular the presence or absence of signal components in one or more frequency bands, especially a high frequency band (e.g. above 20kHz or above 500kHz) and/or a low frequency band (e.g. below 20Hz or below 50Hz).
  • the spectral distribution of the audio signal, especially within one or more frequency bands, e.g. between 20Hz and 500kHz, between 500Hz and 2kHz, or from 500Hz to 50kHz (or another frequency range, e.g. a frequency range deemed to correspond with the human voice).
  • the bit rate of the audio signal, including the absolute bit rate and/or changes in bit rate.
  • this may involve detecting different bit rates being used for different frequency components (in particular relatively low bit rates being used for high (e.g. >15kHz) and/or low (e.g. <500Hz) frequency bands), and/or relatively low bit rates being used after a signal event such as a loud noise (which may be referred to as a high intensity signal event).
  • the noise floor level of the audio signal, in particular a relatively high noise floor level (e.g. above a threshold value that can be determined from reference data).
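By way of illustration, a missing-band test like the first characteristic above might be sketched as follows in NumPy. The cutoffs and thresholds (`hf_cutoff`, `lf_cutoff`, `hf_floor`, `lf_floor`) are placeholder values for the demo, not values the patent specifies.

```python
# Illustrative sketch (not from the patent): flag a sound as "processed"
# when it carries almost no spectral energy in the high and low bands that
# codecs and TV loudspeakers commonly discard.
import numpy as np

def band_energy_fraction(signal, rate, lo, hi):
    """Fraction of total spectral power lying between lo and hi Hz."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / rate)
    band = (freqs >= lo) & (freqs < hi)
    return power[band].sum() / max(power.sum(), 1e-12)

def looks_processed(signal, rate, hf_cutoff=10_000, lf_cutoff=100,
                    hf_floor=1e-4, lf_floor=1e-4):
    hf = band_energy_fraction(signal, rate, hf_cutoff, rate / 2)
    lf = band_energy_fraction(signal, rate, 0, lf_cutoff)
    return hf < hf_floor and lf < lf_floor

rate = 44_100
t = np.arange(rate) / rate
rng = np.random.default_rng(0)
raw = rng.normal(size=rate)           # broadband, "real-world"-like noise
tone = np.sin(2 * np.pi * 1000 * t)   # narrow-band, "processed"-like signal
```

Here `looks_processed(tone, rate)` is true and `looks_processed(raw, rate)` is false; a real device would combine several such characteristics rather than rely on any single band test.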
  • a rolling window of sound (e.g. of up to a few seconds) may be captured continuously from each microphone 18, and once the trigger condition(s) has been met a sound segment of defined duration, commencing with the trigger sound, may be put into a queue for analysis. Any convenient discard technique (e.g. early or random discard) may be employed if the queue grows beyond acceptable limits.
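The rolling capture and bounded analysis queue described above might be sketched as follows. This is an illustrative Python sketch; the class and parameter names are our own, and the early-discard policy shown is just one of the discard techniques the text allows.

```python
# Hypothetical sketch of a per-microphone rolling capture buffer with a
# bounded analysis queue, as described in the text.
from collections import deque

class RollingCapture:
    def __init__(self, window, segment, max_queue):
        self.buffer = deque(maxlen=window)  # rolling window of recent samples
        self.segment = segment              # length of segment to analyse
        self.queue = deque()                # segments awaiting analysis
        self.max_queue = max_queue

    def feed(self, samples):
        """Continuously append incoming microphone samples."""
        self.buffer.extend(samples)

    def trigger(self):
        """On a trigger, queue a fixed-length segment ending at 'now'."""
        segment = list(self.buffer)[-self.segment:]
        if len(self.queue) >= self.max_queue:
            self.queue.popleft()            # early-discard oldest segment
        self.queue.append(segment)
```

For example, after feeding 100 samples into a 50-sample window and triggering, the queued segment holds the 10 most recent samples; triggering repeatedly never grows the queue past `max_queue`.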
  • Figure 3 shows a preferred operation of the device 10 in the listening mode.
  • In step 301, the device 10 captures a sample of detected sound from the output of one or more microphone 18 in response to the trigger condition(s) being met.
  • In step 302, the device 10 performs location analysis on the detected sound as described above. This may involve determining the location of the sound's source using the phase difference and/or sound intensity difference between corresponding signals captured from at least two microphones 18, and may depend on the directional sensitivity of the or each relevant microphone 18. Sounds that are determined as having emanated from the location of a known loudspeaker 14 (as determined during the training mode) can be rejected, i.e. ignored, or marked as being suspected of being a non-real-world sound.
  • In step 303, the device 10 performs transfer function, or frequency spectrum, analysis of the detected sound to identify one or more frequency characteristics that are indicative of it being either a real-world sound or a non-real-world sound.
  • this involves determining that the sound is a processed, or non-real-world, sound if it lacks high and/or low frequency components that are commonly removed by audio encoding and/or by rendering via amplifier and/or loudspeaker. Sounds that are determined as having been processed can be rejected, i.e. ignored, or marked as being suspected of being a non-real-world sound.
  • the transfer function analysis may involve comparing the sound sample (conveniently a transfer function representing the sound sample) against one or more transfer function template associated with audio recording, audio broadcast and/or audio reproduction.
  • Any audio playback system will have a transfer response h(t) and corresponding frequency domain response H(s). Playing the audio source signal sig(t) through the system will convolve sig(t) with h(t), or in a frequency domain representation, multiplication of SIG(s) (being the frequency domain representation of sig(t)) with H(s).
  • the fitting technique can use any number of standard parametric techniques.
  • the transfer functions for broadcast compression and Blu-ray encoding can, for example, be used as templates. Such templates are best suited to transients such as gunshots, breaking glass or TV screams; they will be less effective for narrower-band sounds such as vehicle noise or human speech that does not have significant variability (inflection or emotion).
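The Y(s) = SIG(s) x H(s) relationship above can be illustrated with a toy NumPy sketch. The low-pass template and the mean-relative-error fit below are stand-ins for whatever parametric fitting technique an implementation would actually use, and the tolerance is an arbitrary demo value.

```python
# Illustrative sketch: estimate a playback system's transfer function in
# the frequency domain and compare it against a stored template.
import numpy as np

def estimate_transfer(source, received):
    """H_hat(s) = Y(s) / SIG(s): ratio of received to source spectra."""
    SIG = np.fft.rfft(source)
    Y = np.fft.rfft(received)
    return Y / np.where(np.abs(SIG) > 1e-9, SIG, 1e-9)  # avoid divide-by-zero

def matches_template(h_est, h_template, tol=0.1):
    """Crude fit: mean relative magnitude error against a stored template."""
    err = np.abs(np.abs(h_est) - np.abs(h_template)) / (np.abs(h_template) + 1e-9)
    return err.mean() < tol

rng = np.random.default_rng(1)
sig = rng.normal(size=1024)                   # broadband probe signal
h_template = np.exp(-np.linspace(0, 3, 513))  # illustrative low-pass template
received = np.fft.irfft(np.fft.rfft(sig) * h_template, n=1024)
h_est = estimate_transfer(sig, received)      # recovers h_template closely
```

A sound whose estimated transfer function matches a stored template (here `matches_template(h_est, h_template)` holds, while a flat response does not) would be marked as suspected of being a non-real-world sound.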
  • In step 304, the device 10 looks for artefacts in the detected sound (i.e. in the corresponding frequency spectrum and/or waveform of the corresponding audio signal) which indicate that the sound has been subjected to audio compression, e.g. psycho-acoustic compression or another compression technique. This may involve identifying relatively low bitrate encoding in high and/or low frequency bands, and/or a reduction in encoding quality after a loud noise, and/or a noise floor level that can be associated with compression. Sounds that are determined as having been subjected to compression can be rejected, i.e. ignored, or marked as being suspected of being a non-real-world sound.
  • Sounds can be deemed to be processed or non-real-world sounds (and therefore ignored or rejected) upon being identified as such by any one of steps 302, 303 or 304, or alternatively upon being identified as such by any two or more of steps 302, 303 and 304. Any determinations made by the audio signal processor 20 in this regard may be communicated to the controller 11, which may make the decision on whether or not to ignore the detected sound and/or determine which actions are to be taken in response to the detected sound.
  • the sequencing of steps 302, 303 and 304 in Figure 3 is illustrative; in alternative embodiments these (and/or other) steps may be performed in different orders, merged, and/or operated in parallel, depending on the requirements of the application and the capabilities of the device 10.
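One way the outcomes of steps 302-304 might be combined into a single decision, in the spirit of the "weighted probabilities from each phase of analysis" mentioned in the text, is sketched below. The weights and threshold are illustrative placeholders, not values from the patent.

```python
# Illustrative sketch: weighted combination of the three analyses. Each
# flag is a bool (or a probability in [0, 1]) indicating that the analysis
# found the sound suspect, i.e. likely non-real-world.
def classify(location_suspect, spectrum_suspect, artefact_suspect,
             weights=(0.4, 0.3, 0.3), threshold=0.5):
    flags = (location_suspect, spectrum_suspect, artefact_suspect)
    score = sum(w * float(f) for w, f in zip(weights, flags))
    return "non-real-world" if score >= threshold else "real-world"
```

For example, a sound flagged by both the location and spectrum analyses scores 0.7 and is rejected, while a sound flagged only by the spectrum analysis scores 0.3 and is still treated as real-world; an "any one of steps 302, 303 or 304" policy corresponds to setting the threshold at or below the smallest weight.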
  • the device 10 combines the techniques of location analysis, spectrum analysis and artefact detection.
  • any one of these techniques may be used on its own, or in combination with any other of the techniques.
  • spectrum analysis and artefact detection in particular may each be sufficient on its own to achieve an effective level of specificity for a given use-case.
  • Training using machine-learning techniques - this may, for example, involve training the device 10 with training data which may comprise pairs of sound samples - an original sound generated live (or a very high-fidelity or artefact-free recording), and the same sound after typical encoding, compression and/or reproduction.
  • the second process does not generate an algorithm as such, and it may not be apparent how the system is achieving its levels of effective differentiation.
  • the machine-learning approach may also collapse steps in the processing: in other words, it may not be necessary to separately look for spectrum differences and compression artefacts; a trained system may simply learn the difference between processed and non-processed sounds using whatever characteristics it finds to be most capable of allowing the distinction to be made.
  • the machine-learning approach may involve providing the device 10 with reference real-world and non-real-world sounds, the device 10 being configured through machine-learning to develop its own criteria empirically for distinguishing between them. These criteria may involve elements of location, spectral distribution and artefacts, and may differ for different types of input sound.
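A toy sketch of the paired-sample training idea is given below, assuming NumPy. The hand-crafted features, the crude spectral "codec" stand-in, and the nearest-centroid classifier are all illustrative simplifications of whatever machine-learning method an implementation would actually use.

```python
# Illustrative sketch: learn to separate raw and processed sounds from
# (raw, processed) training pairs using simple spectral features.
import numpy as np

rate = 44_100

def features(signal):
    """Toy features: fraction of spectral magnitude above 10 kHz, plus the
    median spectral magnitude (a crude noise-floor proxy)."""
    spec = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), 1.0 / rate)
    total = spec.sum() + 1e-12
    return np.array([spec[freqs > 10_000].sum() / total, np.median(spec)])

def train_centroids(pairs):
    """pairs: list of (raw_sound, processed_sound) training examples."""
    raw_c = np.mean([features(r) for r, _ in pairs], axis=0)
    proc_c = np.mean([features(p) for _, p in pairs], axis=0)
    return raw_c, proc_c

def predict(signal, centroids):
    raw_c, proc_c = centroids
    f = features(signal)
    return ("real-world" if np.linalg.norm(f - raw_c) < np.linalg.norm(f - proc_c)
            else "processed")

def lowpass(x):  # crude stand-in for codec/loudspeaker band-limiting
    X = np.fft.rfft(x)
    X[np.fft.rfftfreq(len(x), 1.0 / rate) > 8_000] = 0
    return np.fft.irfft(X, n=len(x))

rng = np.random.default_rng(2)
raws = [rng.normal(size=rate) for _ in range(3)]
pairs = [(r, lowpass(r)) for r in raws]       # paired training samples
centroids = train_centroids(pairs)
```

After training, `predict` labels fresh broadband noise as real-world and its band-limited counterpart as processed; a real system would learn from far richer features and data, and, as the text notes, need not expose which characteristics it relies on.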
• the device 10 is intended to monitor for coughs, sneezes, cries for help, sounds of danger and other noises, but in a normal home the TV is likely to be active for several hours a day and to generate many similar artificial sound events.
  • the device 10 has a plurality of microphones 18 and audio signal processing circuitry 20 configured to perform the following:
• Measurement of the phase shift between corresponding sound samples from each (or at least two) of the microphones 18; and audio signal analysis of each sample, which may involve transfer function analysis and/or artefact detection.
• the device 10 determines the position of the loudspeakers 14 within the room 12, preferably by playing test signals through the television (e.g. via HDMI or other connection) and detecting the corresponding sounds rendered by the loudspeakers 14 using the microphones 18. At a minimum it is preferred that alternate left and right channel test signals are used, but more preferably test signals for 2.1, 5.1 and 7.1 sound set-ups are used, selecting channels and frequencies as appropriate.
• the device 10 can perform location analysis and reject sounds from the designated speaker locations. Alternatively, sounds from those locations can simply be marked as "suspect" and processed further before a final decision is made, for example based on weighted probabilities from each phase of analysis.
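The phase-shift measurement between pairs of microphones described above can be approximated in discrete time as a search for the cross-correlation lag at which two microphone signals best align. The following is a minimal illustrative sketch, not part of the patent disclosure; the function name and the toy pulse signals are assumptions made for the example:

```python
def estimate_delay(sig_a, sig_b, max_lag):
    """Return the lag (in samples) of sig_b relative to sig_a that
    maximises their cross-correlation; the sign of the lag indicates
    which microphone the sound reached first."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Correlate the overlapping region of the two signals at this lag.
        score = sum(sig_a[i] * sig_b[i + lag]
                    for i in range(len(sig_a))
                    if 0 <= i + lag < len(sig_b))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

A sound arriving from a known loudspeaker position produces a characteristic inter-microphone delay, so a detected sound whose estimated delay matches a designated speaker location can be marked as suspect.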
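The spectrum-analysis step can exploit the fact that lossy codecs and broadcast chains typically band-limit audio, so genuine broadband real-world sound carries relatively more energy above the codec cut-off. A naive DFT-based sketch of that idea follows; the function name, cut-off value and thresholds are illustrative assumptions, not values from the patent:

```python
import math

def high_band_ratio(samples, sample_rate, cutoff_hz):
    """Fraction of spectral power above cutoff_hz, computed with a
    naive DFT; a near-zero value on otherwise broadband input suggests
    the sound passed through a band-limiting codec."""
    n = len(samples)
    total = high = 0.0
    for k in range(1, n // 2):  # skip DC, stop below Nyquist
        re = sum(samples[i] * math.cos(2 * math.pi * k * i / n) for i in range(n))
        im = sum(samples[i] * math.sin(2 * math.pi * k * i / n) for i in range(n))
        power = re * re + im * im
        total += power
        if k * sample_rate / n > cutoff_hz:
            high += power
    return high / total if total else 0.0
```

A practical implementation would use an FFT and compare the measured cut-off against the cut-offs typical of common codecs and broadcast formats.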
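The machine-learning training described above, using paired real and processed sound samples, can be illustrated with a deliberately tiny nearest-centroid classifier over feature vectors. The two-element features and class labels here are invented for illustration; a trained system would learn far richer representations:

```python
def train_centroids(real_features, processed_features):
    """Compute one mean feature vector ("centroid") per class from
    labelled training examples."""
    def centroid(rows):
        return [sum(row[j] for row in rows) / len(rows)
                for j in range(len(rows[0]))]
    return centroid(real_features), centroid(processed_features)

def classify(features, centroids):
    """Label a new sound by whichever class centroid is closer."""
    def dist_sq(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    real_c, processed_c = centroids
    if dist_sq(features, real_c) <= dist_sq(features, processed_c):
        return "real"
    return "processed"
```

As the text notes, such a system need not be told which characteristics to use: whatever features best separate the two training classes do the work.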
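The final decision based on weighted probabilities from each phase of analysis could be fused along these lines; the weights and threshold are placeholder assumptions, not figures from the patent:

```python
def is_processed_sound(location_p, spectrum_p, artefact_p,
                       weights=(0.4, 0.3, 0.3), threshold=0.5):
    """Combine per-stage probabilities that a sound is a reproduced
    (non-real-world) sound into a single accept/reject decision."""
    score = (weights[0] * location_p
             + weights[1] * spectrum_p
             + weights[2] * artefact_p)
    return score >= threshold
```

A sound flagged in this way would then be ignored by the controller 11 rather than treated as a real-world event such as a cough or a cry for help.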

Abstract

The invention concerns a sound recognition method which is capable of distinguishing real-world sounds from pre-recorded or broadcast sounds by determining whether the sound emanates from a designated location, such as the location of a loudspeaker, or by recognising characteristics of the sound indicating that it has been subjected to audio recording, audio broadcasting and/or audio reproduction processes.
PCT/EP2018/067333 2017-06-28 2018-06-27 Sound responsive device and method WO2019002417A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB1710286.4 2017-06-28
GB1710286.4A GB2563868B (en) 2017-06-28 2017-06-28 Sound responsive device and method

Publications (1)

Publication Number Publication Date
WO2019002417A1 true WO2019002417A1 (fr) 2019-01-03

Family

ID=59523583

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2018/067333 WO2019002417A1 (fr) Sound responsive device and method

Country Status (2)

Country Link
GB (1) GB2563868B (fr)
WO (1) WO2019002417A1 (fr)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1998034216A2 (fr) * 1997-01-31 1998-08-06 T-Netix, Inc. Systeme et procede pour detecter une voix enregistree
JP2005250233A (ja) * 2004-03-05 2005-09-15 Sanyo Electric Co Ltd ロボット装置

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BRIAN D'ALESSANDRO ET AL: "Mp3 bit rate quality detection through frequency spectrum analysis", MULTIMEDIA AND SECURITY, ACM, 2 PENN PLAZA, SUITE 701 NEW YORK NY 10121-0701 USA, 7 September 2009 (2009-09-07), pages 57 - 62, XP058088142, ISBN: 978-1-60558-492-8, DOI: 10.1145/1597817.1597828 *
GRIGORAS ET AL: "Statistical Tools for Multimedia Forensics", CONFERENCE: 39TH INTERNATIONAL CONFERENCE: AUDIO FORENSICS: PRACTICES AND CHALLENGES; JUNE 2010, AES, 60 EAST 42ND STREET, ROOM 2520 NEW YORK 10165-2520, USA, 17 June 2010 (2010-06-17), XP040567050 *
HANY FARID: "Detecting Digital Forgeries Using Bispectral Analysis", 1 January 1999 (1999-01-01), XP055499185, Retrieved from the Internet <URL:https://dspace.mit.edu/bitstream/handle/1721.1/6678/AIM-1657.pdf?sequence=2> [retrieved on 20180813] *
HENNEQUIN ROMAIN ET AL: "Codec independent lossy audio compression detection", 2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 5 March 2017 (2017-03-05), pages 726 - 730, XP033258513, DOI: 10.1109/ICASSP.2017.7952251 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112578688A (zh) * 2019-09-12 2021-03-30 Coretronic Corporation Projection device and sound output control device thereof
WO2021099760A1 (fr) * 2019-11-21 2021-05-27 Cirrus Logic International Semiconductor Limited Detection of live speech
GB2603397A (en) * 2019-11-21 2022-08-03 Cirrus Logic Int Semiconductor Ltd Detection of live speech
US11705109B2 (en) 2019-11-21 2023-07-18 Cirrus Logic, Inc. Detection of live speech

Also Published As

Publication number Publication date
GB2563868A (en) 2019-01-02
GB2563868B (en) 2020-02-19
GB201710286D0 (en) 2017-08-09

Similar Documents

Publication Publication Date Title
JP7271674B2 (ja) Optimization by noise classification of network microphone devices
US11183198B2 (en) Multi-mode audio recognition and auxiliary data encoding and decoding
JP7397066B2 (ja) Method, computer-readable storage medium and apparatus for dynamic volume adjustment via audio classification
US10275210B2 (en) Privacy protection in collective feedforward
JP5485256B2 (ja) Hearing aid, hearing assistance system, walking detection method and hearing assistance method
JP6576934B2 (ja) Signal-quality-based enhancement and compensation of compressed audio signals
CN102016994B (zh) 用于处理音频信号的设备及其方法
US10275209B2 (en) Sharing of custom audio processing parameters
US9959886B2 (en) Spectral comb voice activity detection
CN103903606B (zh) Noise control method and device
RU2008142956A (ru) Data processing device and data processing method
US10853025B2 (en) Sharing of custom audio processing parameters
CN104937955B (zh) Automatic loudspeaker polarity detection
TWI831785B (zh) Personal hearing device
WO2019002417A1 (fr) Sound responsive device and method
CN116844559A Method for detecting alert signals in changing environments
WO2017156895A1 (fr) Method and device for playing multimedia content
CN109997186B (zh) Device and method for classifying an acoustic environment
CN115348507A (zh) Impulse noise suppression method and system, readable storage medium and computer device
EP2849341A1 Volume control for audio rendering of an audio signal
EP3419021A1 Device and method for distinguishing natural and artificial sounds
CN115243183A (zh) Audio detection method, device and storage medium
JP2010230972A (ja) Sound signal processing device, method, program, and reproduction device
CN107296613B (zh) BIC-based binaural correlation hearing test method and system
KR102437054B1 (ko) Apparatus and method for detecting covert listening devices using automatic voice detection and frequency

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18734552

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 11.03.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18734552

Country of ref document: EP

Kind code of ref document: A1