EP3419021A1

EP3419021A1 - Device and method for distinguishing natural and artificial sound

Info

Publication number: EP3419021A1
Application number: EP17305754.8A
Authority: EP
Inventors: Jean-Ronan Vigouroux; Alexey Ozerov; Erwan Le Merrer; Philippe Gilberton
Original assignee: Thomson Licensing SAS
Current assignee: Thomson Licensing SAS
Priority date: 2017-06-20
Filing date: 2017-06-20
Publication date: 2018-12-26

Abstract

Device (400) and method for determining if sound is artificial. A hardware input interface (440) obtains (S510) a signal corresponding to sound in an environment, and at least one hardware processor (410) calculates (S520, S530) from the signal at least one of a descriptor related to loudness and a descriptor related to silence, and determines (S540) that the sound is artificial in case a variance of the descriptor related to loudness is below a first threshold value or in case the descriptor related to silence is below a second threshold value.

Description

TECHNICAL FIELD

The present disclosure relates generally to audio recognition and in particular to determining if sound is natural or artificial.

BACKGROUND

This section is intended to introduce the reader to various aspects of art, which may be related to various aspects of the present disclosure that are described and/or claimed below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present disclosure. Accordingly, it should be understood that these statements are to be read in this light, and not as admissions of prior art.
Audio (acoustic, sound) recognition is particularly suitable for monitoring people's activity as it is relatively non-intrusive, requires no detectors other than microphones and is relatively accurate.
Figure 1 illustrates a generic conventional audio classification pipeline 100 that comprises an audio sensor 110 capturing a raw audio signal, a preprocessing module 120 that prepares the captured audio for a features extraction module 130 that outputs extracted features to a classifier module 140 that uses entries in an audio database 150 to label audio that is then output.
The labelled audio can then be used to, for example, determine the activities of persons (and even pets) in the location where the audio was captured. Knowledge of the activities can be used in situations like e-health, care of children or the elderly, and home security. In addition, parents could use the knowledge to determine what their children do when they are alone at home: for instance, after school, are they doing their homework or watching television?
In some of these cases, it can be important to distinguish between natural sound - for example talking or singing persons, or a barking dog - and artificial sound, i.e. sound that is rendered by a rendering device such as a radio, a television or a hi-fi system. It will be appreciated that persons talking on the television could be mistaken for real persons discussing. So far, this issue appears to have no suitable conventional solution.
It will be appreciated that there is a desire for a solution that addresses this problem. The present principles provide such a solution.

SUMMARY OF DISCLOSURE

In a first aspect, the present principles are directed to a method for determining if sound is artificial. At a device, a hardware input interface obtains a signal corresponding to sound in an environment and at least one hardware processor calculates from the signal at least one of a descriptor related to loudness and a descriptor related to silence, and determines that the sound is artificial in case a variance of the descriptor related to loudness is below a first threshold value or in case the descriptor related to silence is below a second threshold value.
Various embodiments of the first aspect include:

That the descriptor related to silence is a ratio of windows of the signal that are silent to windows of the signal that are non-silent. Adjacent windows can be overlapping. A window can be deemed as silent in case its Root Mean Square (RMS) power is below a third threshold.
That the descriptor related to loudness is a standard deviation for power of the signal.

In a second aspect, the present principles are directed to a device for determining if sound is artificial, comprising a hardware input interface configured to obtain a signal corresponding to sound in an environment, and at least one hardware processor configured to calculate from the signal at least one of a descriptor related to loudness and a descriptor related to silence, and determine that the sound is artificial in case a variance of the descriptor related to loudness is below a first threshold value or in case the descriptor related to silence is below a second threshold value.
Various embodiments of the second aspect include:

That the descriptor related to silence is a ratio of windows of the signal that are silent to windows of the signal that are non-silent. Adjacent windows can be overlapping. A window can be deemed as silent in case its Root Mean Square (RMS) power is below a third threshold.
That the descriptor related to loudness is a standard deviation for power of the signal.
That the input interface is configured to capture the sound. The input interface can comprise a microphone.
That the device further comprises an output interface for outputting information about whether the sound is natural or artificial.

In a third aspect, the present principles are directed to a computer program comprising program code instructions executable by a processor for implementing the method according to the first aspect.
In a fourth aspect, the present principles are directed to a computer program product which is stored on a non-transitory computer readable medium and comprises program code instructions executable by a processor for implementing the method according to the first aspect.

BRIEF DESCRIPTION OF DRAWINGS

Preferred features of the present principles will now be described, by way of non-limiting example, with reference to the accompanying drawings, in which:

Figure 1 illustrates a generic conventional audio classification pipeline;
Figure 2 illustrates conventional downward compression with a hard knee;
Figure 3 illustrates a signal without dynamic range compression and the same signal with dynamic range compression;
Figure 4 illustrates a device for audio distinction according to the present principles; and
Figure 5 illustrates a flowchart for a method of audio distinction according to the present principles.

DESCRIPTION OF EMBODIMENTS

One way of monitoring a person in order to, for instance, anticipate problems, is to verify if the habits of the person are followed. To do this, it can be useful to classify ambient sound in the person's location as:

no sound, i.e., silence.
natural ambient sound, such as for example people who are physically present speaking, cooking, or a dog barking.
artificial ambient sound, such as sound coming from a radio, a television, or a hi-fi system. In this context, "artificial" means that the sound was processed for broadcast or recording and subsequent rendering.

To detect artificial ambient sound, the present principles rely on the fact that most artificial audio sources use dynamic range compression to enhance the sound and to make it more present. It is for example possible to enhance the sound to avoid a clipping effect or amplifier chain saturation, or to better fit into the Frequency Modulation standard, which has a limited frequency spectrum range.
Dynamic range compression, which is a very common technique in the broadcast chain and in media content workflows, amplifies parts of the audio signal with low amplitude (upward compression), reduces the loud parts of the sound (downward compression), or both. On the other hand, natural sounds tend to be characterized by a wider dynamic range, which typically means that more low power sounds tend to be present in a natural audio signal than in a dynamic range compressed audio signal. Hence, detecting such dynamic differences within the sound can help differentiate artificial and natural sound.

Dynamic Range Compression

Dynamic Range Compression (DRC) for audio will now be described in further detail. As already mentioned, DRC can amplify low sounds, attenuate high sounds, or both.
Figure 2 illustrates conventional downward compression with a hard knee. A compression function curve 210 has a first part 212 that is neutral - i.e., an input level transformed by this part results in an equal output level. The curve further has a second part 214 that meets the first part 212 at a hard knee 216. The second part 214 performs downward compression, which means that an input level L_i is transformed into a lower output level L_o. Figure 2 also shows a threshold 220 that lies between the first part 212 and the second part 214. In this example, the threshold 220 coincides with the hard knee 216, but it will be appreciated that in case a soft knee is used, this will extend around the threshold 220 and comprise part of the first part 212 and the second part 214 as well.
It will also be understood that for upward compression, the first part of the curve would be flatter so that an input level results in a higher output level (except perhaps at the hard knee). It will further be understood that the function can allow both downward and upward compression, in which case the first part and the second part can have identical slopes or different slopes.
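The hard-knee downward compression curve of Figure 2 can be sketched as a small function on levels expressed in dB. This is a minimal illustration, not the curve of any particular compressor: the function name, the threshold of -20 dB and the 4:1 ratio are assumptions chosen for the example.

```python
def downward_compress_db(level_db: float,
                         threshold_db: float = -20.0,  # hypothetical threshold
                         ratio: float = 4.0) -> float:  # hypothetical 4:1 ratio
    """Map an input level (dB) to an output level (dB) with a hard knee."""
    if level_db <= threshold_db:
        # Neutral part of the curve: output level equals input level.
        return level_db
    # Compressing part: the excess above the threshold is divided by the ratio,
    # so the output level is lower than the input level.
    return threshold_db + (level_db - threshold_db) / ratio
```

With these assumed values, an input of -30 dB is left unchanged, while an input of 0 dB is lowered to -15 dB.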
DRC can for example be used:

In public spaces to make music sound louder without having to increase the peak amplitude.
In music production for a better mix between vocals and instruments.
In voice processing to avoid sibilance.
In broadcasting to fit a broadcast signal with narrow range, as will be explained in more detail.
In marketing to increase the impact of commercials.
To protect circuitry in devices with amplifiers, and also to avoid clipping or saturation effects.
In hearing aids and headphones to make certain sounds more audible while others are attenuated.

Figure 3 illustrates a signal 310 without dynamic range compression and the same signal 320 with dynamic range compression. As can be seen, the loud parts have been attenuated (downward compression) and the low parts have been amplified (upward compression).
In the case of FM (Frequency Modulation) radio broadcasting, the characteristics of the frequency modulation limit the frequency spectrum range, which in turn limits the acoustic dynamic range. If the frequency spectrum range is not respected, this will result in spectrum overlaps and audio distortion. Simply reducing the amplitude of the signal fed to the modulator so that the signal is never clipped requires a substantial reduction of the input signal, which results in a reduction in the signal-to-noise ratio (SNR). A lower SNR in turn means that a listener will hear more transmission noise, especially during the quieter parts of the transmission.
The effect on FM also applies to digital radio that includes an ADC (Analog to Digital Converter) in front of the modulator and for which dynamic range is limited.
DRC can also protect the audio amplification chain, as well as any speakers, from saturation when they are not dimensioned to render the natural dynamic range.
In addition, compressing FM radio or TV broadcast audio enables the high-power transmitter amplifier required to broadcast the signal over the air to operate at a more constant output power. Doing so can increase the lifetime of the amplifier. Indeed, the standardization community tries to find the best compromise between audio quality for the end user and economy when it comes to the broadcasting infrastructure.
Further, Automatic Gain Control (AGC) is useful for microphone capture when a speaker talks over a low background sound that should be shared with the audience. AGC aims to provide a controlled output signal level regardless of the input signal level. In other words, weak input signals are amplified and loud input signals are attenuated. The outcome is a less dynamic sound that is suitable for network broadcasting.
It will be understood that an audio signal that is broadcast or streamed over the air, a cellular network or a broadband network typically has a compressed acoustic dynamic range. Hence, a music or voice audio signal listened to through a speaker has different dynamic properties than audio produced by natural sources like human voices, animal sounds and (non-amplified) instruments.
As an effect of DRC is an amplification of sounds below a first amplitude threshold and an attenuation of sounds above a second amplitude threshold (possibly the same as the first threshold), it can be seen that sounds with DRC have a smaller amplitude variance than natural sounds. In addition, most broadcast sources - television, radio, music - tend to avoid silence. Therefore, the proportion of silence will be low for artificial sound sources.
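The reduction in spread caused by combined upward and downward compression can be illustrated with a short sketch operating on per-window levels in dB. The two thresholds (-40 dB and -10 dB), the 4:1 ratio and the function names are assumptions made for this example only.

```python
import statistics
from typing import List, Sequence


def compress_both_ways_db(level_db: float,
                          low_db: float = -40.0,   # hypothetical first threshold
                          high_db: float = -10.0,  # hypothetical second threshold
                          ratio: float = 4.0) -> float:
    """Raise levels below low_db and lower levels above high_db."""
    if level_db < low_db:
        return low_db - (low_db - level_db) / ratio    # upward compression
    if level_db > high_db:
        return high_db + (level_db - high_db) / ratio  # downward compression
    return level_db                                    # neutral region


def spread_before_after(levels_db: Sequence[float]) -> List[float]:
    """Return the population standard deviation before and after compression."""
    compressed = [compress_both_ways_db(x) for x in levels_db]
    return [statistics.pstdev(levels_db), statistics.pstdev(compressed)]
```

For any set of levels that crosses at least one of the thresholds, the second value returned by `spread_before_after` is smaller than the first, which is the property the descriptors below exploit.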
Hence to distinguish artificial ambient sound from natural ambient sound, a device can analyse captured ambient sound to determine at least one of:

if the variance of the amplitude is above (natural ambient sound) or below (artificial ambient sound) a variance threshold value, and
if the level of silence is above (natural ambient sound) or below (artificial ambient sound) a silence threshold value
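
The two tests above can be combined into a single decision, sketched below. Either descriptor may be omitted, since at least one of them is required. The threshold values and the function name are hypothetical; suitable values would depend on the equipment and the environment.

```python
from typing import Optional


def is_artificial(variation: Optional[float] = None,
                  silence_level: Optional[float] = None,
                  variation_threshold: float = 0.5,   # hypothetical value
                  silence_threshold: float = 0.1) -> bool:  # hypothetical value
    """Return True if either available descriptor falls below its threshold."""
    if variation is None and silence_level is None:
        raise ValueError("at least one descriptor is required")
    low_variation = variation is not None and variation < variation_threshold
    little_silence = silence_level is not None and silence_level < silence_threshold
    return low_variation or little_silence
```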

One way of calculating the amplitude variance is as follows, but it will be appreciated that other ways exist. First, the captured sound is divided into a number of sections (or windows); the windows can be distinct, but are generally overlapping, with each subsequent window starting at the middle of the preceding window. Each window has an index, denoted k. The windows have the same size, denoted w. The captured sound, i.e. the part for which it should be determined if it is natural or artificial, is thus divided into a set of K possibly overlapping windows.
The Root Mean Square (RMS) power of the sound for the window k is defined as: $P_{k} = \sqrt{\frac{1}{w} \sum_{i = 0}^{w - 1} s_{i}^{2}}$
where the s_i are the w contiguous samples of the sound in the window k.
The size w may take the value of 1024, but other values such as 2048 have also been contemplated.
The output for the different windows defines a series of instantaneous power values P_k for the K windows of the captured sound signal.
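The windowing and RMS computation described above can be sketched as follows. The function name and the `hop` parameter are assumptions; the default hop of half a window reproduces the case where each window starts at the middle of the preceding one, and trailing samples that do not fill a whole window are simply dropped in this sketch.

```python
import math
from typing import List, Optional, Sequence


def rms_per_window(samples: Sequence[float],
                   w: int = 1024,
                   hop: Optional[int] = None) -> List[float]:
    """Return the RMS power P_k of each window of w contiguous samples."""
    if hop is None:
        hop = w // 2  # 50 % overlap between adjacent windows
    powers = []
    for start in range(0, len(samples) - w + 1, hop):
        window = samples[start:start + w]
        powers.append(math.sqrt(sum(s * s for s in window) / w))
    return powers
```

Setting `hop=w` gives distinct, non-overlapping windows.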
The mean power can then be calculated as $\overline{P} = \frac{1}{K} \sum_{k = 1}^{K} P_{k}$
and the standard deviation as $\sigma (P) = \sqrt{\frac{1}{K} \sum_{k = 1}^{K} (P_{k} - \overline{P})^{2}}$
Since amplitude and power are inextricably linked, the standard deviation for the power is also an indirect measure of the standard deviation for the amplitude.
It is preferred to obtain a normalised measure by dividing the standard deviation by the mean power: $CV (P) = \frac{\sigma (P)}{\overline{P}}$
This coefficient of variation of the power is a first descriptor, related to loudness, used to distinguish natural sounds from artificial sounds.
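A minimal sketch of the first descriptor, computed from a list of per-window RMS powers, is given below. It uses the population standard deviation (mean of the squared deviations from the mean power). The function name and the choice to return 0.0 for an all-zero signal are assumptions.

```python
import math
from typing import Sequence


def coefficient_of_variation(powers: Sequence[float]) -> float:
    """Return CV(P): standard deviation of the P_k divided by their mean."""
    if not powers:
        raise ValueError("at least one window is required")
    K = len(powers)
    mean_power = sum(powers) / K
    if mean_power == 0.0:
        return 0.0  # assumption: an all-zero signal has no variation
    variance = sum((p - mean_power) ** 2 for p in powers) / K
    return math.sqrt(variance) / mean_power
```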
To calculate the level of silence, first the windows whose RMS power is below a given threshold τ are marked as 'silent'.
Then in an optional step, consecutive windows marked 'silent' are grouped in 'silent' groups, and consecutive windows marked 'non-silent' are grouped in 'non-silent' groups. The signal is therefore seen as a series of interleaved 'silent' and 'non-silent' groups. To clean the signal of anomalous or outlying events, groups of 'non-silent' windows smaller than a certain size (such as a few windows, e.g. three) are marked 'silent'.
Finally, the second descriptor, related to silence, for distinguishing sound is the proportion of 'silent' windows over the number of windows K of the signal subject to examination: $\frac{|\text{silent windows}|}{K}$
where K is the number of windows considered.
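The silence descriptor, including the optional clean-up step, can be sketched as follows. The function name is hypothetical, and reading "smaller than a certain size" as "strictly shorter than `min_non_silent_run` windows" is an assumption of this sketch.

```python
from itertools import groupby
from typing import Sequence


def silence_proportion(powers: Sequence[float],
                       tau: float,
                       min_non_silent_run: int = 3) -> float:
    """Return the proportion of windows marked 'silent' among the K windows."""
    if not powers:
        raise ValueError("at least one window is required")
    flags = [p < tau for p in powers]  # True means 'silent'
    cleaned = []
    for is_silent, run in groupby(flags):
        run_length = len(list(run))
        # Optional clean-up: short 'non-silent' groups are re-marked 'silent'.
        if not is_silent and run_length < min_non_silent_run:
            is_silent = True
        cleaned.extend([is_silent] * run_length)
    return sum(cleaned) / len(cleaned)
```

Passing `min_non_silent_run=1` disables the clean-up step, since no group can be shorter than one window.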
In a variation, the detection of the silent windows may occur before the calculation of the descriptor CV(P) explained above, and this descriptor may be computed only on the windows which are marked as 'non silent'.
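This variation can be sketched by filtering out the silent windows before computing the coefficient of variation. The function name and the choice to return `None` when every window is silent are assumptions.

```python
import statistics
from typing import Optional, Sequence


def cv_of_non_silent(powers: Sequence[float], tau: float) -> Optional[float]:
    """Return CV(P) over windows whose RMS power is not below tau."""
    active = [p for p in powers if p >= tau]
    if not active:
        return None  # assumption: descriptor undefined if every window is silent
    mean_power = statistics.fmean(active)
    if mean_power == 0.0:
        return 0.0
    # pstdev is the population standard deviation (division by the count).
    return statistics.pstdev(active) / mean_power
```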
For a natural sound, both descriptors are expected to have a high value: a high variation of the power for the first and a large proportion of silent windows for the second. For an artificial sound the opposite is expected: the power is almost constantly high and there are nearly no silent windows. This is used in a classification system as described below.
To classify the sound, different possibilities exist. A first possibility is to take the first and second descriptors as input to a supervised classifier that is trained to separate the natural sound from the artificial sound. The supervised classifier may for instance be based on a decision tree, using two thresholds corresponding to the two descriptors.
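As a rough illustration of the first possibility, the sketch below learns one cut per descriptor from labelled examples and applies them as a two-level decision tree. It is a deliberately simplified stand-in for a real decision-tree learner: each cut is fitted independently, natural sound is assumed to lie above the cut, and all names are hypothetical.

```python
from typing import Sequence


def fit_cut(values: Sequence[float], is_natural: Sequence[bool]) -> float:
    """Return the midpoint cut with the fewest training errors."""
    ordered = sorted(set(values))
    if len(ordered) < 2:
        raise ValueError("need at least two distinct training values")
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]

    def errors(cut: float) -> int:
        # A sample is predicted natural when its value lies above the cut.
        return sum((v > cut) != label for v, label in zip(values, is_natural))

    return min(candidates, key=errors)


def tree_predict(variation: float, silence: float,
                 variation_cut: float, silence_cut: float) -> str:
    """Two-level tree: test the loudness descriptor, then the silence one."""
    if variation <= variation_cut:
        return "artificial"
    if silence <= silence_cut:
        return "artificial"
    return "natural"
```

In use, `fit_cut` would be called once per descriptor on descriptors computed from recordings whose nature is known, and `tree_predict` would then be applied to new recordings.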
A second possibility is to use a set of conditions such as:

IF descriptor1 > threshold1 THEN sound is artificial
ELSE IF descriptor2 > threshold2 THEN sound is natural
ELSE IF descriptor1 > threshold3 AND descriptor2 < threshold4 THEN sound is artificial
ELSE IF descriptor1 < threshold5 AND descriptor2 > threshold6 THEN sound is natural

Naturally, there are many ways of expressing the conditions, using different thresholds that in addition may depend on many things such as locality and equipment.
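An ordered set of conditions of this kind can be expressed as a list of rules that are tried in sequence. The sketch below is illustrative only: the threshold values are hypothetical, and the direction of each comparison is an assumption chosen to match the expected behaviour of the two descriptors (low values indicating artificial sound), not a transcription of the conditions listed above.

```python
from typing import Callable, List, Optional, Tuple

Rule = Tuple[Callable[[float, float], bool], str]

# Hypothetical rules: d1 is the loudness descriptor, d2 the silence descriptor.
RULES: List[Rule] = [
    (lambda d1, d2: d1 < 0.2, "artificial"),
    (lambda d1, d2: d2 > 0.4, "natural"),
    (lambda d1, d2: d1 < 0.5 and d2 < 0.1, "artificial"),
    (lambda d1, d2: d1 > 0.8 and d2 > 0.2, "natural"),
]


def classify_by_rules(d1: float, d2: float,
                      rules: Optional[List[Rule]] = None) -> Optional[str]:
    """Return the label of the first matching rule, or None if none matches."""
    for condition, label in (RULES if rules is None else rules):
        if condition(d1, d2):
            return label
    return None  # no rule fired: the sound is left unclassified
```

Keeping the rules in a list makes it straightforward to swap in a different set of thresholds per locality or per device.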
Figure 4 illustrates a device for audio distinction 400 according to the present principles. The device 400 comprises at least one hardware processing unit ("processor") 410 configured to execute instructions of a first software program and to process audio for distinction, as described herein. The device 400 further comprises at least one memory 420 (for example ROM, RAM and Flash or a combination thereof) configured to store the software program and data required to distinguish sound. The device 400 can also comprise at least one user communications interface ("User I/O") 430 for interfacing with a user.
The device 400 further comprises an input interface 440 and an output interface 450. The input interface 440 is configured to obtain audio for distinguishing; the input interface 440 can be adapted to capture audio, for example a microphone, but it can also be an interface adapted to receive captured audio. The output interface 450 is configured to output information about distinguished audio - is it natural or artificial sound - for example for presentation on a screen or by transfer to a further device.
A non-transitory, computer-readable storage medium 460 includes a computer program with instructions that, when executed by the processor 410, perform the methods described herein.
The processor 410 can also be configured to use the distinction to determine user activity as described in the background part of the description.
The device 400 is preferably implemented as a single device such as a gateway, but its functionality can also be distributed over a plurality of devices.
In some cases, the processor 410 may have access to other data and use this data to determine that the sound has been incorrectly classified, for example in case the sound was classified as natural and the data originates from another device and indicates that artificial sound is indeed rendered in the environment where the processor 410 is located. If this occurs regularly, it could mean that the classification model used by the processor 410 is not accurate enough. In this case, the device 400 can send anonymized descriptors that caused incorrect classification to a server, so that the global model can be adapted to these descriptors (i.e. recomputed with those new inputs). The global model can then be distributed to the individual devices. In such an implementation, a stream processing big-data infrastructure such as Storm or Spark is particularly relevant.
Figure 5 illustrates a flowchart for a method of audio distinction according to the present principles. In step S510 the device 400 obtains captured sound, either by capturing it itself or receiving captured sound from another device. In step S520, the processor 410 calculates power standard deviation, i.e., the first descriptor, (as a measure of amplitude standard deviation) as already explained. In step S530, the processor 410 calculates the silence level, i.e., the second descriptor, as already described. Finally, in step S540, the processor uses the first and second descriptors to determine if the captured sound is natural or artificial, as already described.
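Steps S520 to S540 can be strung together as in the following sketch, which takes already captured samples (step S510) as input. The window size of 1024 comes from the description; the silence threshold `tau`, the two decision thresholds and the function names are hypothetical, and the optional clean-up of short non-silent groups is omitted for brevity.

```python
import math
from typing import Sequence, Tuple


def compute_descriptors(samples: Sequence[float], w: int = 1024,
                        tau: float = 0.01) -> Tuple[float, float]:
    """Return (coefficient of variation of power, proportion of silent windows)."""
    hop = w // 2  # each window starts at the middle of the preceding one
    powers = [math.sqrt(sum(s * s for s in samples[i:i + w]) / w)
              for i in range(0, len(samples) - w + 1, hop)]
    if not powers:
        raise ValueError("signal shorter than one window")
    K = len(powers)
    mean_power = sum(powers) / K
    std_power = math.sqrt(sum((p - mean_power) ** 2 for p in powers) / K)
    variation = std_power / mean_power if mean_power > 0 else 0.0  # S520
    silence = sum(p < tau for p in powers) / K                     # S530
    return variation, silence


def classify_sound(samples: Sequence[float],
                   variation_threshold: float = 0.3,   # hypothetical value
                   silence_threshold: float = 0.05) -> str:  # hypothetical value
    """S540: artificial if either descriptor is below its threshold."""
    variation, silence = compute_descriptors(samples)
    if variation < variation_threshold or silence < silence_threshold:
        return "artificial"
    return "natural"
```

With these assumed thresholds, a steady synthetic tone is labelled artificial, while the same tone interrupted by long pauses is labelled natural; real recordings would require thresholds tuned on representative data.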
The processor 410 can then for example output information on whether the sound is natural or artificial through the output interface 450 or use this information internally as input to other functions.
It will be appreciated that the present principles can provide a solution for audio recognition that can enable:

Respect of users' privacy since the sound can be distinguished in a device located in the users' location rather than being sent to a device "in the cloud".
A small footprint on the distinguishing device since it is sufficient to retain the model, some variables and the present sound windows.

It should be understood that the elements shown in the figures may be implemented in various forms of hardware, software or combinations thereof. Preferably, these elements are implemented in a combination of hardware and software on one or more appropriately programmed general-purpose devices, which may include a processor, memory and input/output interfaces. Herein, the phrase "coupled" is defined to mean directly connected to or indirectly connected with through one or more intermediate components. Such intermediate components may include both hardware and software based components.
The present description illustrates the principles of the present disclosure. It will thus be appreciated that those skilled in the art will be able to devise various arrangements that, although not explicitly described or shown herein, embody the principles of the disclosure and are included within its scope.
All examples and conditional language recited herein are intended for educational purposes to aid the reader in understanding the principles of the disclosure and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions.
Moreover, all statements herein reciting principles, aspects, and embodiments of the disclosure, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.
Thus, for example, it will be appreciated by those skilled in the art that the block diagrams presented herein represent conceptual views of illustrative circuitry embodying the principles of the disclosure. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudocode, and the like represent various processes which may be substantially represented in computer readable media and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
The functions of the various elements shown in the figures may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term "processor" or "controller" should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, read only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage.
Other hardware, conventional and/or custom, may also be included. Similarly, any switches shown in the figures are conceptual only. Their function may be carried out through the operation of program logic, through dedicated logic, through the interaction of program control and dedicated logic, or even manually, the particular technique being selectable by the implementer as more specifically understood from the context.
In the claims hereof, any element expressed as a means for performing a specified function is intended to encompass any way of performing that function including, for example, a) a combination of circuit elements that performs that function or b) software in any form, including, therefore, firmware, microcode or the like, combined with appropriate circuitry for executing that software to perform the function. The disclosure as defined by such claims resides in the fact that the functionalities provided by the various recited means are combined and brought together in the manner which the claims call for. It is thus regarded that any means that can provide those functionalities are equivalent to those shown herein.

Claims

1. A method for determining if sound is artificial, the method comprising at a device (400):
obtaining (S510), by a hardware input interface (440), a signal corresponding to sound in an environment;

calculating (S520, S530), by at least one hardware processor (410), from the signal at least one of a descriptor related to loudness and a descriptor related to silence; and

determining (S540), by the at least one hardware processor (410), that the sound is artificial in case a variance of the descriptor related to loudness is below a first threshold value or in case the descriptor related to silence is below a second threshold value.
2. The method of claim 1, wherein the descriptor related to silence is a ratio of windows of the signal that are silent to windows of the signal that are non-silent.
3. The method of claim 2, wherein the adjacent windows are overlapping.
4. The method of claim 2 or 3, wherein a window is silent in case its Root Mean Square (RMS) power is below a third threshold.
5. The method of any one of claims 1 to 4, wherein the descriptor related to loudness is a standard deviation for power of the signal.
6. A device (400) for determining if sound is natural, comprising:
a hardware input interface (440) configured to obtain a signal corresponding to sound in an environment; and

at least one hardware processor (410) configured to:
calculate from the signal at least one of a descriptor related to loudness and a descriptor related to silence; and

determine that the sound is artificial in case a variance of the descriptor related to loudness is below a first threshold value or in case the descriptor related to silence is below a second threshold value.
7. The device of claim 6, wherein the descriptor related to silence is a ratio of windows of the signal that are silent to windows of the signal that are non-silent.
8. The device of claim 7, wherein the adjacent windows are overlapping.
9. The device of claim 7 or 8, wherein a window is silent in case its Root Mean Square (RMS) power is below a third threshold.
10. The device of any one of claims 6 to 9, wherein the descriptor related to loudness is a standard deviation for power of the signal.
11. The device of any one of claims 6 to 10, wherein the input interface (440) is configured to capture the sound.
12. The device of claim 11, wherein the input interface (440) comprises a microphone.
13. The device of any one of claims 6 to 12, further comprising an output interface (450) for outputting information about whether the sound is natural or artificial.
14. A computer program comprising instructions that, when executed, cause at least one hardware processor (410) to perform the method of any one of claims 1-5.
15. A non-transitory, computer-readable storage medium (460) including instructions that, when executed, cause at least one hardware processor (410) to perform the method of any one of claims 1-5.