EP2454725A1 - Method and system for remotely guarding an area by means of cameras and microphones

Method and system for remotely guarding an area by means of cameras and microphones

Info

Publication number
EP2454725A1
EP2454725A1
Authority
EP
European Patent Office
Prior art keywords
sound
sound sources
area
location
intelligibility
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP10736847A
Other languages
German (de)
French (fr)
Inventor
Frank Leonard Kooi
Kim Kranenborg
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nederlandse Organisatie voor Toegepast Natuurwetenschappelijk Onderzoek TNO
Original Assignee
Nederlandse Organisatie voor Toegepast Natuurwetenschappelijk Onderzoek TNO
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nederlandse Organisatie voor Toegepast Natuurwetenschappelijk Onderzoek TNO
Priority to EP10736847A
Publication of EP2454725A1
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G08 SIGNALLING
    • G08B SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
    • G08B 13/00 Burglar, theft or intruder alarms
    • G08B 13/16 Actuation by interference with mechanical vibrations in air or other fluid
    • G08B 13/1654 Actuation by interference with mechanical vibrations in air or other fluid using passive vibration detection systems
    • G08B 13/1672 Actuation by interference with mechanical vibrations in air or other fluid using passive vibration detection systems using sonic detecting means, e.g. a microphone operating in the audio frequency range
    • G08B 13/18 Actuation by interference with heat, light, or radiation of shorter wavelength; Actuation by intruding sources of heat, light, or radiation of shorter wavelength
    • G08B 13/189 Actuation by interference with heat, light, or radiation of shorter wavelength using passive radiation detection systems
    • G08B 13/194 Actuation by interference with heat, light, or radiation of shorter wavelength using passive radiation detection systems using image scanning and comparing systems
    • G08B 13/196 Actuation by interference with heat, light, or radiation of shorter wavelength using passive radiation detection systems using image scanning and comparing systems using television cameras
    • G08B 13/19678 User interface
    • G08B 13/19686 Interfaces masking personal details for privacy, e.g. blurring faces, vehicle license plates
    • G08B 13/19691 Signalling events for better perception by user, e.g. indicating alarms by making display brighter, adding text, creating a sound
    • G08B 13/19693 Signalling events for better perception by user using multiple video sources viewed on a single or compound screen
    • G08B 13/19697 Arrangements wherein non-video detectors generate an alarm themselves

Abstract

Method and system for remotely guarding an area by means of cameras and microphones at several locations within that area, which are connected to a central surveillance post, comprising the steps of: displaying, at an observation screen, the various camera and microphone locations on a map of said area; enabling selective activation, e.g. by an operator, of camera images for zooming in; deriving, per microphone or group of microphones, an attention value based on the sound picked up by that sound source; and rendering, when the attention value passes a predetermined threshold value, a sound signal of limited duration from the sound source causing the threshold passage. Sound processing may be used to further reduce the intelligibility of the sound. An audible and/or visual representation of the location of the sound source causing the threshold passage is also given. The location representation may be performed by means of spatial audible reproduction of the relevant sound representation in the vicinity of said observation screen and/or by means of visual display of the location of the sound source causing said threshold passage.

Description

Title: Method and system for remotely guarding an area by means of cameras and microphones
Field of the invention
The present invention relates to a method and system for remotely guarding an area by means of cameras and microphones at several locations within that area, which are connected to a central surveillance post.
Background
Surveillance cameras for monitoring public areas have widespread applications, especially in urban areas. For example, GB 2408880 discloses a camera surveillance system with a plurality of cameras and a user interface that allows cameras to be prioritized based on detected activity. Although the use of such cameras is very useful in guarding such areas, the effectiveness of such systems could be improved.
It is known to use audio input from the location of the camera to improve the effectiveness. US 2007/182819 proposes to digitize known analog camera surveillance systems. It mentions the possibility of using sensors for fire, smoke, sound, glass breakage, motion, panic buttons and the like to trigger a camera activation event to switch a camera into transmission mode, or to alert a server. WO 2007/095994 discloses a camera surveillance system that shows a mosaic of images and renders related sound, using stereophonic sound to associate sound with the display position of the corresponding image. When an event occurs in one image, sound corresponding to the other images may be switched off.
However, in most countries legal privacy regulations forbid eavesdropping (except under special conditions). Conventionally, surveillance camera systems for areas where such prohibitions are in force comply with the prohibition because they do not provide for audio output. But this way of addressing the prohibition on eavesdropping limits the effectiveness of the surveillance system.
Summary
It is one aim to improve the effectiveness of such a system by combining it with audible information in a way that prevents unlimited eavesdropping. Thus it can be made possible to use audible information to improve the effectiveness without infringing privacy regulations.
A method according to claim 1 is provided. Herein cameras and sound sources at a plurality of locations within an area are used to support remote guarding of the area. Each sound source comprises a microphone or a group of microphones, the cameras and sound sources being coupled to a surveillance post. The method comprises the steps of: deriving, per sound source, an attention value based on the sound picked up by that sound source; comparing the attention values with a predetermined threshold value; and, in response to detection that the attention value for a particular one of the sound sources has passed the predetermined threshold value, audibly rendering a sound representation of the sound picked up by the particular one of the sound sources causing the threshold passage, limited to a time interval of at most a predetermined length. By rendering the sound in response to detection for at most a time interval of predetermined length, the intelligibility of the sound for eavesdropping is limited. In this way it is made possible to use audio to improve the effectiveness without infringing privacy regulations.
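Purely as an illustration beyond the patent text, the gating logic of this method might be sketched as follows in Python. The attention measure, the threshold value and the names guard_loop and render_audio are assumptions for the sketch, not part of the disclosure.

```python
import time

MAX_RENDER_SECONDS = 10.0   # "at most a predetermined length"
ATTENTION_THRESHOLD = 0.8   # predetermined threshold value (illustrative)

def attention_value(block):
    """Placeholder attention measure: mean absolute amplitude of a block."""
    return sum(abs(s) for s in block) / len(block)

def guard_loop(frames, render_audio):
    """frames: iterable of (source_id, sample_block) tuples in arrival order.
    render_audio: callback that plays a block audibly for the operator."""
    render_until = {}  # source_id -> time at which its rendering window closes
    for source_id, block in frames:
        now = time.monotonic()
        if (attention_value(block) > ATTENTION_THRESHOLD
                and source_id not in render_until):
            # Threshold passage detected: open a time-limited rendering window.
            render_until[source_id] = now + MAX_RENDER_SECONDS
        if source_id in render_until:
            if now < render_until[source_id]:
                render_audio(source_id, block)  # audible, within the window
            else:
                # Window expired: go silent again (a real system would likely
                # also add a refractory period before re-triggering).
                del render_until[source_id]
```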
In an embodiment the time interval in which the audio is rendered has a length of at most ten seconds. It has been found that this usually limits intelligibility. The time interval may for example be at least two seconds long, and more preferably at least five seconds. This makes it possible to observe emotions in the sound. When emotions can be observed, the operator can semantically comprehend the emotional components (in particular fear, anger, excitement etc.) in the audible signals picked up in the vicinity of the cameras. These components should, after transfer to the operator, attract his or her attention in a natural way and trigger him or her to pay attention to the location at which such an (e.g. excited) audible signal originated or was recorded.
In an embodiment the various camera and microphone locations are displayed on a map of said area, and a representation of the location of the sound source where the attention value exceeded the threshold is indicated in relation to said map in response to the detection. The indication may be realized visually, for example by activating display of a predetermined color at a location on the map that corresponds with the sound source, and/or auditively, for example by means of a synthetic stereophonic signal that suggests that the sound comes from that location. This helps to keep the operator's focus on the location of the source of the sound even after the time interval in which the sound is rendered has expired.
In an embodiment, the visual indication of the map location is given at least following the time interval in which the sound from the sound source is rendered. This helps the operator identify the location even if the operator did not identify it during the time interval, without allowing eavesdropping. The visual indication may also be given during the time interval. During the time interval an audio indication of the location may be given by rendering the sound from the sound source stereophonically. After the interval, other sound, that is, sound not picked up by the sound source, may be rendered in this way to indicate the location.
In an embodiment sound from the sound source is first processed before it is rendered, such that intelligibility is reduced. The processing may involve time and/or frequency domain filtering, adding echoes, and/or scrambling or fragmenting of the sound representation, reducing the overall semantic or linguistic intelligibility of the sound to a level which complies with the relevant privacy regulations related to eavesdropping. Preferably a form of processing is used that results in rendering of at least the lowest frequencies of speech sounds. This provides for recognition of emotions.
In an embodiment a value of a measure of intelligibility of the sound is determined and the processing is controlled dependent on the value. In an embodiment said processing may be enabled or disabled dependent on whether the value is above or below a predetermined threshold. In an embodiment said processing may be adapted so as to reduce the value of the measure of intelligibility. The Speech Transmission Index (abbreviated STI; see for its definition e.g. en.wikipedia.org/wiki/Speech_Transmission_Index) may be used as a measure of intelligibility, for example. The STI of the processed sound is reduced, e.g. by means of signal scrambling or addition of noise, to a maximum of e.g. 0.35 or less.
To comply with the aim to provide that the relevant audible signals, picked up in the vicinity of the cameras and processed as indicated above, will attract the operator's attention and guide him to the location on his observation screen where the (e.g. excited) sound originated, the location representation of that sound may preferably be performed by spatial (two- or three-dimensional) audible reproduction of that sound representation in the vicinity of the observation screen. As such an observation screen (which may be formed by a group of cooperating display screens) will normally have rather large dimensions, the operator's attention can be attracted when the sound representations originating at several microphone locations are reproduced (i.e. when the attention value of the sound passes a predetermined threshold value) via a spatial audio reproduction system. It should be noted that the sounds as such may be picked up by single-channel microphones; their sound representations, however, are reproduced via a spatial audio system in the vicinity of the observation screen in such a way that, in the operator's perception, they come from the direction of the location, as mapped on the observation screen, where the sound was produced or recorded.
Additionally or optionally, the sound originating location may be represented by means of a visual display of the location where the sound has been produced, e.g. by means of any form of highlighting of that location on the area map on the observation screen.
Brief description of the drawing
These and other objects and advantageous aspects will become apparent from a description of exemplary embodiments, using the following figures.
Figure 1 shows an exemplary embodiment of a surveillance system; Figure 2 shows the diagram of an exemplary embodiment of a subsystem for sound processing.
Detailed description of exemplary embodiments
Figure 1 shows use of a system for remotely guarding an area (the centre of a city in the example) using cameras and microphones at several locations within that area, which are connected to a central surveillance post.
Figure 2 shows the system, including an observation screen 1, a microphone (MIC), an event detector (ED) 2, an intelligibility reductor (IR) 3, a 2D renderer (2DR) 4, a set of loudspeakers 5, and a video screen driver (VD) 6. Event detector (ED) 2 and intelligibility reductor (IR) 3 have inputs coupled to the microphone MIC. Event detector (ED) 2 has an output coupled to a control input of intelligibility reductor (IR) 3. Intelligibility reductor (IR) 3 has an output coupled to 2D renderer (2DR) 4 and a video screen driver (VD) 6. In another embodiment event detector (ED) 2 has an output coupled to control inputs of intelligibility reductor (IR) 3 and video screen driver (VD) 6, and optionally 2D renderer (2DR) 4, intelligibility reductor (IR) 3 not being coupled to video screen driver (VD) 6. 2D renderer (2DR) 4 and video screen driver (VD) 6 are coupled to observation screen 1. Although only one microphone MIC is shown in the figure, it should be appreciated that a plurality of microphones or groups of microphones may be used. Furthermore, the system comprises a plurality of cameras (not shown). The system may comprise auxiliary screens 10 (figure 1) coupled to the cameras, for showing images from different cameras.
In operation the system is used for remotely guarding an area by means of cameras and microphones at several locations within that area, which are connected to a central surveillance post. At observation screen 1, the various camera and microphone locations are displayed on a map of the area, enabling selective activation, e.g. by a screen observing operator 9, of one or more camera images for zooming in. Per microphone or group of microphones, called sound source hereinafter, an attention value is derived based on the sound picked up by that sound source. When the attention value passes a predetermined threshold value, a representation of the sound picked up by the sound source causing the threshold passage, called sound representation hereinafter, is output. The output includes an audible and/or visual representation of the location of the sound source causing the threshold passage, called location representation hereinafter.
When event detector (ED) 2 detects an event, it causes intelligibility reductor (IR) 3 to pass a sound signal from the microphone during a limited time interval, optionally after applying a form of processing that further reduces intelligibility. In addition, the signal from event detector (ED) 2 is used to control 2D renderer (2DR) 4 and video screen driver (VD) 6 to select a position representation that is rendered by means of synthetic stereophony and displayed at a selected location on observation screen 1, according to the position of the microphone MIC for which an event is detected. The observation screen 1 is arranged for displaying the various camera (Cam) and microphone (Mic) locations on a map of the area. Event detector (ED) 2 and intelligibility reductor (IR) 3 function as means for executing the method as discussed hereinbefore, including processing means and means for the reproduction of the sound representations. 2D renderer (2DR) 4 functions as means for the reproduction of the relevant location representations. The set of loudspeakers 5 provides acoustic location representation, and the video screen driver (VD) 6 provides visual location representation at the observation screen 1.
The relevant area thus can be monitored by means of cameras and microphones at several locations within the area, which are connected to a central surveillance post which accommodates the components shown in figures 1 and 2. By means of the observation screen 1, the various camera and microphone locations are displayed on a map image of the area to be monitored. A screen observing operator 9 is able, e.g. by means of a keyboard, mouse, joystick (not shown) or touch screen, to select and activate cameras and/or camera images to zoom in and out; besides, the operator may be able to move the cameras into different positions. In the vicinity of each camera, microphones are installed, picking up the sound present in the camera's vicinity. In this way the sounds which are present in the vicinity of each camera are transmitted to the surveillance post, which accommodates the system. In the event detector 2, per microphone or group of microphones (sound source), an attention value is derived based on the sound picked up by that sound source. The event detector 2 analyzes the incoming sound and decides, e.g. based on the results of a frequency spectrum and energy level analysis, whether the incoming sound comprises elements like fear or excitement (e.g. screaming), or uncommon noise such as breaking glass. In such cases the attention value should pass a predetermined threshold value, indicating that there might be an event which should be investigated. The attention value may be based on a level of signal power in a selected frequency band, or the steepness of an increase of such a level, or a deviation from a range of spectral distributions of standard sounds. Known sound recognition algorithms may be used to detect specific types of sound.
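As an illustration of one way such an attention value could be computed (the sample rate, band edges and weighting below are assumptions, not taken from the patent), a minimal sketch combining band power with the steepness of its increase might look like this:

```python
import numpy as np

FS = 16000              # sample rate in Hz (assumed)
BAND = (300.0, 3400.0)  # selected frequency band, here the speech band (assumed)

def band_power(block):
    """Mean signal power of the block within BAND, computed via an FFT."""
    spectrum = np.abs(np.fft.rfft(block)) ** 2
    freqs = np.fft.rfftfreq(len(block), d=1.0 / FS)
    mask = (freqs >= BAND[0]) & (freqs <= BAND[1])
    return float(spectrum[mask].mean())

def attention(block, prev_power):
    """Return (attention value, current band power): the value combines the
    band level with the steepness of its increase since the previous block."""
    power = band_power(block)
    rise = max(0.0, power - prev_power)  # steepness of the increase
    return 0.5 * power + 0.5 * rise, power
```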
When the attention value passes a predetermined threshold value, detected in the event detector 2, this detector gives an "on" signal to the intelligibility reductor 3 to pass a representation of the sound picked up by the sound source causing the threshold passage, i.e. a sound representation having a reduced intelligibility. In an embodiment the intelligibility reductor 3 reduces intelligibility by passing no more than a predetermined time interval of sound, for example at most ten seconds. In an embodiment intelligibility reductor 3 may comprise a buffer memory to buffer sound of the sound source before it is rendered. Thus, for example, intelligibility reductor 3 may render a part of buffered sound for which it was subsequently determined that the attention value exceeded the threshold. In addition, an audible representation of the location of the possibly buffered sample of the event sound source causing the threshold passage (location representation) is performed, viz. by reproducing the sound representation (having a reduced intelligibility) by means of a 2D sound rendering subsystem (2DR) 4 and loudspeakers 5 which, by means of audio phase manipulation causing pseudo stereo/quadraphonic sound reproduction (see en.wikipedia.org/wiki/Quadraphonic_sound) and/or sound reproduction via a selected set of loudspeakers 5a and 5b, provides that, in the perception of the operator 9 standing or sitting before his (widescreen) observation screen 1, the sound representation comes from the corresponding location at that observation screen (the lower right corner in figure 1).
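As a sketch of the synthetic-stereophony idea (the constant-power pan law and the function name are assumptions; the patent only requires that the perceived direction match the mapped location), a mono microphone signal could be spread over two loudspeakers like this:

```python
import numpy as np

def pan_to_screen_position(mono, x):
    """Spread a mono block over two loudspeakers so that it is perceived as
    coming from horizontal map position x, where 0.0 is the left edge and
    1.0 the right edge of the observation screen."""
    angle = x * np.pi / 2.0        # constant-power pan law (assumed)
    left = np.cos(angle) * mono
    right = np.sin(angle) * mono
    return np.stack([left, right], axis=1)  # (n_samples, 2) stereo block
```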
In addition to the audible location representation, a visual location representation is presented to the operator, viz. in the form of an image, e.g. as shown in figure 1 (again in the lower right corner), where the relevant microphone location and the neighbouring camera location have been accentuated by (bold) encircling of the relevant location. In this way the operator 9 will be guided, in a natural and intuitive way, to pay attention to the location in which, according to the sound picked up by the microphone(s), something might be wrong. Then the operator may activate the relevant camera (e.g. by using a touch screen or keyboard function) to zoom in, which may be made visible via the same observation screen 1 or, as is suggested in figure 1, via one or more auxiliary screens. In the illustrated example, the operator may have heard (the sound representation of) breaking glass and/or voices crying "Stop thief!!", is guided by that sound to the highlighted location at his screen 1, activates the relevant camera and sees at the auxiliary screen 10 a thief running away. The operator then may contact and inform the police. The display of the visual location representation may continue until it is switched off by the operator, or it may be switched off automatically after a time interval that is longer than the time interval during which sound is rendered from the sound source.
Concerning the sound representation, made in the IR module 3, this may involve producing separate fragmented parts of the sound picked up by the sound source (the microphone(s)), the fragmentation being such that the overall semantic intelligibility of the sound is reduced to a level which complies with the relevant privacy regulations related to eavesdropping. When the length of each fragmented part is limited (e.g. to 10 seconds or less), the intelligibility will be decreased and thus relating a spoken phrase to a particular individual will be made infeasible.
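A minimal sketch of such fragmentation, assuming illustrative fragment and gap lengths (the patent specifies only an upper bound on fragment length):

```python
import numpy as np

def fragment(sound, fs, frag_s=2.0, gap_s=3.0):
    """Pass frag_s seconds of every (frag_s + gap_s)-second period and mute
    the rest, so that no continuous phrase longer than frag_s survives."""
    out = sound.astype(float).copy()
    period = int((frag_s + gap_s) * fs)
    keep = int(frag_s * fs)
    for start in range(0, len(out), period):
        out[start + keep:start + period] = 0.0  # mute the gap
    return out
```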
Another or an additional method for intelligibility reduction is to process (e.g. by scrambling and/or distortion) the sound from the originating sound source such that the intelligibility of the sound is reduced to a level which complies with the relevant privacy regulations related to eavesdropping. In practice it has been found that reducing the Speech Transmission Index of the processed sound to a maximum of 0.35 achieves the desired low intelligibility. The Speech Transmission Index (STI) is a measure for the intelligibility (understanding) of speech, whose value varies from 0 (completely unintelligible) to 1 (perfect intelligibility). On this scale, an STI of at least 0.5 is desirable for most applications (Steeneken, H. J. M., & Houtgast, T. (1980). A physical method for measuring speech-transmission quality. Journal of the Acoustical Society of America, 67, 318-326).
Intelligibility reduction may be realized by processing such as one or more of time and/or frequency domain scrambling, adding echoes, distorting, filtering, or addition of noise. The addition of echoes, for example, is a simple and effective way of reducing intelligibility. Scrambling may involve changing the relative sequence of a series of fragments of the sound. Distortion may involve applying a non-linear function to sound sample values. Filtering may involve reducing the strength of high-frequency components. In an embodiment, the frequency of a lowest sound (e.g. speech) component such as a formant is determined and the relative strength of frequency components at frequencies above this component is reduced. Noise may be added at selected frequencies, for example. Each of these measures reduces the intelligibility of the sound.
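Two of these techniques, echo addition and fragment scrambling, might be sketched as follows; the delays, gain and fragment length are illustrative assumptions:

```python
import random
import numpy as np

def add_echoes(sound, fs, delays_s=(0.12, 0.27), gain=0.7):
    """Add delayed, attenuated copies of the signal to itself."""
    out = sound.astype(float).copy()
    for delay in delays_s:
        n = int(delay * fs)
        out[n:] += gain * sound[:-n]  # echo at the given delay
    return out

def scramble(sound, fs, frag_s=0.25, seed=None):
    """Change the relative sequence of short fragments of the sound."""
    n = int(frag_s * fs)
    fragments = [sound[i:i + n] for i in range(0, len(sound), n)]
    random.Random(seed).shuffle(fragments)
    return np.concatenate(fragments)
```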
The degree of reduction may be increased, for example, by using shorter fragments or increased reordering, adding more echoes, using a non-linear function that deviates more from a linear function, reducing high-frequency components more strongly, reducing more high-frequency components, adding more noise, etc. Preferably a method of reduction is used that preserves the amplitude variation at the lowest speech frequencies more than at higher frequencies. This makes it easier to recognize emotions.
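One way to preserve the low-frequency amplitude variation that carries emotion while suppressing phonetic detail is to strongly attenuate everything above a low cutoff; the cutoff and attenuation values below are assumptions for the sketch:

```python
import numpy as np

def preserve_low_speech(sound, fs, cutoff_hz=500.0, attenuation=0.1):
    """FFT-based filter: leave components below cutoff_hz untouched and
    strongly attenuate everything above, so that prosody (and with it
    emotion) survives while most phonetic detail is lost."""
    spectrum = np.fft.rfft(sound)
    freqs = np.fft.rfftfreq(len(sound), d=1.0 / fs)
    spectrum[freqs > cutoff_hz] *= attenuation
    return np.fft.irfft(spectrum, n=len(sound))
```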
In an embodiment the reduction of intelligibility may be performed in a control loop, wherein the Speech Transmission Index is determined and the degree of reduction is controlled dependent on the Speech Transmission Index. Processing to reduce intelligibility may be switched on or off dependent on the value of the Speech Transmission Index, for example. Thus, in an embodiment, processing to reduce intelligibility may be switched on only if the received audio is sufficiently intelligible speech.
The system may contain respective event detectors (ED) 2 and/or intelligibility reductors (IR) 3 for respective microphones (MIC) or groups of microphones. Alternatively an event detector (ED) 2 may process sound from different microphones on a time multiplexing basis. In another embodiment intelligibility reductor (IR) 3 may comprise a memory for storing recent sound input from different microphones or groups of microphones, intelligibility reductor (IR) 3 outputting, and optionally processing, sound for a microphone or group of microphones that is selected by an event detector (ED) 2 or event detectors (ED) 2. An intelligibility reductor (IR) 3 may be configured to supply an output signal identifying the selected microphone or group of microphones that is the source of the sound that it passes.
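The control loop described above might be sketched as follows; estimate_sti and strengthen are hypothetical placeholders (a real STI estimate follows IEC 60268-16 and is not reproduced here):

```python
STI_MAX = 0.35  # target maximum for the processed sound, from the description

def reduce_until_compliant(sound, fs, estimate_sti, strengthen, max_level=10):
    """Strengthen the intelligibility reduction step by step until the
    estimated Speech Transmission Index is at or below STI_MAX.
    estimate_sti(sound, fs) and strengthen(sound, fs, level) are assumed
    callables; strengthen applies any of the reduction techniques above
    at the given strength level."""
    level = 1
    processed = strengthen(sound, fs, level)
    while estimate_sti(processed, fs) > STI_MAX and level < max_level:
        level += 1
        processed = strengthen(sound, fs, level)
    return processed
```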
Part or all of event detector (ED) 2, intelligibility reductor (IR) 3, 2D renderer (2DR) 4 and video screen driver (VD) 6 may be implemented using a programmable computer or set of computers and a computer program or set of programs to perform the functions as described. Such a computer or set of computers, or dedicated hardware to perform the functions of event detector (ED) 2, intelligibility reductor (IR) 3, 2D renderer (2DR) 4 and video screen driver (VD) 6, will be referred to as a processing system. When it is described that the system is configured to perform a function, it should be understood that this covers hardware implementations, software implementations using a programmed computer, and mixtures of both.
Although an embodiment has been described wherein the position of a microphone is indicated by both a visual and an audio representation, it should be appreciated that one of these, for example a visual representation by the activation of a light at a selected location, may be sufficient. Although an embodiment has been described that combines rendering of audio with a representation of position, it should be appreciated that in some cases it may suffice to render only the sound, without representing the position. Representation of the position facilitates determination of the location of the microphone by the operator, in combination with monitoring, when a limited time interval of sound is rendered. When the intelligibility of the sound is reduced, it may not be necessary to limit the time interval in which audio is rendered after event detection. But when many cameras are used it may be advantageous to limit the time interval even in that case. The microphones may be mounted in camera assemblies together with the cameras. Thus, each sound event can be associated with a camera, and indication of the location of the microphone of a sound event can involve showing the images from the camera associated with that microphone. Alternatively, or in addition, microphones may be used that are at a distance from the cameras.
In an embodiment a method is provided for remotely guarding an area by means of cameras and microphones at several locations within that area, which are connected to a central surveillance post, comprising the steps of: displaying, at an observation screen (1), the various camera and microphone locations on a map of said area; enabling selective activation, e.g. by a screen observing operator (9), of one or more camera images for zooming in; deriving, per microphone or group of microphones, called sound source hereinafter, an attention value based on the sound picked up by that sound source; outputting, when the attention value passes a predetermined threshold value, a
representation of the sound picked up by the sound source causing the threshold passage, called sound representation hereinafter, including an audible and/or visual representation of the location of the sound source causing the threshold passage, called location representation hereinafter.
Optionally said sound representation includes fragmented parts of the sound picked up by the sound source, the fragmentation being such that the overall semantic intelligibility of the sound is reduced to a level which complies with the relevant privacy regulations related to eavesdropping. The length of each fragmented part may have a maximum of 10 seconds. In an embodiment said sound representation may include at least part of the sound picked up by the relevant sound source, processed, however, e.g. by means of time and/or frequency domain scrambling, distorting, filtering etc., such that the intelligibility of the sound is reduced to a level which complies with the relevant privacy regulations related to eavesdropping. In an embodiment the Speech Transmission Index of the processed sound has a maximum of 0.35. Said location representation may be performed by means of spatial audible reproduction of the relevant sound representation in the vicinity of said observation screen. In an embodiment said location representation is performed by means of visual display of the location of the sound source causing said threshold passage.
A system is provided for remotely guarding an area using cameras and microphones at several locations within that area, which are connected to a central surveillance post, including an observation screen (1) arranged for displaying the various camera and microphone locations on a map of said area; the system including means for executing the method according to any of the preceding claims, including processing means and means for the reproduction of said sound representations and location representations respectively.
An advantage of at least some of the embodiments is to provide a system which makes remote monitoring of (urban) areas more lively for the operator (e.g. guardsman), as the visual information offered by the video cameras is supplemented by accompanying "real live audio", however without passing on (private) conversations etc. in a way that their content could be followed, i.e. understood, by the operator.

Claims

1. A method for supporting remote guarding of an area by means of cameras and sound sources at a plurality of locations within that area, each sound source comprising a microphone or a group of microphones, the cameras and sound sources being coupled to a surveillance post, the method comprising the steps of:
deriving, per sound source, an attention value based on the sound picked up by that sound source;
comparing the attention values with a predetermined threshold value; and in response to detection that the attention value for a particular one of the sound sources has passed the predetermined threshold value,
audibly rendering a sound representation of the sound picked up by the particular one of the sound sources causing the threshold passage, limited to a time interval of at most a predetermined length.
2. A method according to claim 1, wherein the predetermined length is ten seconds.
3. A method according to claim 1 or 2, comprising
displaying, at an observation screen (1), the various camera and microphone locations on a map of said area;
outputting a representation of a location of the particular one of the sound sources in relation to said map in response to said detection.
4. A method according to claim 3, wherein said representation of the location of the particular one of the sound sources is performed by means of visual display of the location of the particular one of the sound sources.
5. A method according to any of the preceding claims, comprising processing the sound picked up by the particular one of the sound sources by means of at least one of time- and/or frequency-domain scrambling, adding echoes, distorting, filtering, or addition of noise, thereby reducing the intelligibility of the sound.
6. A method according to claim 5, comprising
- determining a value of a measure of intelligibility of the sound picked up by the particular one of the sound sources and
- applying said processing dependent on said value of the measure of intelligibility.
7. A method according to claim 5 or 6, wherein said processing reduces the Speech Transmission Index of the processed sound to less than or equal to 0.35.
8. A system for supporting remote guarding of an area using cameras and sound sources at a plurality of locations within that area, each sound source comprising a microphone or a group of microphones, comprising
- an audio output device;
- a processing system with an input for audio information from the sound sources, the processing system being configured to derive, per sound source, an attention value based on the sound picked up by that sound source, to compare the attention values with a predetermined threshold value, and, in response to detection that the attention value for a particular one of the sound sources has passed the predetermined threshold value, to audibly render a sound representation of the sound picked up by the particular one of the sound sources causing the threshold passage, limited to a time interval of at most a predetermined length.
9. A system according to claim 8, comprising a display for displaying a map of the area, wherein the processing system is configured to cause output of a signal representing a location of the particular one of the sound sources in relation to said map, in response to said detection.
10. A system according to claim 9, comprising an image display screen, wherein the signal representing the location of the particular one of the sound sources causes a visual display of the location of the particular one of the sound sources on the image display screen.
11. A system according to any of claims 8-10, comprising an audio processor configured to process the sound picked up by the particular one of the sound sources by at least one of time- and/or frequency-domain scrambling, adding echoes, distorting, filtering, or addition of noise, reducing the intelligibility of the sound.
12. A system according to claim 11, wherein the audio processor is configured to determine a value of a measure of intelligibility of the sound picked up by the particular one of the sound sources and to apply said processing dependent on said value of the measure of intelligibility.
13. A system for remotely guarding an area using cameras and sound sources at a plurality of locations within that area, each sound source comprising a microphone or a group of microphones, the cameras and sound sources being coupled to a surveillance post, the system comprising means for executing the method according to any of claims 1-7, including processing means and means for the reproduction of said sound representations and location representations respectively.
14. A computer program product comprising a program of instructions for a programmable computer in a system for remotely guarding an area, the program being configured to cause the computer, when the program is executed by the programmable computer, to execute the method according to any of claims 1-7.
15. A system for supporting remote guarding of an area using cameras and sound sources at a plurality of locations within that area, each sound source comprising a microphone or a group of microphones, comprising
- an audio output device;
- a processing system with an input for audio information from the sound sources, the processing system being configured to derive, per sound source, an attention value based on the sound picked up by that sound source, to compare the attention values with a predetermined threshold value, and, in response to detection that the attention value for a particular one of the sound sources has passed the predetermined threshold value, to audibly render a sound representation of the sound picked up by the particular one of the sound sources causing the threshold passage, after processing the sound picked up by the particular one of the sound sources by at least one of time- and/or frequency-domain scrambling, adding echoes, distorting, filtering, or addition of noise, reducing the intelligibility of the sound to at most a predetermined level.
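For illustration, a minimal sketch of the control flow recited in claims 6 and 12 follows: a value of an intelligibility measure is determined and the reduction processing is applied dependent on that value. The estimator and the processing step are passed in as callables, since an actual Speech Transmission Index measurement (IEC 60268-16) is outside the scope of this sketch; the stand-in callables in the example are purely hypothetical.

```python
import numpy as np

def reduce_until_private(audio, sample_rate, estimate_sti, scramble_step,
                         sti_target=0.35, max_rounds=8):
    """Re-apply the reduction step while the intelligibility measure exceeds
    the target (0.35 per claim 7), with a round limit as a safety stop."""
    for _ in range(max_rounds):
        if estimate_sti(audio, sample_rate) <= sti_target:
            break
        audio = scramble_step(audio, sample_rate)
    return audio

# Toy demonstration with stand-in callables (not a real STI estimator or
# scrambler): pretended measurements fall from 0.8 to 0.3 over two rounds.
measurements = iter([0.8, 0.5, 0.3])
result = reduce_until_private(
    np.zeros(16000), 16000,
    estimate_sti=lambda a, sr: next(measurements),
    scramble_step=lambda a, sr: a,
)
```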
EP10736847A 2009-07-17 2010-07-19 Method and system for remotely guarding an area by means of cameras and microphones Withdrawn EP2454725A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP10736847A EP2454725A1 (en) 2009-07-17 2010-07-19 Method and system for remotely guarding an area by means of cameras and microphones

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
EP09165782A EP2276007A1 (en) 2009-07-17 2009-07-17 Method and system for remotely guarding an area by means of cameras and microphones.
EP10736847A EP2454725A1 (en) 2009-07-17 2010-07-19 Method and system for remotely guarding an area by means of cameras and microphones
PCT/NL2010/050466 WO2011008099A1 (en) 2009-07-17 2010-07-19 Method and system for remotely guarding an area by means of cameras and microphones

Publications (1)

Publication Number Publication Date
EP2454725A1 true EP2454725A1 (en) 2012-05-23

Family

ID=41110692

Family Applications (2)

Application Number Title Priority Date Filing Date
EP09165782A Withdrawn EP2276007A1 (en) 2009-07-17 2009-07-17 Method and system for remotely guarding an area by means of cameras and microphones.
EP10736847A Withdrawn EP2454725A1 (en) 2009-07-17 2010-07-19 Method and system for remotely guarding an area by means of cameras and microphones

Family Applications Before (1)

Application Number Title Priority Date Filing Date
EP09165782A Withdrawn EP2276007A1 (en) 2009-07-17 2009-07-17 Method and system for remotely guarding an area by means of cameras and microphones.

Country Status (2)

Country Link
EP (2) EP2276007A1 (en)
WO (1) WO2011008099A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2822053B1 (en) 2001-03-15 2003-06-20 Stryker Spine Sa ANCHORING MEMBER WITH SAFETY RING FOR SPINAL OSTEOSYNTHESIS SYSTEM
KR102127640B1 (en) * 2013-03-28 2020-06-30 삼성전자주식회사 Portable teriminal and sound output apparatus and method for providing locations of sound sources in the portable teriminal
WO2014199263A1 (en) 2013-06-10 2014-12-18 Honeywell International Inc. Frameworks, devices and methods configured for enabling display of facility information and surveillance data via a map-based user interface
JP5958833B2 (en) 2013-06-24 2016-08-02 パナソニックIpマネジメント株式会社 Directional control system
US10134422B2 (en) 2015-12-01 2018-11-20 Qualcomm Incorporated Determining audio event based on location information

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5666157A (en) * 1995-01-03 1997-09-09 Arc Incorporated Abnormality detection and surveillance system
US20020110264A1 (en) * 2001-01-30 2002-08-15 David Sharoni Video and audio content analysis system

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4060803A (en) * 1976-02-09 1977-11-29 Audio Alert, Inc. Security alarm system with audio monitoring capability
US7023913B1 (en) * 2000-06-14 2006-04-04 Monroe David A Digital security multimedia sensor
GB2408881B (en) * 2003-12-03 2009-04-01 Safehouse Internat Inc Monitoring an environment to produce graphical output data representing events of interest
US20050225634A1 (en) * 2004-04-05 2005-10-13 Sam Brunetti Closed circuit TV security system
JP2008502228A (en) * 2004-06-01 2008-01-24 エル‐3 コミュニケーションズ コーポレイション Method and system for performing a video flashlight
US8624975B2 (en) * 2006-02-23 2014-01-07 Robert Bosch Gmbh Audio module for a video surveillance system, video surveillance system and method for keeping a plurality of locations under surveillance
GB0709329D0 (en) * 2007-05-15 2007-06-20 Ipsotek Ltd Data processing apparatus


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of WO2011008099A1 *

Also Published As

Publication number Publication date
EP2276007A1 (en) 2011-01-19
WO2011008099A1 (en) 2011-01-20

Similar Documents

Publication Publication Date Title
US10848872B2 (en) Binaural recording for processing audio signals to enable alerts
US7936885B2 (en) Audio/video reproducing systems, methods and computer program products that modify audio/video electrical signals in response to specific sounds/images
EP2454725A1 (en) Method and system for remotely guarding an area by means of cameras and microphones
US20110092249A1 (en) Portable blind aid device
WO2015162645A1 (en) Audio processing apparatus, audio processing system, and audio processing method
EP3640935B1 (en) Notification information output method, server and monitoring system
US9652961B2 (en) Alarm notifying system
US8704893B2 (en) Ambient presentation of surveillance data
US20170354796A1 (en) Selective amplification of an acoustic signal
JP2006331388A (en) Crime prevention system
JPH0686295A (en) Monitor camera device
CN111275909B (en) Security early warning method and device
WO2011000113A1 (en) Multiple sound and voice detector for hearing- impaired or deaf person
CN105474665A (en) Sound processing apparatus, sound processing system, and sound processing method
JP2008500603A (en) Monitoring system and monitoring method
CN100483471C (en) Signalling system with imaging sensor
KR101882309B1 (en) safety light and safety system using voice recognition
JP2002297199A (en) Method and device for discriminating synthesized voice and voice synthesizer
KR101578108B1 (en) Scream detecting device for surveillance systems based on audio data and, the method thereof
US20190371146A1 (en) Burglary deterrent solution
JP5136074B2 (en) Earthquake early warning assistance device
JP2005184592A (en) Intercom system
US20230125575A1 (en) Fire panel audio interface
JP2014011609A (en) Information transmission system, transmitter, receiver, information transmission method, and program
DE102017011315B3 (en) Alarm-enabled microphone

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20120118

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO SE SI SK SM TR

DAX Request for extension of the european patent (deleted)
RAP1 Party data changed (applicant data changed or rights of an application transferred)

Owner name: NEDERLANDSE ORGANISATIE VOOR TOEGEPAST-NATUURWETENSCHAPPELIJK ONDERZOEK TNO

17Q First examination report despatched

Effective date: 20151110

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20160521