WO2013132216A1 - Method and apparatus for determining the number of sound sources in a targeted space

Info

Publication number
WO2013132216A1
WO2013132216A1 (PCT/GB2013/050271)
Authority
WO
WIPO (PCT)
Prior art keywords
sound
time segments
sectors
output signal
time
Application number
PCT/GB2013/050271
Other languages
French (fr)
Inventor
Erich ZWYSSIG
Original Assignee
EADS UK Limited
Application filed by EADS UK Limited
Publication of WO2013132216A1

Classifications

    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01S - RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S3/00 - Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic, or electromagnetic waves, or particle emission, not having a directional significance, are being received
    • G01S3/80 - Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic, or electromagnetic waves, or particle emission, not having a directional significance, are being received using ultrasonic, sonic or infrasonic waves
    • G01S3/802 - Systems for determining direction or deviation from predetermined direction
    • G01S3/808 - Systems for determining direction or deviation from predetermined direction using transducers spaced apart and measuring phase or time difference between signals therefrom, i.e. path-difference systems
    • G01S3/8086 - Systems for determining direction or deviation from predetermined direction using transducers spaced apart and measuring phase or time difference between signals therefrom, i.e. path-difference systems, determining other position line of source
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/20 - Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00 - Details of transducers, loudspeakers or microphones
    • H04R1/20 - Arrangements for obtaining desired frequency or directional characteristics
    • H04R1/32 - Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only
    • H04R1/40 - Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers
    • H04R1/406 - Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers, microphones
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2201/00 - Details of transducers, loudspeakers or microphones covered by H04R1/00 but not provided for in any of its subgroups
    • H04R2201/40 - Details of arrangements for obtaining desired directional characteristic by combining a number of identical transducers covered by H04R1/40 but not provided for in any of its subgroups
    • H04R2201/401 - 2D or 3D arrays of transducers
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2430/00 - Signal processing covered by H04R, not provided for in its groups
    • H04R2430/20 - Processing of the output signals of the acoustic transducers of an array for obtaining a desired directivity characteristic
    • H04R2430/23 - Direction finding using a sum-delay beam-former
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 - Circuits for transducers, loudspeakers or microphones
    • H04R3/005 - Circuits for transducers, loudspeakers or microphones for combining the signals of two or more microphones

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention relates to a method of determining the number of sound sources in a targeted space having a sound sensor array operable to detect sound signals from at least one of said sound sources and to provide, from each of the sound sensors, a stream of output signal represented in a plurality of time segments defined over a time interval, the method comprising determining a direction of said sound signals arriving at at least some of said plurality of sound sensors in said time segments, mapping said determined direction of said sound signal to at least one of a plurality of sectors of a sector activity map corresponding to said determined direction, determining the number of occurrences in which said determined direction of said sound signal is mapped to said at least one of said plurality of sectors in each of said time segments, and determining the number of sound sources by identifying the sector having the highest number of occurrences in each of said time segments over said time interval.

Description

SIGNAL PROCESSING METHODS AND APPARATUS
Field of the Invention
The invention relates generally to signal processing and, more particularly, to a method of processing a signal in a speaker diarisation system.
Background of the Invention
Recently, speaker diarisation has become a popular area of research. This is because speaker diarisation is an important technology for a number of applications, including security applications in the areas of law enforcement, crisis management, and military command and control (C2), and commercial applications such as information retrieval in the business and financial sectors, for example analysing a meeting (or a meeting recording) with conversations involving several people.
Speaker diarisation relates to the problem of "who spoke when?", or more formally, it aims to determine the number of active speakers in a recording (or in real-time) and identify when each speaker is talking.
Speaker diarisation is typically carried out in three steps: (i) detecting when speech is present in the recording, (ii) splitting the speech segments where the speaker changes mid-segment, and (iii) identifying and clustering speech segments from the same speaker.
It is noted that the accuracy of a speaker diarisation system relies heavily on determining the correct number of speakers.
Summary of the Invention
In a first aspect of the invention there is provided a method of determining the number of sound sources in a targeted space having a sound sensor array, each of said sound sensors in said array being operable to detect sound signal from at least one of said sound sources and to provide a stream of output signal represented in a plurality of time segments defined over a time interval, wherein each of said time segments is defined over a predetermined time period shorter than said time interval, the method comprising:
processing each of said time segments to determine a direction of said sound signal arriving at at least some of said plurality of sound sensors relative to said at least one of said sound sources in said time segments, generating a sector activity map representing a plurality of sectors corresponding to the geometry of said sound sensor array;
mapping said determined direction of said sound signal to at least one of said plurality of sectors corresponding to said determined direction;
accumulating a count value in each of said plurality of sectors, wherein said count value represents the number of occurrences in which said determined direction of said sound signal is mapped to said at least one of said plurality of sectors in each of said time segments;
identifying the sector having the highest count value in each of said time segments over said time interval; and
determining the number of sound sources based on the number of identified sectors.
In an embodiment of the invention, the method further comprises combining the streams of output signal from each of said sound sensors to generate a combined stream of output signal represented in said plurality of time segments over said time interval.
The step of determining the number of sound sources may comprise processing said combined stream of output signal and said identified sectors over said time interval.
Said processing may comprise assigning each of said identified sectors in each of said time segments to a corresponding time segment in said combined stream of output signal.
Said processing may further comprise determining a reference stream portion for each of said identified sectors, the reference stream portion comprising at least one of said time segments of said combined stream of output signal.
Said processing may further comprise performing a statistical test on each of said time segments assigned with said identified sectors with said reference stream portion determined for the corresponding sector.
The step of performing said statistical test may include determining a Bayesian Information Criterion value of each of said time segments assigned with said identified sectors.
Said processing may further comprise generating a speaker matrix and accumulating a value in position (i, j) of said matrix, where i corresponds to the sector of said assigned time segment, and j corresponds to the sector corresponding to said reference stream portion with said Bayesian Information Criterion value exceeding a predefined threshold. The step of determining the number of sound sources may further comprise identifying the number of entries in said speaker matrix.
In another embodiment of the invention the method further comprises detecting presence of sound signal in said combined stream of output signal.
The step of detecting presence of sound signal may include performing voice or speech activity detection.
Said processing may further comprise processing said time segments of the combined stream of output signal in which sound signal is present, and said identified sectors, over said time interval.
The step of determining said direction of said sound signal may further comprise determining time difference of arrival, TDOA, values associated with said sound signal arriving at said at least some of said plurality of sound sensors.
The step of determining said direction of said sound signal may further comprise determining an angle of arrival of said sound signal relative to said plurality of sound sensors based on said determined TDOA values.
In a second aspect of the invention there is provided an apparatus for determining the number of sound sources in a targeted space having a sound sensor array, each of said sound sensors in said array being operable to detect sound signal from at least one of said sound sources and to provide a stream of output signal represented in a plurality of time segments defined over a time interval, wherein each of said time segments is defined over a predetermined time period shorter than said time interval, the apparatus comprising a processor operable to process each of said time segments to determine a direction of said sound signal arriving at at least some of said plurality of sound sensors relative to said at least one of said sound sources in said time segments, and a database generator for generating a sector activity map representing a plurality of sectors corresponding to the geometry of said sound sensor array, wherein the processor is further operable to:
map said determined direction of said sound signal to at least one of said plurality of sectors corresponding to said determined direction;
accumulate a count value in each of said plurality of sectors, wherein said count value represents the number of occurrences in which said determined direction of said sound signal is mapped to said at least one of said plurality of sectors in each of said time segments;
identify the sector having the highest count value in each of said time segments over said time interval; and
determine the number of sound sources based on the number of identified sectors.
The processor may be further operable to combine the streams of output signal from each of said sound sensors to generate a combined stream of output signal represented in said plurality of time segments over said time interval.
The processor may be further operable to process said combined stream of output signal and said identified sectors over said time interval.
The processor may be further operable to assign each of said identified sectors in each of said time segments to a corresponding time segment in said combined stream of output signal.
The processor may be further operable to determine a reference stream portion for each of said identified sectors, the reference stream portion comprising at least one of said time segments of said combined stream of output signal.
The processor may be further operable to perform a statistical test on each of said time segments assigned with said identified sectors with said reference stream portion determined for the corresponding sector.
The processor may be further operable to determine a Bayesian Information Criterion value of each of said time segments assigned with said identified sectors.
The database generator may be further operable to generate a speaker matrix and accumulate a value in position (i, j) of said matrix, where i corresponds to the sector of said assigned time segment, and j corresponds to the sector corresponding to said reference stream portion with said Bayesian Information Criterion value exceeding a predefined threshold.
The processor may be operable to identify the number of entries in said speaker matrix. In one embodiment of the invention the apparatus further comprises a detector for detecting presence of sound signal in said combined stream of output signal.
The detector may be operable to perform voice or speech activity detection.
The processor may be further operable to process said time segments of the combined stream of output signal in which sound signal is present, and said identified sectors, over said time interval.
The processor may be further operable to determine time difference of arrival, TDOA, values associated with said sound signal arriving at said at least some of said plurality of sound sensors.
The processor may be further operable to determine an angle of arrival of said sound signal relative to said plurality of sound sensors based on said determined TDOA values.
An aspect of the invention provides a computer program product comprising computer executable instructions which, when executed by a computer, cause the computer to perform a method as set out above. The computer program product may be embodied in a carrier medium, which may be a storage medium or a signal medium. A storage medium may include optical storage means, or magnetic storage means, or electronic storage means.
The above aspects of the invention can be incorporated into a specific hardware device, a general purpose device configured by suitable software, or a combination of both. The invention can be embodied in a software product, either as a complete software implementation of the invention, or as an add-on component for modification or enhancement of existing software (such as a plug-in). Such a software product could be embodied in a carrier medium, such as a storage medium (e.g. an optical disk or a mass storage memory such as a FLASH memory) or a signal medium (such as a download). Specific hardware devices suitable for the embodiment of the invention could include an application specific device such as an ASIC, an FPGA, a GPU, a CPU, or a DSP, or other dedicated functional hardware means. The reader will understand that none of the foregoing discussion of embodiments of the invention in software or hardware limits future implementation of the invention on yet to be discovered or defined means of execution.
Brief description of the Drawings
In the following, embodiments of the invention will be explained in more detail with reference to the drawings, in which:
Figure 1 illustrates an arrangement of a sound sensing device deployed in a targeted space according to an embodiment of the invention;
Figure 2 is a top view of the sound sensing device of Figure 1, illustrating an array of sound sensors implemented as part of the sound sensing device according to an embodiment of the invention;
Figure 3 is a side view of the sound sensing device of Figure 1, illustrating an array of sound sensors implemented as a part of the sound sensing device according to an embodiment of the invention;
Figure 4 illustrates a block diagram representation of a speaker diarisation device according to an embodiment of the invention;
Figure 5 illustrates a block diagram representation of a speaker diarisation device according to another embodiment of the invention;
Figure 6 illustrates a flow diagram of a method of determining the number of sound sources according to an embodiment of the invention;
Figure 7 illustrates a flow diagram of a process of performing sector activity count according to an embodiment of the invention;
Figure 8 illustrates a flow diagram of a method of determining the number of sound sources according to another embodiment of the invention;
Figure 9 illustrates a flow diagram of a process of determining a speaker matrix according to an embodiment of the invention; and
Figure 10 illustrates a flow diagram of a method of determining the number of sound sources according to yet another embodiment of the invention.
Detailed Description
Specific embodiments of the present invention will be described in further detail on the basis of the attached diagrams. It will be appreciated that this is by way of example only, and should not be viewed as presenting any limitation on the scope of protection sought.
An overview of a deployment of a sound sensing device 10 for determining the number of sound sources in a targeted space, for example an indoor space (such as a meeting room 16), is illustrated in Figure 1. A person skilled in the art will appreciate that the sound sensing device 10 can also be deployed in an outdoor environment, although noise reduction and filtering techniques may need to be applied to reduce background noise.
As will be described in due course, the sound sensing device 10 is capable of transducing sound signals from a number of sound sources into electrical signals. In this example, the sound sources include a group of participants 12a, 12b, 12c, 12d, 12e gathered about a meeting table 14 in the meeting room 16. It is understood that the sound sources may also include a telephone being put on a speaker mode and/or sounds from a television when a video conference is being held. It is further noted that the number of sound sources may be less than the number of participants, for example there may be only two or three participants involved in a discussion.
As shown in Figure 1 , the sound sensing device 10 is positioned around the centre of the meeting table 14. However, it is understood that the sound sensing device 10 can also be positioned anywhere on the table 14, for example on an end of the meeting table 14. Indeed in an alternative arrangement, the sound sensing device 10 may be deployed by attaching it to the ceiling of the meeting room 16.
Figures 2 and 3 illustrate a top view and a side view respectively, of the sound sensing device 10.
As illustrated in Figure 2, the sound sensing device 10 comprises an array of sound sensors 20a-h, such as microphones, for detecting sound signals at a given time. It will be appreciated by the person skilled in the art that any suitable means for detecting sound signals may be employed.
The sound sensors 20a-h are equally spaced around the circumference of the sound sensing device 10 to provide omni-directional coverage, receiving sound signals from all directions. While a circular configuration is illustrated, it will be appreciated by the skilled person that other configurations are also possible. As illustrated in Figure 2, eight sound sensors are provided, and it is understood that the accuracy of sound source localisation increases with the number of sound sensors implemented. However, it is also noted that this does not prevent the present invention from being employed in a set-up with more or fewer than eight sound sensors.
Figure 4 illustrates schematically components of a speaker diarisation device 30 according to an embodiment of the invention. The speaker diarisation device 30 includes an input/output (I/O) interface 32, a working memory 34, a signal processor 36, and a mass storage unit 38.
The sound sensing device 10 comprising the sound sensor array 20a-h is in communication with the speaker diarisation device 30 to provide sound signals detected by the sound sensor array 20a-h to the speaker diarisation device 30. In an alternative configuration, the sound sensing device 10 may also be integrated with the speaker diarisation device 30.
As shown in Figure 4, the output of the sound sensing device 10 is connected to the signal processor 36 via the I/O interface 32. By this connection, the detected sound signals can be input to the signal processor 36. The I/O interface 32 also includes an analogue-to-digital converter (ADC) 40 which converts the analogue output signals from the sound sensor array 20a-h into digital input signals. It will be appreciated that the sound sensing device 10 may also include an ADC (not shown) to provide digital signals directly from its output. In operation, the sound sensing device 10 continuously monitors sound signals and provides the detected sound signals to the signal processor 36 and/or the mass storage unit 38. The received sound signals may be processed in real-time by the signal processor 36 or stored as data in the mass storage unit 38 to be post-processed when required. In the context of speaker diarisation, it is noted that the sound signals generally comprise speech signals. The output of each of the sound sensors is an output signal stream comprising presence and absence of speech signals over a time interval, for example the duration of a meeting.
By means of a general purpose bus 42, external devices such as the sound sensing device 10, user input devices (not shown), and audio/video hardware devices (not shown) are in communication with the signal processor 36 through the I/O interface 32.
The user operable input devices 42 may comprise, in this example, a keyboard and a mouse, though it will be appreciated that any other input devices could also or alternatively be provided, such as another type of pointing device, a writing tablet, speech recognition means, or any other means by which a user input action can be interpreted and converted into data signals.
Audio/video hardware devices can also be connected to the I/O interface for the output of information to a user. Audio/video output hardware devices can include a visual display unit, a speaker, or any other device capable of presenting information to a user.
The signal processor 36 is operable to execute machine code instructions stored in a working memory 34 and/or retrievable from a mass storage unit 38. The signal processor 36 processes the incoming signals according to the method described in the forthcoming paragraphs.
In another embodiment of the invention, the speaker diarisation device includes a communication unit configured to transmit stored data or to relay the received sound signals to a remote location for processing. A schematic diagram of such a speaker diarisation device 50 according to this embodiment is illustrated in Figure 5.
As shown in Figure 5, the speaker diarisation device 50 includes a communication unit 66 operable to establish communication with a remote station. It is noted that such an arrangement allows the received sound signals to be processed at the remote station. It is further noted that the received sound signals (and/or the stored data) may be transmitted in real-time to the remote station or they may be stored in the mass storage unit 58 and transmitted to the remote station when required.
In one embodiment of the present invention, there is provided a method of accurately determining the number of sound sources in a targeted space for a speaker diarisation system.
Figure 6 illustrates a flow diagram of a speaker diarisation method according to an embodiment of the invention.
Referring to the flow diagram of Figure 6, the process begins with receiving data at step 100. The received data comprises streams of output signals from each of the sound sensors received over a time interval. In this embodiment, the data corresponds to the received signals (when real-time processing is performed), or the received signals previously stored in the mass storage unit (when post-processing is performed). In step 102, noise reduction is performed on the data associated with each of the output signals received via sound sensors 20a-h of the sound sensor array 20 to reduce the amount of noise present in the output signals. In this example, Wiener filtering is applied to the received signals to remove any additive noise present in the signals. It will be appreciated that other noise reduction techniques or any suitable means of reducing noise of output signals can be applied. It will also be appreciated by the person skilled in the art that this is not an essential step for the purpose of the present invention. However, it is understood that the overall diarisation error rate (DER) may be reduced by performing this step.
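By way of illustration only, the following Python sketch applies a simple Wiener-style gain in the short-time Fourier domain. The patent does not prescribe a particular estimator; the assumption that the first few frames contain background noise only, and the function name wiener_denoise and its parameters, are choices made here for the example.

```python
import numpy as np
from scipy.signal import stft, istft

def wiener_denoise(x, fs, noise_frames=10):
    """Single-channel Wiener-style noise reduction in the STFT domain,
    assuming the first `noise_frames` frames contain background noise only."""
    f, t, X = stft(x, fs=fs, nperseg=512)
    noise_psd = np.mean(np.abs(X[:, :noise_frames]) ** 2, axis=1, keepdims=True)
    sig_psd = np.maximum(np.abs(X) ** 2 - noise_psd, 1e-12)  # crude signal PSD
    gain = sig_psd / (sig_psd + noise_psd)                   # Wiener gain per bin
    _, x_hat = istft(gain * X, fs=fs, nperseg=512)
    return x_hat
```

In the arrangement of Figure 6, such a filter would be applied independently to each of the eight sensor streams before TDOA estimation.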
In an embodiment of the invention, one of the sound sensors 20a-h in the sound sensor array 20 is assigned as a reference. In this example, the sound sensor 20a is assigned as the reference. It is noted that the reference sound sensor may change during the time interval, or the same reference sound sensor may be used throughout the whole time interval.
Each of the remaining sound sensors is paired with the reference sound sensor 20a, resulting in seven pairs of sound sensors as depicted in the following table.
Pair 1: 20a and 20b; Pair 2: 20a and 20c; Pair 3: 20a and 20d; Pair 4: 20a and 20e; Pair 5: 20a and 20f; Pair 6: 20a and 20g; Pair 7: 20a and 20h
Table 1
In step 104, time difference of arrival (TDOA) estimation is performed to identify the time difference between signals from a given sound source arriving at a pair of sound sensors. Thus, the TDOA estimation produces seven outputs; each output corresponds to an output from a respective pair of sound sensors. One example of performing TDOA estimation on the output signals is the Generalised Cross Correlation with Phase Transform (GCC-PHAT). The resultant TDOA estimates are further improved by performing Viterbi smoothing in step 106. However, it will be understood that this is not an essential step for the purpose of the present invention.
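As an illustration of the GCC-PHAT estimation mentioned above, the following Python sketch computes a TDOA estimate for one pair of sound sensors; the function name gcc_phat and its parameters are illustrative assumptions, not part of the disclosure.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """TDOA (seconds) of `sig` relative to `ref` via GCC-PHAT for one sensor pair."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12                      # phase transform (PHAT) weighting
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2 if max_tau is None else min(n // 2, int(max_tau * fs))
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift   # lag (in samples) of the CC peak
    return shift / fs
```

In the set-up described, this estimator would be run once per pair in Table 1, yielding the seven TDOA outputs of step 104.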
Each of the outputs of the TDOA estimation is provided to determine a sector activity map (in step 108).
The sector activity mapping is performed in step 108, and will be described in the forthcoming paragraphs, with reference to Figure 7.
Referring to the flow diagram in Figure 7, the process begins with receiving the TDOA estimation values in step 200.
Since the relative location of a pair of sound sensors is known, given the TDOA values for the pair of sound sensors, the angle of arrival (AOA) of a sound signal in relation to the sound sensors can be determined. In fact, due to rotational symmetry, for a pair of sound sensors a single delay estimate results in two angles of arrival, that is, the actual angle of arrival and a second angle of arrival reflected about the axis through the two sound sensors.
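The relationship between a TDOA value and the two candidate angles of arrival can be sketched as follows, assuming far-field sources and a known sensor spacing; the speed-of-sound constant and the function name are illustrative.

```python
import numpy as np

C = 343.0  # speed of sound in air (m/s), an assumed room-temperature value

def aoa_from_tdoa(tau, mic_distance):
    """Far-field angles of arrival (degrees) for one sensor pair.

    cos(theta) = c * tau / d, so a single delay estimate yields theta and
    -theta, the reflection about the axis through the two sensors noted above."""
    cos_theta = np.clip(C * tau / mic_distance, -1.0, 1.0)
    theta = np.degrees(np.arccos(cos_theta))
    return theta, -theta
```

Combining the pairs of the array resolves this two-fold ambiguity, since only the true direction is consistent across all seven pairs.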
In order to identify the angle of the sound source in relation to the sound sensor array, a sector activity (SA) map of N = 36 sectors is used such that a main lobe of 10 degrees is provided in each sector. The sector map records the number of times a sound signal arrives in a particular sector. It will be appreciated that any number of sectors can be used, but the accuracy of sound source localisation can be increased with the number of sectors.
The estimated TDOA values of each microphone pair are provided, and the corresponding AOA values are determined every predetermined time interval (for example, 1 second) over a predetermined time window (for example, a 5 second window) until the end of the output signal stream.
The determined AOA values are mapped onto the SA map. Essentially, a count is incremented and accumulated, over the predetermined time interval, in the sector of the SA map (step 204) that corresponds to a determined AOA value (step 206). At the end of the predetermined time interval, the sector with the highest value is determined (step 208). The process (steps 202 to 210) is repeated in the next predetermined time interval until the end of the output signal stream (see steps 210 and 212). For example, where the predetermined time interval is 1 second and the counts are accumulated over a 5 second time window, the highest scoring sector for each time window is recorded, such that for every second of the output signal stream the sector with the most activity over 5 seconds is determined.
Finally, the output of the SA map is provided in step 214. It is appreciated that the output of the SA map comprises a representation of the most active sector identified for every 1 second step over a 5 second time window, until the end of the output signal stream.
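The accumulation described in steps 202 to 214 might be sketched as follows, assuming a trailing 5 second window advanced in 1 second steps and AOA values quantised into 10 degree sectors; the data layout and names are assumptions of this example, not the patent's.

```python
import numpy as np

def sector_activity(aoa_per_second, n_sectors=36, window=5):
    """Return the most active sector for each 1-second step.

    aoa_per_second[t] holds the AOA values (degrees) estimated from all
    sensor pairs during second t of the output signal stream.
    """
    width = 360.0 / n_sectors
    hits = np.zeros((len(aoa_per_second), n_sectors), dtype=int)
    for t, angles in enumerate(aoa_per_second):
        for a in angles:
            hits[t, int((a % 360.0) // width)] += 1  # increment the sector count
    winners = []
    for t in range(len(aoa_per_second)):
        counts = hits[max(0, t - window + 1):t + 1].sum(axis=0)  # accumulate over the window
        winners.append(int(np.argmax(counts)))                   # highest-scoring sector
    return winners
```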
Referring back to Figure 6, the number of sound sources is determined in step 114 by determining the number of active sectors. Speaker diarisation is performed in step 116. It is noted that any suitable method of obtaining the diarisation output may be employed, and therefore details of obtaining the diarisation output will not be described.
In another embodiment of the invention, beamforming is applied using the TDOA estimates to combine the eight outputs from the sensors into a single stream of output signal. The single stream of output signal can be represented as a plurality of time segments over the time interval of the stream.
An example of the beamforming is delay-sum beamforming. As shown in Figure 8, the delay-sum beamforming is performed in step 310. It is noted that the sector activity mapping (step 308) and the delay-sum beamforming (step 310) can be performed simultaneously or consecutively in either order.
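A minimal delay-and-sum sketch, assuming equal-length channels and integer-sample alignment (a practical beamformer would use fractional delays and zero padding rather than np.roll, whose shift wraps around at the edges):

```python
import numpy as np

def delay_and_sum(channels, delays, fs):
    """Time-align each channel by its TDOA (seconds, relative to the
    reference sensor) and average into a single combined stream."""
    out = np.zeros(len(channels[0]))
    for x, tau in zip(channels, delays):
        out += np.roll(x, -int(round(tau * fs)))  # advance the channel by its delay
    return out / len(channels)
```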
In order to avoid an entire segment being incorrectly assigned to a single speaker during diarisation, the Bayesian information criterion (BIC) is employed to determine whether a segment of the combined output contains one or more speakers.
The Bayesian information criterion for an audio cluster, $C_k$, is defined as:

$$\mathrm{BIC}(C_k) = n_k \log\lvert\Sigma_k\rvert + \lambda P \qquad (1)$$

where $n_k$ is the number of samples in the cluster and $\Sigma_k$ is the sample covariance matrix.
The penalty, $P$, is defined as

$$P = \frac{1}{2}\left(d + \frac{1}{2}\,d(d+1)\right)\log N \qquad (2)$$

where $N$ is the total sample size and $d$ is the number of parameters per cluster. It is noted that the penalty weight, $\lambda$, is usually set to 1.
The Bayesian information criterion is then used to calculate whether a speech segment contains one or more different speakers, and to determine whether two speech segments are from the same speaker. The increase in the BIC value for merging two segments $s_1$ and $s_2$ is defined as:

$$\Delta\mathrm{BIC} = n \log\lvert\Sigma\rvert - n_1 \log\lvert\Sigma_1\rvert - n_2 \log\lvert\Sigma_2\rvert - \lambda P \qquad (3)$$

where $n = n_1 + n_2$ and $\Sigma$, $\Sigma_1$ and $\Sigma_2$ are the sample covariance matrices of the merged segment and of the two individual segments respectively.
If the ΔBIC value is greater than zero, then the information content of the merged segments is higher than that of the individual segments, and the two segments are likely to belong to the same speaker and should be merged. Similarly, a speaker change is indicated by a positive peak of the BIC value when a series of BIC values is calculated for a sliding split point of a speech segment. It is noted that, for implementing the BIC, the input speech segment can be modelled as a Gaussian process in the cepstral space.
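As an illustrative sketch of equations (2) and (3) only, the following computes the ΔBIC of merging two segments of feature vectors, modelling each segment as a single full-covariance Gaussian. It assumes the total sample size $N$ in equation (2) is the merged segment length $n = n_1 + n_2$; the function names are illustrative.

```python
import numpy as np

def delta_bic(x1, x2, lam=1.0):
    """Delta-BIC of merging two segments of feature vectors (rows = samples)."""
    def n_logdet(x):
        cov = np.atleast_2d(np.cov(x, rowvar=False))
        _, logdet = np.linalg.slogdet(cov)
        return x.shape[0] * logdet             # n_k * log|Sigma_k|

    d = x1.shape[1]
    n = x1.shape[0] + x2.shape[0]
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)  # equation (2)
    merged = np.vstack((x1, x2))
    return n_logdet(merged) - n_logdet(x1) - n_logdet(x2) - lam * penalty  # equation (3)
```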
In the present embodiment, description of steps 300 to 308 in the flowchart of Figure 8 is similar to steps 100 to 108 described above with respect to the flowchart of Figure 6. For this reason, details of steps 300 to 308 will not be described.
In step 312 a speaker matrix is generated. The step of generating a speaker matrix will now be described in detail with reference to Figure 9.
Referring to Figure 9, the process commences with receiving the output of the beamformer and the output of the SA map in step 400.
In step 402, each of the time segments of the stream of output signal is assigned to the active sector corresponding to its time window. Once each of the time segments has been assigned to a sector, the most probable time segment assigned in each sector is selected as that sector's reference (step 404). Examples of the most probable time segment include the longest series of time segments in a sector, or the time segment(s) with the highest count value.
In step 406, the BIC score of each of the time segments is determined, using equation (3), against each of the reference segments. Accordingly, if the BIC score is greater than a threshold (e.g. 0), a count is incremented in an N × N speaker matrix (SM) at location (i, j), where i corresponds to the sector where the segment was originally assigned, and j corresponds to the sector of the reference segment. Ideally, this matrix would only contain entries on its diagonal, as the originally assigned sector would be the same as the sector with the highest BIC score, and the indices of the entries would then correspond to sectors with sound sources. However, it will be appreciated that in practice the entries tend to cluster around locations on the diagonal of the SM. In step 408, the output of the SM is provided.
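A sketch of the speaker matrix accumulation, reusing the illustrative delta_bic() above; the segment and reference representations are assumptions of this example. The number of sound sources can then be read off the dominant entries of np.diag(sm), as described in the next paragraph.

```python
import numpy as np

def speaker_matrix(segments, sector_of, references, n_sectors=36, thr=0.0):
    """Accumulate BIC matches into an N x N speaker matrix.

    segments:   list of feature matrices, one per time segment.
    sector_of:  sector index originally assigned to each segment.
    references: dict mapping sector index -> reference feature matrix.
    """
    sm = np.zeros((n_sectors, n_sectors), dtype=int)
    for seg, i in zip(segments, sector_of):
        for j, ref in references.items():
            if delta_bic(seg, ref) > thr:  # BIC score above threshold
                sm[i, j] += 1              # ideally accumulates on the diagonal
    return sm
```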
Referring back to Figure 8, the number of sound sources is determined in step 314 based on the output of the SM. The sectors containing sound sources are determined based on the entries on the diagonal of the SM. It is noted that the indices of the peaks correspond to the sector number in which a sound source is predicted to be located, i.e. the speaker sectors, and accordingly the number of sound sources is determined.
Speaker diarisation is performed in step 316. As before, any suitable method of obtaining the diarisation output may be employed, and therefore details of obtaining the diarisation output will not be described.
In another embodiment, Voice Activity Detection (VAD) can be performed to detect the presence or absence of a speech signal in the stream of output signal. It will be appreciated by the person skilled in the art that any suitable method of performing VAD may be employed. As shown in Figure 10, this is performed after the beamforming, in step 512. The remaining steps (steps 500 to 518) in Figure 10 are similar to those described in Figure 8 (steps 300 to 316) above. One of the advantages of performing VAD is that the SM is generated only for time segments of the stream of output signal that contain speech signals. This allows the SM to be generated without processing redundant data (such as time segments that do not contain speech signals), which results in a more efficient use of computing resources.
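The patent leaves the choice of VAD open; purely as a stand-in, a simple frame-energy detector might look like the following (the frame length and threshold margin are assumptions of this sketch):

```python
import numpy as np

def energy_vad(signal, fs, frame_ms=20, margin_db=6.0):
    """Flag frames whose energy exceeds an estimated noise floor by `margin_db`."""
    n = int(fs * frame_ms / 1000)
    frames = signal[: len(signal) // n * n].reshape(-1, n)
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    floor = np.percentile(energy_db, 10)  # crude noise-floor estimate
    return energy_db > floor + margin_db  # True where speech is likely present
```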
In the event that a speech segment overlaps two time windows, and the time windows contain different active sectors, the segment is split into two segments, each of which is assigned to the corresponding active sector from its time window. In another embodiment, speech segmentation can be performed to detect the presence of multiple speakers in a segment of the output signal stream. It will be appreciated by the person skilled in the art that any suitable method of performing segmentation may be employed.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel methods and apparatus described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the methods and systems described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.

Claims
1. A method of determining the number of sound sources in a targeted space having a sound sensor array, each of said sound sensors in said array being operable to detect sound signal from at least one of said sound sources and to provide a stream of output signal represented in a plurality of time segments defined over a time interval, wherein each of said time segments is defined over a predetermined time period shorter than said time interval, the method comprising:
processing each of said time segments to determine a direction of said sound signal arriving at at least some of said plurality of sound sensors relative to said at least one of said sound sources in said time segments;
generating a sector activity map representing a plurality of sectors corresponding to the geometry of said sound sensor array;
mapping said determined direction of said sound signal to at least one of said plurality of sectors corresponding to said determined direction;
accumulating a count value in each of said plurality of sectors, wherein said count value represents the number of occurrences in which said determined direction of said sound signal is mapped to said at least one of said plurality of sectors in each of said time segments;
identifying the sector having the highest count value in each of said time segments over said time interval; and
determining the number of sound sources based on the number of identified sectors.
2. A method according to claim 1, further comprising combining the streams of output signal from each of said sound sensors to generate a combined stream of output signal represented in said plurality of time segments over said time interval.
3. A method according to claim 2, wherein said determining the number of sound sources comprises processing said combined stream of output signal and said identified sectors over said time interval.
4. A method according to claim 3, wherein said processing comprises assigning each of said identified sectors in each of said time segments to a corresponding time segment in said combined stream of output signal.
5. A method according to claim 4, wherein said processing further comprises determining a reference stream portion for each of said identified sectors, the reference stream portion comprising at least one of said time segments of said combined stream of output signal.
6. A method according to claim 5, wherein said processing further comprises performing a statistical test on each of said time segments assigned with said identified sectors with said reference stream portion determined for the corresponding sector.
7. A method according to claim 6, wherein said performing said statistical test includes determining a Bayesian Information Criterion value of each of said time segments assigned with said identified sectors.
8. A method according to claim 7, wherein said processing further comprises generating a speaker matrix and accumulating a value in position (i, j) of said matrix, where i corresponds to the sector of said assigned time segment, and j corresponds to the sector corresponding to said reference stream portion.
9. A method according to claim 8, wherein determining the number of sound sources further comprises identifying the number of entries in said speaker matrix.
10. A method according to any one of claims 3 to 9, further comprising detecting presence of sound signal in said combined stream of output signal.
11. A method according to claim 10, wherein said step of detecting includes performing voice activity detection.
12. A method according to claim 10 or claim 11, wherein said processing further comprises processing said time segments of combined stream of output signal in which sound signal is present and said identified sectors over said time interval.
13. A method according to any one of claims 1 to 12, wherein said step of determining said direction of said sound signal further comprises determining time difference of arrival, TDOA, values associated with said sound signal arriving at said at least some of said plurality of sound sensors.
14. A method according to claim 13, further comprising determining an angle of arrival of said sound signal relative to said plurality of sound sensors based on said determined TDOA values.
15. An apparatus for determining the number of sound sources in a targeted space having a sound sensor array, each of said sound sensors in said array being operable to detect sound signal from at least one of said sound sources and to provide a stream of output signal represented in a plurality of time segments defined over a time interval, wherein each of said time segments is defined over a predetermined time period shorter than said time interval, the apparatus comprising:
a processor operable to process each of said time segments to determine a direction of said sound signal arriving at at least some of said plurality of sound sensors relative to said at least one of said sound sources in said time segments; and
a database generator for generating a sector activity map representing a plurality of sectors corresponding to the geometry of said sound sensor array;
wherein the processor is further operable to:
map said determined direction of said sound signal to at least one of said plurality of sectors corresponding to said determined direction;
accumulate a count value in each of said plurality of sectors, wherein said count value represents the number of occurrences in which said determined direction of said sound signal is mapped to said at least one of said plurality of sectors in each of said time segments;
identify the sector having the highest count value in each of said time segments over said time interval; and
determine the number of sound sources based on the number of identified sectors.
16. An apparatus according to claim 15, wherein said processor is further operable to combine the streams of output signal from each of said sound sensors to generate a combined stream of output signal represented in said plurality of time segments over said time interval.
17. An apparatus according to claim 16, wherein said processor is further operable to process said combined stream of output signal and said identified sectors over said time interval.
18. An apparatus according to claim 17, wherein said processor is further operable to assign each of said identified sectors in each of said time segments to a corresponding time segment in said combined stream of output signal.
19. An apparatus according to claim 18, wherein said processor is further operable to determine a reference stream portion for each of said identified sectors, the reference stream portion comprising at least one of said time segments of said combined stream of output signal.
20. An apparatus according to claim 19, wherein said processor is further operable to perform a statistical test on each of said time segments assigned with said identified sectors with said reference stream portion determined for the corresponding sector.
21. An apparatus according to claim 20, wherein said processor is further operable to determine a Bayesian Information Criterion value of each of said time segments assigned with said identified sectors.
22. An apparatus according to claim 21, wherein said database generator is further operable to generate a speaker matrix and accumulate a value in position (i, j) of said matrix, where i corresponds to the sector of said assigned time segment, and j corresponds to the sector corresponding to said reference stream portion.
23. An apparatus according to claim 22, wherein said processor is operable to identify the number of entries in said speaker matrix.
24. An apparatus according to any one of claims 17 to 23, further comprising a detector for detecting presence of sound signal in said combined stream of output signal.
25. An apparatus according to claim 24, wherein said detector is operable to perform voice activity detection.
26. An apparatus according to claim 24 or claim 25, wherein said processor is further operable to process said time segments of combined stream of output signal in which sound signal is present and said identified sectors over said time interval.
27. An apparatus according to any one of claims 15 to 26, wherein said processor is further operable to determine time difference of arrival, TDOA, values associated with said sound signal arriving at said at least some of said plurality of sound sensors.
28. An apparatus according to claim 27, wherein said processor is further operable to determine an angle of arrival of said sound signal relative to said plurality of sound sensors based on said determined TDOA values.
PCT/GB2013/050271 2012-03-05 2013-02-06 Method and apparatus for determining the number of sound sources in a targeted space WO2013132216A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB1203810.5 2012-03-05
GB1203810.5A GB2501058A (en) 2012-03-05 2012-03-05 A speaker diarization system

Publications (1)

Publication Number Publication Date
WO2013132216A1 true WO2013132216A1 (en) 2013-09-12

Family

ID=46003112

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2013/050271 WO2013132216A1 (en) 2012-03-05 2013-02-06 Method and apparatus for determining the number of sound sources in a targeted space

Country Status (2)

Country Link
GB (1) GB2501058A (en)
WO (1) WO2013132216A1 (en)


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8554562B2 (en) * 2009-11-15 2013-10-08 Nuance Communications, Inc. Method and system for speaker diarization
US8433567B2 (en) * 2010-04-08 2013-04-30 International Business Machines Corporation Compensation of intra-speaker variability in speaker diarization
EP2585947A1 (en) * 2010-06-23 2013-05-01 Telefónica, S.A. A method for indexing multimedia information

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030016588A1 (en) * 2000-08-11 2003-01-23 Hans-Ueli Roeck Method for directional location and locating system
FR2947931A1 (en) * 2009-07-10 2011-01-14 France Telecom LOCATION OF SOURCES
FR2954513A1 (en) * 2009-12-21 2011-06-24 Thales Sa METHOD AND SYSTEM FOR ESTIMATING THE NUMBER OF SOURCES INCIDENTED TO A SENSOR ARRAY BY ESTIMATING NOISE STATISTICS

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
EL CHAMI Z ET AL: "A phase-based dual microphone method to count and locate audio sources in reverberant rooms", APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS, 2009. WASPAA '09. IEEE WORKSHOP ON, IEEE, PISCATAWAY, NJ, USA, 18 October 2009 (2009-10-18), pages 209 - 212, XP031575126, ISBN: 978-1-4244-3678-1 *
ERICH ZWYSSIG ET AL: "Determining the number of speakers in a meeting using microphone array features", 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2012) : KYOTO, JAPAN, 25 - 30 MARCH 2012 ; [PROCEEDINGS], IEEE, PISCATAWAY, NJ, 25 March 2012 (2012-03-25), pages 4765 - 4768, XP032228220, ISBN: 978-1-4673-0045-2, DOI: 10.1109/ICASSP.2012.6288984 *
LATHOUD G ET AL: "A Sector-based Approach for Localizationof Multiple Speakers with Microphone Arrays", 3 October 2004 (2004-10-03), pages 1 - 6, XP007921806, Retrieved from the Internet <URL:http://www.isca-speech.org/archive_open/archive_papers/sapa_04/sap4_93.pdf> [retrieved on 20130423] *
SWAMY R K ET AL: "Determining Number of Speakers From Multispeaker Speech Signals Using Excitation Source Information", IEEE SIGNAL PROCESSING LETTERS, IEEE SERVICE CENTER, PISCATAWAY, NJ, US, vol. 14, no. 7, 1 July 2007 (2007-07-01), pages 481 - 484, XP011185729, ISSN: 1070-9908, DOI: 10.1109/LSP.2006.891333 *
VALIN J M ET AL: "Robust sound source localization using a microphone array on a mobile robot", PROCEEDINGS OF THE 2003 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS. (IROS 2003). LAS VEGAS, NV, OCT. 27 - 31, 2003; [IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS], NEW YORK, NY : IEEE, US, 27 October 2003 (2003-10-27), pages 1 - 6, XP002586779, ISBN: 978-0-7803-7860-5 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2927853A1 (en) 2014-04-04 2015-10-07 AirbusGroup Limited Method of capturing and structuring information from a meeting
CN110178178A (en) * 2016-09-14 2019-08-27 纽昂斯通讯有限公司 Microphone selection and multiple talkers segmentation with environment automatic speech recognition (ASR)
CN110178178B (en) * 2016-09-14 2023-10-10 纽昂斯通讯有限公司 Microphone selection and multiple speaker segmentation with ambient Automatic Speech Recognition (ASR)
CN112185413A (en) * 2020-09-30 2021-01-05 北京搜狗科技发展有限公司 Voice processing method and device for voice processing
CN112185413B (en) * 2020-09-30 2024-04-12 北京搜狗科技发展有限公司 Voice processing method and device for voice processing
CN116030815A (en) * 2023-03-30 2023-04-28 北京建筑大学 Voice segmentation clustering method and device based on sound source position

Also Published As

Publication number Publication date
GB201203810D0 (en) 2012-04-18
GB2501058A (en) 2013-10-16

Similar Documents

Publication Publication Date Title
US11398235B2 (en) Methods, apparatuses, systems, devices, and computer-readable storage media for processing speech signals based on horizontal and pitch angles and distance of a sound source relative to a microphone array
US10469967B2 (en) Utilizing digital microphones for low power keyword detection and noise suppression
US9837099B1 (en) Method and system for beam selection in microphone array beamformers
US9668048B2 (en) Contextual switching of microphones
US9978388B2 (en) Systems and methods for restoration of speech components
CN108922553B (en) Direction-of-arrival estimation method and system for sound box equipment
US9500739B2 (en) Estimating and tracking multiple attributes of multiple objects from multi-sensor data
CN109599124A (en) A kind of audio data processing method, device and storage medium
US20160187453A1 (en) Method and device for a mobile terminal to locate a sound source
JP4565162B2 (en) Speech event separation method, speech event separation system, and speech event separation program
US11869481B2 (en) Speech signal recognition method and device
JP6467736B2 (en) Sound source position estimating apparatus, sound source position estimating method, and sound source position estimating program
WO2013132216A1 (en) Method and apparatus for determining the number of sound sources in a targeted space
Yella et al. Improved overlap speech diarization of meeting recordings using long-term conversational features
CN110992972B (en) Sound source noise reduction method based on multi-microphone earphone, electronic equipment and computer readable storage medium
US20230116052A1 (en) Array geometry agnostic multi-channel personalized speech enhancement
JP5215826B2 (en) Multiple signal section estimation apparatus, method and program
Nakadai et al. Footstep detection and classification using distributed microphones
CN110275138B (en) Multi-sound-source positioning method using dominant sound source component removal
CN113707149A (en) Audio processing method and device
Inoue et al. Speaker diarization using eye-gaze information in multi-party conversations
Liu et al. A unified network for multi-speaker speech recognition with multi-channel recordings
US20240242728A1 (en) Cascade Architecture for Noise-Robust Keyword Spotting
US20230097197A1 (en) Cascade Architecture for Noise-Robust Keyword Spotting
Dang et al. An iteratively reweighted steered response power approach to multisource localization using a distributed microphone network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13707203

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13707203

Country of ref document: EP

Kind code of ref document: A1