WO2013132216A1 - Method and apparatus for determining the number of sound sources in a targeted space

Info

Publication number
WO2013132216A1
WO2013132216A1 (PCT/GB2013/050271)
Authority
WO
WIPO (PCT)
Prior art keywords
sound
time segments
sectors
output signal
time
Application number
PCT/GB2013/050271
Other languages
French (fr)
Inventor
Erich ZWYSSIG
Original Assignee
EADS UK Limited
Application filed by EADS UK Limited
Publication of WO2013132216A1

Classifications

    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01S - RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S3/00 - Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic, or electromagnetic waves, or particle emission, not having a directional significance, are being received
    • G01S3/80 - Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic, or electromagnetic waves, or particle emission, not having a directional significance, are being received using ultrasonic, sonic or infrasonic waves
    • G01S3/802 - Systems for determining direction or deviation from predetermined direction
    • G01S3/808 - Systems for determining direction or deviation from predetermined direction using transducers spaced apart and measuring phase or time difference between signals therefrom, i.e. path-difference systems
    • G01S3/8086 - Systems for determining direction or deviation from predetermined direction using transducers spaced apart and measuring phase or time difference between signals therefrom, i.e. path-difference systems, determining other position line of source
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/20 - Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00 - Details of transducers, loudspeakers or microphones
    • H04R1/20 - Arrangements for obtaining desired frequency or directional characteristics
    • H04R1/32 - Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only
    • H04R1/40 - Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers
    • H04R1/406 - Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers, microphones
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2201/00 - Details of transducers, loudspeakers or microphones covered by H04R1/00 but not provided for in any of its subgroups
    • H04R2201/40 - Details of arrangements for obtaining desired directional characteristic by combining a number of identical transducers covered by H04R1/40 but not provided for in any of its subgroups
    • H04R2201/401 - 2D or 3D arrays of transducers
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2430/00 - Signal processing covered by H04R, not provided for in its groups
    • H04R2430/20 - Processing of the output signals of the acoustic transducers of an array for obtaining a desired directivity characteristic
    • H04R2430/23 - Direction finding using a sum-delay beam-former
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 - Circuits for transducers, loudspeakers or microphones
    • H04R3/005 - Circuits for transducers, loudspeakers or microphones for combining the signals of two or more microphones

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention relates to a method of determining the number of sound sources in a targeted space having a sound sensor array operable to detect sound signals from at least one of said sound sources and to provide, from each of the sound sensors, a stream of output signal represented in a plurality of time segments defined over a time interval, the method comprising determining a direction of said sound signals arriving at at least some of said plurality of sound sensors in said time segments, mapping said determined direction of said sound signal to at least one of a plurality of sectors of a sector activity map corresponding to said determined direction, determining the number of occurrences in which said determined direction of said sound signal is mapped to said at least one of said plurality of sectors in each of said time segments, and determining the number of sound sources by identifying the sector having the highest number of occurrences in each of said time segments over said time interval.

Description

SIGNAL PROCESSING METHODS AND APPARATUS
Field of the Invention
The invention relates generally to signal processing and, more particularly, to a method of processing a signal in a speaker diarisation system.
Background of the Invention
Recently, speaker diarisation has become a popular area of research. This is because speaker diarisation is an important technology for a number of applications, including security applications in the areas of law enforcement, crisis management, and military command and control (C2), and commercial applications such as information retrieval in the business and financial sectors, for example analysing a meeting (or a meeting recording) with conversations involving several people.
Speaker diarisation relates to the problem of "who spoke when?", or more formally, it aims to determine the number of active speakers in a recording (or in real-time) and identify when each speaker is talking.
Speaker diarisation is typically carried out in three steps: (i) detecting when speech is present in the recording, (ii) splitting the speech segments where the speaker changes mid-segment, and (iii) identifying and clustering speech segments from the same speaker.
It is noted that the accuracy of a speaker diarisation system relies heavily on determining the correct number of speakers.
Summary of the Invention
In a first aspect of the invention there is provided a method of determining the number of sound sources in a targeted space having a sound sensor array, each of said sound sensors in said array being operable to detect sound signal from at least one of said sound sources and to provide a stream of output signal represented in a plurality of time segments defined over a time interval, wherein each of said time segments is defined over a predetermined time period shorter than said time interval, the method comprising:
processing each of said time segments to determine a direction of said sound signal arriving at at least some of said plurality of sound sensors relative to said at least one of said sound sources in said time segments, generating a sector activity map representing a plurality of sectors corresponding to the geometry of said sound sensor array;
mapping said determined direction of said sound signal to at least one of said plurality of sectors corresponding to said determined direction;
accumulating a count value in each of said plurality of sectors, wherein said count value represents the number of occurrences in which said determined direction of said sound signal is mapped to said at least one of said plurality of sectors in each of said time segments;
identifying the sector having the highest count value in each of said time segments over said time interval; and
determining the number of sound sources based on the number of identified sectors.
In an embodiment of the invention, the method further comprises combining the streams of output signal from each of said sound sensors to generate a combined stream of output signal represented in said plurality of time segments over said time interval.
The step of determining the number of sound sources may comprise processing said combined stream of output signal and said identified sectors over said time interval.
Said processing may comprise assigning each of said identified sectors in each of said time segments to a corresponding time segment in said combined stream of output signal.
Said processing may further comprise determining a reference stream portion for each of said identified sectors, the reference stream portion comprising at least one of said time segments of said combined stream of output signal.
Said processing may further comprise performing a statistical test on each of said time segments assigned with said identified sectors with said reference stream portion determined for the corresponding sector.
The step of performing said statistical test may include determining a Bayesian Information Criterion value of each of said time segments assigned with said identified sectors.
Said processing may further comprise generating a speaker matrix and accumulating a value in position (i, j) of said matrix, where i corresponds to the sector of said assigned time segment, and j corresponds to the sector corresponding to said reference stream portion with said Bayesian Information Criterion value exceeding a predefined threshold. The step of determining the number of sound sources may further comprise identifying the number of entries in said speaker matrix.
In another embodiment of the invention the method further comprises detecting presence of sound signal in said combined stream of output signal.
The step of detecting presence of sound signal may include performing voice or speech activity detection.
Said processing may further comprise processing said time segments of the combined stream of output signal in which sound signal is present, and said identified sectors, over said time interval.
The step of determining said direction of said sound signal may further comprise determining time difference of arrival, TDOA, values associated with said sound signal arriving at said at least some of said plurality of sound sensors.
The step of determining said direction of said sound signal may further comprise determining an angle of arrival of said sound signal relative to said plurality of sound sensors based on said determined TDOA values.
In a second aspect of the invention there is provided an apparatus for determining the number of sound sources in a targeted space having a sound sensor array, each of said sound sensors in said array being operable to detect sound signal from at least one of said sound sources and to provide a stream of output signal represented in a plurality of time segments defined over a time interval, wherein each of said time segments is defined over a predetermined time period shorter than said time interval, the apparatus comprising a processor operable to process each of said time segments to determine a direction of said sound signal arriving at at least some of said plurality of sound sensors relative to said at least one of said sound sources in said time segments, and a database generator for generating a sector activity map representing a plurality of sectors corresponding to the geometry of said sound sensor array, wherein the processor is further operable to:
map said determined direction of said sound signal to at least one of said plurality of sectors corresponding to said determined direction;
accumulate a count value in each of said plurality of sectors, wherein said count value represents the number of occurrences in which said determined direction of said sound signal is mapped to said at least one of said plurality of sectors in each of said time segments;
identify the sector having the highest count value in each of said time segments over said time interval; and
determine the number of sound sources based on the number of identified sectors.
The processor may be further operable to combine the streams of output signal from each of said sound sensors to generate a combined stream of output signal represented in said plurality of time segments over said time interval.
The processor may be further operable to process said combined stream of output signal and said identified sectors over said time interval.
The processor may be further operable to assign each of said identified sectors in each of said time segments to a corresponding time segment in said combined stream of output signal.
The processor may be further operable to determine a reference stream portion for each of said identified sectors, the reference stream portion comprising at least one of said time segments of said combined stream of output signal.
The processor may be further operable to perform a statistical test on each of said time segments assigned with said identified sectors with said reference stream portion determined for the corresponding sector.
The processor may be further operable to determine a Bayesian Information Criterion value of each of said time segments assigned with said identified sectors.
The database generator may be further operable to generate a speaker matrix and accumulate a value in position (i, j) of said matrix, where i corresponds to the sector of said assigned time segment, and j corresponds to the sector corresponding to said reference stream portion with said Bayesian Information Criterion value exceeding a predefined threshold.
The processor may be operable to identify the number of entries in said speaker matrix. In one embodiment of the invention the apparatus further comprises a detector for detecting presence of sound signal in said combined stream of output signal.
The detector may be operable to perform voice or speech activity detection.
The processor may be further operable to process said time segments of the combined stream of output signal in which sound signal is present, and said identified sectors, over said time interval.
The processor may be further operable to determine time difference of arrival, TDOA, values associated with said sound signal arriving at said at least some of said plurality of sound sensors.
The processor may be further operable to determine an angle of arrival of said sound signal relative to said plurality of sound sensors based on said determined TDOA values.
An aspect of the invention provides a computer program product comprising computer executable instructions which, when executed by a computer, cause the computer to perform a method as set out above. The computer program product may be embodied in a carrier medium, which may be a storage medium or a signal medium. A storage medium may include optical storage means, or magnetic storage means, or electronic storage means.
The above aspects of the invention can be incorporated into a specific hardware device, a general purpose device configured by suitable software, or a combination of both. The invention can be embodied in a software product, either as a complete software implementation of the invention, or as an add-on component for modification or enhancement of existing software (such as a plug-in). Such a software product could be embodied in a carrier medium, such as a storage medium (e.g. an optical disk or a mass storage memory such as a FLASH memory) or a signal medium (such as a download). Specific hardware devices suitable for the embodiment of the invention could include an application specific device such as an ASIC, an FPGA, a GPU, a CPU, or a DSP, or other dedicated functional hardware means. The reader will understand that none of the foregoing discussion of embodiments of the invention in software or hardware limits future implementation of the invention on yet to be discovered or defined means of execution.
Brief description of the Drawings
In the following, embodiments of the invention will be explained in more detail with reference to the drawings, in which:
Figure 1 illustrates an arrangement of a sound sensing device deployed in a targeted space according to an embodiment of the invention;
Figure 2 is a top view of the sound sensing device of Figure 1, illustrating an array of sound sensors implemented as part of the sound sensing device according to an embodiment of the invention;
Figure 3 is a side view of the sound sensing device of Figure 1, illustrating an array of sound sensors implemented as a part of the sound sensing device according to an embodiment of the invention;
Figure 4 illustrates a block diagram representation of a speaker diarisation device according to an embodiment of the invention;
Figure 5 illustrates a block diagram representation of a speaker diarisation device according to another embodiment of the invention;
Figure 6 illustrates a flow diagram of a method of determining the number of sound sources according to an embodiment of the invention;
Figure 7 illustrates a flow diagram of a process of performing sector activity count according to an embodiment of the invention;
Figure 8 illustrates a flow diagram of a method of determining the number of sound sources according to another embodiment of the invention;
Figure 9 illustrates a flow diagram of a process of determining a speaker matrix according to an embodiment of the invention; and
Figure 10 illustrates a flow diagram of a method of determining the number of sound sources according to yet another embodiment of the invention.
Detailed Description
Specific embodiments of the present invention will be described in further detail on the basis of the attached diagrams. It will be appreciated that this is by way of example only, and should not be viewed as presenting any limitation on the scope of protection sought.
An overview of a deployment of a sound sensing device 10 for determining the number of sound sources in a targeted space, for example an indoor space (such as a meeting room 16), is illustrated in Figure 1. A person skilled in the art will appreciate that the sound sensing device 10 can also be deployed in an outdoor environment, although noise reduction and filtering techniques may need to be applied to reduce background noise.
As will be described in due course, the sound sensing device 10 is capable of transducing sound signals from a number of sound sources into electrical signals. In this example, the sound sources include a group of participants 12a, 12b, 12c, 12d, 12e gathered about a meeting table 14 in the meeting room 16. It is understood that the sound sources may also include a telephone being put on a speaker mode and/or sounds from a television when a video conference is being held. It is further noted that the number of sound sources may be less than the number of participants, for example there may be only two or three participants involved in a discussion.
As shown in Figure 1 , the sound sensing device 10 is positioned around the centre of the meeting table 14. However, it is understood that the sound sensing device 10 can also be positioned anywhere on the table 14, for example on an end of the meeting table 14. Indeed in an alternative arrangement, the sound sensing device 10 may be deployed by attaching it to the ceiling of the meeting room 16.
Figures 2 and 3 illustrate a top view and a side view respectively, of the sound sensing device 10.
As illustrated in Figure 2, the sound sensing device 10 comprises an array of sound sensors 20a-h, such as microphones, for detecting sound signals at a given time. It will be appreciated by the person skilled in the art that any suitable means for detecting sound signals may be employed.
The sound sensors 20a-h are equally spaced around the circumference of the sound sensing device 10 to provide omni-directional coverage, receiving sound signals from all directions. While a circular configuration is illustrated, it will be appreciated by the skilled person that other configurations are also possible. As illustrated in Figure 2, eight sound sensors are provided, and it is understood that the accuracy of sound source localisation increases with the number of sound sensors implemented. However, it is also noted that this does not prevent the present invention from being employed in a set-up with more or fewer than eight sound sensors.
Figure 4 illustrates schematically components of a speaker diarisation device 30 according to an embodiment of the invention. The speaker diarisation device 30 includes an input/output (I/O) interface 32, a working memory 34, a signal processor 36, and a mass storage unit 38.
The sound sensing device 10 comprising the sound sensor array 20a-h is in communication with the speaker diarisation device 30 to provide sound signals detected by the sound sensor array 20a-h to the speaker diarisation device 30. In an alternative configuration, the sound sensing device 10 may also be integrated with the speaker diarisation device 30.
As shown in Figure 4, the output of the sound sensing device 10 is connected to the signal processor 36 via the I/O interface 32. By this connection, the detected sound signals can be input to the signal processor 36. The I/O interface 32 also includes an analogue-to-digital converter (ADC) 40 which converts the analogue output signals from the sound sensor array 20a-h into digital input signals. It will be appreciated that the sound sensing device 10 may also include an ADC (not shown) to provide digital signals directly from its output. In operation, the sound sensing device 10 continuously monitors sound signals and provides the detected sound signals to the signal processor 36 and/or the mass storage unit 38. The received sound signals may be processed in real-time by the signal processor 36 or stored as data in the mass storage unit 38 to be post-processed when required. In the context of speaker diarisation, it is noted that the sound signals generally comprise speech signals. The output of each of the sound sensors is an output signal stream comprising presence and absence of speech signals over a time interval, for example the duration of a meeting.
By means of a general purpose bus 42, external devices such as the sound sensing device 10, user input devices (not shown), and audio/video hardware devices (not shown) are in communication with the signal processor 36 through the I/O interface 32.
The user operable input devices 42 may comprise, in this example, a keyboard and a mouse, though it will be appreciated that any other input devices could also or alternatively be provided, such as another type of pointing device, a writing tablet, speech recognition means, or any other means by which a user input action can be interpreted and converted into data signals.
Audio/video hardware devices can also be connected to the I/O interface for the output of information to a user. Audio/video output hardware devices can include a visual display unit, a speaker, or any other device capable of presenting information to a user.
The signal processor 36 is operable to execute machine code instructions stored in a working memory 34 and/or retrievable from a mass storage unit 38. The signal processor 36 processes the incoming signals according to the method described in the forthcoming paragraphs.
In another embodiment of the invention, the speaker diarisation device includes a communication unit configured to transmit stored data or to relay the received sound signals to a remote location for processing. A schematic diagram of such a speaker diarisation device 50 according to this embodiment is illustrated in Figure 5.
As shown in Figure 5, the speaker diarisation device 50 includes a communication unit 66 operable to establish communication with a remote station. It is noted that such an arrangement allows the received sound signals to be processed at the remote station. It is further noted that the received sound signals (and/or the stored data) may be transmitted in real-time to the remote station or they may be stored in the mass storage unit 58 and transmitted to the remote station when required.
In one embodiment of the present invention, there is provided a method of accurately determining the number of sound sources in a targeted space for a speaker diarisation system.
Figure 6 illustrates a flow diagram of a speaker diarisation method according to an embodiment of the invention.
Referring to the flow diagram of Figure 6, the process begins with receiving data at step 100. The received data comprises streams of output signals from each of the sound sensors received over a time interval. In this embodiment, the data corresponds to the received signals (when real-time processing is performed), or the received signals previously stored in the mass storage unit (when post-processing is performed). In step 102, noise reduction is performed on the data associated with each of the output signals received via sound sensors 20a-h of the sound sensor array 20 to reduce the amount of noise present in the output signals. In this example, Wiener filtering is applied to the received signals to remove any additive noise present in the signals. It will be appreciated that other noise reduction techniques or any suitable means of reducing noise of output signals can be applied. It will also be appreciated by the person skilled in the art that this is not an essential step for the purpose of the present invention. However, it is understood that the overall diarisation error rate (DER) may be reduced by performing this step.
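By way of illustration only, the following Python sketch applies a simple Wiener-style gain in the short-time Fourier domain. The patent does not prescribe a particular estimator; the assumption that the first few frames contain background noise only, and the function name wiener_denoise and its parameters, are choices made here for the example.

```python
import numpy as np
from scipy.signal import stft, istft

def wiener_denoise(x, fs, noise_frames=10):
    """Single-channel Wiener-style noise reduction in the STFT domain,
    assuming the first `noise_frames` frames contain background noise only."""
    f, t, X = stft(x, fs=fs, nperseg=512)
    noise_psd = np.mean(np.abs(X[:, :noise_frames]) ** 2, axis=1, keepdims=True)
    sig_psd = np.maximum(np.abs(X) ** 2 - noise_psd, 1e-12)  # crude signal PSD
    gain = sig_psd / (sig_psd + noise_psd)                   # Wiener gain per bin
    _, x_hat = istft(gain * X, fs=fs, nperseg=512)
    return x_hat
```

In the arrangement of Figure 6, such a filter would be applied independently to each of the eight sensor streams before TDOA estimation.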
In an embodiment of the invention, one of the sound sensors 20a-h in the sound sensor array 20 is assigned as a reference. In this example, the sound sensor 20a is assigned as the reference. It is noted that the reference sound sensor may change during the time interval, or the same reference sound sensor may be used throughout the whole time interval.
Each of the remaining sound sensors is paired with the reference sound sensor 20a, resulting in seven pairs of sound sensors as depicted in the following table.
Pair 1: 20a and 20b; Pair 2: 20a and 20c; Pair 3: 20a and 20d; Pair 4: 20a and 20e; Pair 5: 20a and 20f; Pair 6: 20a and 20g; Pair 7: 20a and 20h
Table 1
In step 104, time difference of arrival (TDOA) estimation is performed to identify the time difference between signals from a given sound source arriving at a pair of sound sensors. Thus, the TDOA estimation produces seven outputs; each output corresponds to an output from a respective pair of sound sensors. One example of performing TDOA estimation on the output signals is the Generalised Cross Correlation with Phase Transform (GCC-PHAT). The resultant TDOA estimates are further improved by performing Viterbi smoothing in step 106. However, it will be understood that this is not an essential step for the purpose of the present invention.
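As an illustration of the GCC-PHAT estimation mentioned above, the following Python sketch computes a TDOA estimate for one pair of sound sensors; the function name gcc_phat and its parameters are illustrative assumptions, not part of the disclosure.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """TDOA (seconds) of `sig` relative to `ref` via GCC-PHAT for one sensor pair."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12                      # phase transform (PHAT) weighting
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2 if max_tau is None else min(n // 2, int(max_tau * fs))
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift   # lag (in samples) of the CC peak
    return shift / fs
```

In the set-up described, this estimator would be run once per pair in Table 1, yielding the seven TDOA outputs of step 104.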
Each of the outputs of the TDOA estimation is provided to determine a sector activity map (in step 108).
The sector activity mapping is performed in step 108, and will be described in the forthcoming paragraphs, with reference to Figure 7.
Referring to the flow diagram in Figure 7, the process begins with receiving the TDOA estimation values in step 200.
Since the relative location of a pair of sound sensors is known, given the TDOA values for the pair of sound sensors, the angle of arrival (AOA) of a sound signal in relation to the sound sensors can be determined. In fact, due to rotational symmetry, for a pair of sound sensors a single delay estimate results in two angles of arrival, that is, the actual angle of arrival and a second angle of arrival reflected about the axis through the two sound sensors.
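The relationship between a TDOA value and the two candidate angles of arrival can be sketched as follows, assuming far-field sources and a known sensor spacing; the speed-of-sound constant and the function name are illustrative.

```python
import numpy as np

C = 343.0  # speed of sound in air (m/s), an assumed room-temperature value

def aoa_from_tdoa(tau, mic_distance):
    """Far-field angles of arrival (degrees) for one sensor pair.

    cos(theta) = c * tau / d, so a single delay estimate yields theta and
    -theta, the reflection about the axis through the two sensors noted above."""
    cos_theta = np.clip(C * tau / mic_distance, -1.0, 1.0)
    theta = np.degrees(np.arccos(cos_theta))
    return theta, -theta
```

Combining the pairs of the array resolves this two-fold ambiguity, since only the true direction is consistent across all seven pairs.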
In order to identify the angle of the sound source in relation to the sound sensor array, a sector activity (SA) map of N = 36 sectors is used such that a main lobe of 10 degrees is provided in each sector. The sector map records the number of times a sound signal arrives in a particular sector. It will be appreciated that any number of sectors can be used, but the accuracy of sound source localisation can be increased with the number of sectors.
The estimated TDOA values of each microphone pair are provided, and the corresponding AOA values are determined every predetermined time interval (for example, 1 second) over a predetermined time window (for example, a 5 second window) until the end of the output signal stream.
The determined AOA values are mapped onto the SA map. Essentially, a count is incremented and accumulated, over the predetermined time interval, in the sector of the SA map (step 204) that corresponds to a determined AOA value (step 206). At the end of the predetermined time interval, the sector with the highest value is determined (step 208). The process (steps 202 to 210) is repeated in the next predetermined time interval until the end of the output signal stream (see steps 210 and 212). For example, where the predetermined time interval is 1 second and the counts are accumulated over a 5 second time window, the highest scoring sector for each time window is recorded, such that for every second of the output signal stream the sector with the most activity over 5 seconds is determined.
Finally, the output of the SA map is provided in step 214. It is appreciated that the output of the SA map comprises a representation of the most active sector identified for every 1 second step over a 5 second time window, until the end of the output signal stream.
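The accumulation described in steps 202 to 214 might be sketched as follows, assuming a trailing 5 second window advanced in 1 second steps and AOA values quantised into 10 degree sectors; the data layout and names are assumptions of this example, not the patent's.

```python
import numpy as np

def sector_activity(aoa_per_second, n_sectors=36, window=5):
    """Return the most active sector for each 1-second step.

    aoa_per_second[t] holds the AOA values (degrees) estimated from all
    sensor pairs during second t of the output signal stream.
    """
    width = 360.0 / n_sectors
    hits = np.zeros((len(aoa_per_second), n_sectors), dtype=int)
    for t, angles in enumerate(aoa_per_second):
        for a in angles:
            hits[t, int((a % 360.0) // width)] += 1  # increment the sector count
    winners = []
    for t in range(len(aoa_per_second)):
        counts = hits[max(0, t - window + 1):t + 1].sum(axis=0)  # accumulate over the window
        winners.append(int(np.argmax(counts)))                   # highest-scoring sector
    return winners
```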
Referring back to Figure 6, the number of sound sources is determined in step 114 by determining the number of active sectors. Speaker diarisation is performed in step 116. It is noted that any suitable method of obtaining the diarisation output may be employed, and therefore details of obtaining the diarisation output will not be described.
In another embodiment of the invention, beamforming is applied using the TDOA estimates to combine the eight outputs from the sensors into a single stream of output signal. The single stream of output signal can be represented as a plurality of time segments over the time interval of the stream.
An example of the beamforming is delay-sum beamforming. As shown in Figure 8, the delay-sum beamforming is performed in step 310. It is noted that the sector activity mapping (step 308) and the delay-sum beamforming (step 310) can be performed simultaneously or consecutively in either order.
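A minimal delay-and-sum sketch, assuming equal-length channels and integer-sample alignment (a practical beamformer would use fractional delays and zero padding rather than np.roll, whose shift wraps around at the edges):

```python
import numpy as np

def delay_and_sum(channels, delays, fs):
    """Time-align each channel by its TDOA (seconds, relative to the
    reference sensor) and average into a single combined stream."""
    out = np.zeros(len(channels[0]))
    for x, tau in zip(channels, delays):
        out += np.roll(x, -int(round(tau * fs)))  # advance the channel by its delay
    return out / len(channels)
```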
In order to avoid an entire segment being incorrectly assigned to a single speaker during diarisation, the Bayesian information criterion (BIC) is employed to determine whether a segment of the combined output contains one or more speakers.
The Bayesian information criterion for an audio cluster, $C_k$, is defined as:

$$\mathrm{BIC}(C_k) = n_k \log\lvert\Sigma_k\rvert + \lambda P \qquad (1)$$

where $n_k$ is the number of samples in the cluster and $\Sigma_k$ is the sample covariance matrix.
The penalty, $P$, is defined as

$$P = \frac{1}{2}\left(d + \frac{1}{2}\,d(d+1)\right)\log N \qquad (2)$$

where $N$ is the total sample size and $d$ is the number of parameters per cluster. It is noted that the penalty weight, $\lambda$, is usually set to 1.
The Bayesian information criterion is then used to calculate whether a speech segment contains one or more different speakers, and to determine whether two speech segments are from the same speaker. The increase in the BIC value for merging two segments $s_1$ and $s_2$ is defined as:

$$\Delta\mathrm{BIC} = n \log\lvert\Sigma\rvert - n_1 \log\lvert\Sigma_1\rvert - n_2 \log\lvert\Sigma_2\rvert - \lambda P \qquad (3)$$

where $n = n_1 + n_2$ and $\Sigma$, $\Sigma_1$ and $\Sigma_2$ are the sample covariance matrices of the merged segment and of the two individual segments respectively.
If the ΔBIC value is greater than zero, then the information content of the merged segments is higher than that of the individual segments, and the two segments are likely to belong to the same speaker and should be merged. Similarly, a speaker change is indicated by a positive peak of the BIC value when a series of BIC values is calculated for a sliding split point of a speech segment. It is noted that, for implementing the BIC, the input speech segment can be modelled as a Gaussian process in the cepstral space.
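As an illustrative sketch of equations (2) and (3) only, the following computes the ΔBIC of merging two segments of feature vectors, modelling each segment as a single full-covariance Gaussian. It assumes the total sample size $N$ in equation (2) is the merged segment length $n = n_1 + n_2$; the function names are illustrative.

```python
import numpy as np

def delta_bic(x1, x2, lam=1.0):
    """Delta-BIC of merging two segments of feature vectors (rows = samples)."""
    def n_logdet(x):
        cov = np.atleast_2d(np.cov(x, rowvar=False))
        _, logdet = np.linalg.slogdet(cov)
        return x.shape[0] * logdet             # n_k * log|Sigma_k|

    d = x1.shape[1]
    n = x1.shape[0] + x2.shape[0]
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)  # equation (2)
    merged = np.vstack((x1, x2))
    return n_logdet(merged) - n_logdet(x1) - n_logdet(x2) - lam * penalty  # equation (3)
```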
In the present embodiment, description of steps 300 to 308 in the flowchart of Figure 8 is similar to steps 100 to 108 described above with respect to the flowchart of Figure 6. For this reason, details of steps 300 to 308 will not be described.
In step 312 a speaker matrix is generated. The step of generating a speaker matrix will now be described in detail with reference to Figure 9.
Referring to Figure 9, the process commences with receiving the output of the beamformer and the output of the SA map in step 400.
In step 402, each of the time segments of the stream of output signal is assigned to the active sector corresponding to its time window. Once each of the time segments has been assigned to a sector, the most probable time segment assigned in each sector is selected as that sector's reference (step 404). Examples of the most probable time segment include the longest series of time segments in a sector, or the time segment(s) with the highest count value.
In step 406, the BIC score of each of the time segments is determined, using equation (3), against each of the reference segments. Accordingly, if the BIC score is greater than a threshold (e.g. 0), a count is incremented in an N × N speaker matrix (SM) at location (i, j), where i corresponds to the sector where the segment was originally assigned, and j corresponds to the sector of the reference segment. Ideally, this matrix would only contain entries on its diagonal, as the originally assigned sector would be the same as the sector with the highest BIC score, and the indices of the entries would then correspond to sectors with sound sources. However, it will be appreciated that in practice the entries tend to cluster around locations on the diagonal of the SM. In step 408, the output of the SM is provided.
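A sketch of the speaker matrix accumulation, reusing the illustrative delta_bic() above; the segment and reference representations are assumptions of this example. The number of sound sources can then be read off the dominant entries of np.diag(sm), as described in the next paragraph.

```python
import numpy as np

def speaker_matrix(segments, sector_of, references, n_sectors=36, thr=0.0):
    """Accumulate BIC matches into an N x N speaker matrix.

    segments:   list of feature matrices, one per time segment.
    sector_of:  sector index originally assigned to each segment.
    references: dict mapping sector index -> reference feature matrix.
    """
    sm = np.zeros((n_sectors, n_sectors), dtype=int)
    for seg, i in zip(segments, sector_of):
        for j, ref in references.items():
            if delta_bic(seg, ref) > thr:  # BIC score above threshold
                sm[i, j] += 1              # ideally accumulates on the diagonal
    return sm
```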
Referring back to Figure 8, the number of sound sources is determined in step 314 based on the output of the SM. The sectors containing sound sources are determined based on the entries on the diagonal of the SM. It is noted that the indices of the peaks correspond to the sector number in which a sound source is predicted to be located, i.e. the speaker sectors, and accordingly the number of sound sources is determined.
Speaker diarisation is performed in step 316. As before, any suitable method of obtaining the diarisation output may be employed, and therefore details of obtaining the diarisation output will not be described.
In another embodiment, Voice Activity Detection (VAD) can be performed to detect the presence or absence of a speech signal in the stream of output signal. It will be appreciated by the person skilled in the art that any suitable method of performing VAD may be employed. As shown in Figure 10, this is performed after the beamforming, in step 512. The remaining steps (steps 500 to 518) in Figure 10 are similar to those described in Figure 8 (steps 300 to 316) above. One of the advantages of performing VAD is that the SM is generated only for time segments of the stream of output signal that contain speech signals. This allows the SM to be generated without processing redundant data (such as time segments that do not contain speech signals), which results in a more efficient use of computing resources.
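The patent leaves the choice of VAD open; purely as a stand-in, a simple frame-energy detector might look like the following (the frame length and threshold margin are assumptions of this sketch):

```python
import numpy as np

def energy_vad(signal, fs, frame_ms=20, margin_db=6.0):
    """Flag frames whose energy exceeds an estimated noise floor by `margin_db`."""
    n = int(fs * frame_ms / 1000)
    frames = signal[: len(signal) // n * n].reshape(-1, n)
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    floor = np.percentile(energy_db, 10)  # crude noise-floor estimate
    return energy_db > floor + margin_db  # True where speech is likely present
```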
In the event that a speech segment overlaps two time windows, and the time windows contain different active sectors, the segment is split into two segments, each of which is assigned to the corresponding active sector from its time window. In another embodiment, speech segmentation can be performed to detect the presence of multiple speakers in a segment of the output signal stream. It will be appreciated by the person skilled in the art that any suitable method of performing segmentation may be employed.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel methods and apparatus described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the methods and systems described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.

Claims
1. A method of determining the number of sound sources in a targeted space having a sound sensor array, each of said sound sensors in said array being operable to detect sound signal from at least one of said sound sources and to provide a stream of output signal represented in a plurality of time segments defined over a time interval, wherein each of said time segments is defined over a predetermined time period shorter than said time interval, the method comprising:
processing each of said time segments to determine a direction of said sound signal arriving at at least some of said plurality of sound sensors relative to said at least one of said sound sources in said time segments;
generating a sector activity map representing a plurality of sectors corresponding to the geometry of said sound sensor array;
mapping said determined direction of said sound signal to at least one of said plurality of sectors corresponding to said determined direction;
accumulating a count value in each of said plurality of sectors, wherein said count value represents the number of occurrences in which said determined direction of said sound signal is mapped to said at least one of said plurality of sectors in each of said time segments;
identifying the sector having the highest count value in each of said time segments over said time interval; and
determining the number of sound sources based on the number of identified sectors.
2. A method according to claim 1, further comprising combining the streams of output signal from each of said sound sensors to generate a combined stream of output signal represented in said plurality of time segments over said time interval.
3. A method according to claim 2, wherein said determining the number of sound sources comprises processing said combined stream of output signal and said identified sectors over said time interval.
4. A method according to claim 3, wherein said processing comprises assigning each of said identified sectors in each of said time segments to a corresponding time segment in said combined stream of output signal.
5. A method according to claim 4, wherein said processing further comprises determining a reference stream portion for each of said identified sectors, the reference stream portion comprising at least one of said time segments of said combined stream of output signal.
6. A method according to claim 5, wherein said processing further comprises performing a statistical test on each of said time segments assigned with said identified sectors with said reference stream portion determined for the corresponding sector.
7. A method according to claim 6, wherein said performing said statistical test includes determining a Bayesian Information Criterion value of each of said time segments assigned with said identified sectors.
8. A method according to claim 7, wherein said processing further comprises generating a speaker matrix and accumulating a value in position (i, j) of said matrix, where i corresponds to the sector of said assigned time segment, and j corresponds to the sector corresponding to said reference stream portion.
9. A method according to claim 8, wherein determining the number of sound sources further comprises identifying the number of entries in said speaker matrix.
10. A method according to any one of claims 3 to 9, further comprising detecting presence of sound signal in said combined stream of output signal.
11. A method according to claim 10, wherein said step of detecting includes performing voice activity detection.
12. A method according to claim 10 or claim 11, wherein said processing further comprises processing said time segments of combined stream of output signal in which sound signal is present and said identified sectors over said time interval.
13. A method according to any one of claims 1 to 12, wherein said step of determining said direction of said sound signal further comprises determining time difference of arrival, TDOA, values associated with said sound signal arriving at said at least some of said plurality of sound sensors.
14. A method according to claim 13, further comprising determining an angle of arrival of said sound signal relative to said plurality of sound sensors based on said determined TDOA values.
15. An apparatus for determining the number of sound sources in a targeted space having a sound sensor array, each of said sound sensors in said array being operable to detect sound signal from at least one of said sound sources and to provide a stream of output signal represented in a plurality of time segments defined over a time interval, wherein each of said time segments is defined over a predetermined time period shorter than said time interval, the apparatus comprising:
a processor operable to process each of said time segments to determine a direction of said sound signal arriving at at least some of said plurality of sound sensors relative to said at least one of said sound sources in said time segments; and
a database generator for generating a sector activity map representing a plurality of sectors corresponding to the geometry of said sound sensor array;
wherein the processor is further operable to:
map said determined direction of said sound signal to at least one of said plurality of sectors corresponding to said determined direction;
accumulate a count value in each of said plurality of sectors, wherein said count value represents the number of occurrences in which said determined direction of said sound signal is mapped to said at least one of said plurality of sectors in each of said time segments;
identify the sector having the highest count value in each of said time segments over said time interval; and
determine the number of sound sources based on the number of identified sectors.
16. An apparatus according to claim 15, wherein said processor is further operable to combine the streams of output signal from each of said sound sensors to generate a combined stream of output signal represented in said plurality of time segments over said time interval.
17. An apparatus according to claim 16, wherein said processor is further operable to process said combined stream of output signal and said identified sectors over said time interval.
18. An apparatus according to claim 17, wherein said processor is further operable to assign each of said identified sectors in each of said time segments to a corresponding time segment in said combined stream of output signal.
19. An apparatus according to claim 18, wherein said processor is further operable to determine a reference stream portion for each of said identified sectors, the reference stream portion comprising at least one of said time segments of said combined stream of output signal.
20. An apparatus according to claim 19, wherein said processor is further operable to perform a statistical test on each of said time segments assigned with said identified sectors with said reference stream portion determined for the corresponding sector.
21. An apparatus according to claim 20, wherein said processor is further operable to determine a Bayesian Information Criterion value of each of said time segments assigned with said identified sectors.
22. An apparatus according to claim 21, wherein said database generator is further operable to generate a speaker matrix and accumulate a value in position (i, j) of said matrix, where i corresponds to the sector of said assigned time segment, and j corresponds to the sector corresponding to said reference stream portion.
23. An apparatus according to claim 22, wherein said processor is operable to identify the number of entries in said speaker matrix.
24. An apparatus according to any one of claims 17 to 23, further comprising a detector for detecting presence of sound signal in said combined stream of output signal.
25. An apparatus according to claim 24, wherein said detector is operable to perform voice activity detection.
26. An apparatus according to claim 24 or claim 25, wherein said processor is further operable to process said time segments of combined stream of output signal in which sound signal is present and said identified sectors over said time interval.
27. An apparatus according to any one of claims 15 to 26, wherein said processor is further operable to determine time difference of arrival, TDOA, values associated with said sound signal arriving at said at least some of said plurality of sound sensors.
28. An apparatus according to claim 27, wherein said processor is further operable to determine an angle of arrival of said sound signal relative to said plurality of sound sensors based on said determined TDOA values.
PCT/GB2013/050271 2012-03-05 2013-02-06 Method and apparatus for determining the number of sound sources in a targeted space WO2013132216A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB1203810.5 2012-03-05
GB1203810.5A GB2501058A (en) 2012-03-05 2012-03-05 A speaker diarization system

Publications (1)

Publication Number Publication Date
WO2013132216A1 true WO2013132216A1 (en) 2013-09-12

Family

ID=46003112

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2013/050271 WO2013132216A1 (en) 2012-03-05 2013-02-06 Method and apparatus for determining the number of sound sources in a targeted space

Country Status (2)

Country Link
GB (1) GB2501058A (en)
WO (1) WO2013132216A1 (en)


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8554562B2 (en) * 2009-11-15 2013-10-08 Nuance Communications, Inc. Method and system for speaker diarization
US8433567B2 (en) * 2010-04-08 2013-04-30 International Business Machines Corporation Compensation of intra-speaker variability in speaker diarization
EP2585947A1 (en) * 2010-06-23 2013-05-01 Telefónica, S.A. A method for indexing multimedia information

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030016588A1 (en) * 2000-08-11 2003-01-23 Hans-Ueli Roeck Method for directional location and locating system
FR2947931A1 (en) * 2009-07-10 2011-01-14 France Telecom LOCATION OF SOURCES
FR2954513A1 (en) * 2009-12-21 2011-06-24 Thales Sa METHOD AND SYSTEM FOR ESTIMATING THE NUMBER OF SOURCES INCIDENTED TO A SENSOR ARRAY BY ESTIMATING NOISE STATISTICS

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
EL CHAMI Z ET AL: "A phase-based dual microphone method to count and locate audio sources in reverberant rooms", APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS, 2009. WASPAA '09. IEEE WORKSHOP ON, IEEE, PISCATAWAY, NJ, USA, 18 October 2009 (2009-10-18), pages 209 - 212, XP031575126, ISBN: 978-1-4244-3678-1 *
ERICH ZWYSSIG ET AL: "Determining the number of speakers in a meeting using microphone array features", 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2012) : KYOTO, JAPAN, 25 - 30 MARCH 2012 ; [PROCEEDINGS], IEEE, PISCATAWAY, NJ, 25 March 2012 (2012-03-25), pages 4765 - 4768, XP032228220, ISBN: 978-1-4673-0045-2, DOI: 10.1109/ICASSP.2012.6288984 *
LATHOUD G ET AL: "A Sector-based Approach for Localizationof Multiple Speakers with Microphone Arrays", 3 October 2004 (2004-10-03), pages 1 - 6, XP007921806, Retrieved from the Internet <URL:http://www.isca-speech.org/archive_open/archive_papers/sapa_04/sap4_93.pdf> [retrieved on 20130423] *
SWAMY R K ET AL: "Determining Number of Speakers From Multispeaker Speech Signals Using Excitation Source Information", IEEE SIGNAL PROCESSING LETTERS, IEEE SERVICE CENTER, PISCATAWAY, NJ, US, vol. 14, no. 7, 1 July 2007 (2007-07-01), pages 481 - 484, XP011185729, ISSN: 1070-9908, DOI: 10.1109/LSP.2006.891333 *
VALIN J M ET AL: "Robust sound source localization using a microphone array on a mobile robot", PROCEEDINGS OF THE 2003 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS. (IROS 2003). LAS VEGAS, NV, OCT. 27 - 31, 2003; [IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS], NEW YORK, NY : IEEE, US, 27 October 2003 (2003-10-27), pages 1 - 6, XP002586779, ISBN: 978-0-7803-7860-5 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2927853A1 (en) 2014-04-04 2015-10-07 AirbusGroup Limited Method of capturing and structuring information from a meeting
CN110178178A (en) * 2016-09-14 2019-08-27 纽昂斯通讯有限公司 Microphone selection and multiple talkers segmentation with environment automatic speech recognition (ASR)
CN110178178B (en) * 2016-09-14 2023-10-10 纽昂斯通讯有限公司 Microphone selection and multiple speaker segmentation with ambient Automatic Speech Recognition (ASR)
CN112185413A (en) * 2020-09-30 2021-01-05 北京搜狗科技发展有限公司 Voice processing method and device for voice processing
CN112185413B (en) * 2020-09-30 2024-04-12 北京搜狗科技发展有限公司 Voice processing method and device for voice processing
CN116030815A (en) * 2023-03-30 2023-04-28 北京建筑大学 Voice segmentation clustering method and device based on sound source position

Also Published As

Publication number Publication date
GB201203810D0 (en) 2012-04-18
GB2501058A (en) 2013-10-16

Similar Documents

Publication Publication Date Title
US11398235B2 (en) Methods, apparatuses, systems, devices, and computer-readable storage media for processing speech signals based on horizontal and pitch angles and distance of a sound source relative to a microphone array
US10469967B2 (en) Utilizing digital microphones for low power keyword detection and noise suppression
US9837099B1 (en) Method and system for beam selection in microphone array beamformers
US9668048B2 (en) Contextual switching of microphones
US9978388B2 (en) Systems and methods for restoration of speech components
CN108922553B (en) Direction-of-arrival estimation method and system for sound box equipment
US9500739B2 (en) Estimating and tracking multiple attributes of multiple objects from multi-sensor data
CN109599124A (en) A kind of audio data processing method, device and storage medium
US20160187453A1 (en) Method and device for a mobile terminal to locate a sound source
JP4565162B2 (en) Speech event separation method, speech event separation system, and speech event separation program
US11869481B2 (en) Speech signal recognition method and device
JP6467736B2 (en) Sound source position estimating apparatus, sound source position estimating method, and sound source position estimating program
WO2013132216A1 (en) Method and apparatus for determining the number of sound sources in a targeted space
Yella et al. Improved overlap speech diarization of meeting recordings using long-term conversational features
CN110992972B (en) Sound source noise reduction method based on multi-microphone earphone, electronic equipment and computer readable storage medium
US20230116052A1 (en) Array geometry agnostic multi-channel personalized speech enhancement
JP5215826B2 (en) Multiple signal section estimation apparatus, method and program
Nakadai et al. Footstep detection and classification using distributed microphones
CN110275138B (en) Multi-sound-source positioning method using dominant sound source component removal
CN113707149A (en) Audio processing method and device
Inoue et al. Speaker diarization using eye-gaze information in multi-party conversations
Liu et al. A unified network for multi-speaker speech recognition with multi-channel recordings
US20240242728A1 (en) Cascade Architecture for Noise-Robust Keyword Spotting
US20230097197A1 (en) Cascade Architecture for Noise-Robust Keyword Spotting
Dang et al. An iteratively reweighted steered response power approach to multisource localization using a distributed microphone network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13707203

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13707203

Country of ref document: EP

Kind code of ref document: A1