GB2501058A - A speaker diarization system - Google Patents

A speaker diarization system

Info

Publication number
GB2501058A
Authority
GB
United Kingdom
Prior art keywords
sound
speaker
determining
time
sector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB201203810A
Other versions
GB201203810D0 (en)
Inventor
Erich Zwyssig
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Airbus Group Ltd
Original Assignee
Airbus Group Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Airbus Group Ltd filed Critical Airbus Group Ltd
Priority to GB201203810A priority Critical patent/GB2501058A/en
Publication of GB201203810D0 publication Critical patent/GB201203810D0/en
Priority to PCT/GB2013/050271 priority patent/WO2013132216A1/en
Publication of GB2501058A publication Critical patent/GB2501058A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S 3/00 Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic, or electromagnetic waves, or particle emission, not having a directional significance, are being received
    • G01S 3/80 Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic, or electromagnetic waves, or particle emission, not having a directional significance, are being received using ultrasonic, sonic or infrasonic waves
    • G01S 3/802 Systems for determining direction or deviation from predetermined direction
    • G01S 3/808 Systems for determining direction or deviation from predetermined direction using transducers spaced apart and measuring phase or time difference between signals therefrom, i.e. path-difference systems
    • G01S 3/8086 Systems for determining direction or deviation from predetermined direction using transducers spaced apart and measuring phase or time difference between signals therefrom, i.e. path-difference systems determining other position line of source
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/20 Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 1/00 Details of transducers, loudspeakers or microphones
    • H04R 1/20 Arrangements for obtaining desired frequency or directional characteristics
    • H04R 1/32 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only
    • H04R 1/40 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers
    • H04R 1/406 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers microphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 2201/00 Details of transducers, loudspeakers or microphones covered by H04R1/00 but not provided for in any of its subgroups
    • H04R 2201/40 Details of arrangements for obtaining desired directional characteristic by combining a number of identical transducers covered by H04R1/40 but not provided for in any of its subgroups
    • H04R 2201/401 2D or 3D arrays of transducers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 2430/00 Signal processing covered by H04R, not provided for in its groups
    • H04R 2430/20 Processing of the output signals of the acoustic transducers of an array for obtaining a desired directivity characteristic
    • H04R 2430/23 Direction finding using a sum-delay beam-former
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 3/00 Circuits for transducers, loudspeakers or microphones
    • H04R 3/005 Circuits for transducers, loudspeakers or microphones for combining the signals of two or more microphones

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention relates to a method of determining the number of sound sources in a targeted space having a sound sensor array to detect sound signals from at least one of said sound sources and to provide, from each of the sound sensors, a stream of output signals represented in a plurality of time segments defined over a time interval. The method comprises determining a direction of said sound signals arriving at at least some of said plurality of sound sensors in said time segments; mapping said determined direction of said sound signal to at least one of a plurality of sectors of an activity map corresponding to said determined direction; determining the number of occurrences in which said determined direction of said sound signal is mapped to said at least one of said plurality of sectors in each of said time segments; and determining the number of sound sources by identifying the sector having the highest number of occurrences in each of said time segments over said time interval. The system aims to improve the accuracy of the estimated number of speakers in a speaker diarization system.

Description

SIGNAL PROCESSING METHODS AND APPARATUS
Field of the Invention
The invention relates generally to signal processing and, more particularly, to a method of processing a signal in a speaker diarisation system.
Background of the Invention
Recently, speaker diarisation has become a popular area of research. This is because speaker diarisation is an important technology for a number of applications, including security applications in the areas of law enforcement, crisis management, and military command and control (C2), and commercial applications such as information retrieval in the business and financial sectors; for example, a meeting (or a meeting recording) with conversations involving several people.
Speaker diarisation relates to the problem of "who spoke when?" or, more formally, it aims to determine the number of active speakers in a recording (or in real time) and identify when each speaker is talking.
Speaker diarisation is typically carried out in three steps: (i) detecting when speech is present in the recording, (ii) splitting the speech segments where the speaker changes mid-segment, and (iii) identifying and clustering speech segments from the same speaker.
It is noted that the accuracy of a speaker diarisation system relies heavily on determining the correct number of speakers.
Summary of the Invention
In a first aspect of the invention there is provided a method of determining the number of sound sources in a targeted space having a sound sensor array, each of said sound sensors in said array being operable to detect sound signal from at least one of said sound sources and to provide a stream of output signal represented in a plurality of time segments defined over a time interval, wherein each of said time segments is defined over a predetermined time period shorter than said time interval, the method comprising: processing each of said time segments to determine a direction of said sound signal arriving at at least some of said plurality of sound sensors relative to said at least one of said sound sources in said time segments; generating a sector activity map representing a plurality of sectors corresponding to the geometry of said sound sensor array; mapping said determined direction of said sound signal to at least one of said plurality of sectors corresponding to said determined direction; accumulating a count value in each of said plurality of sectors, wherein said count value represents the number of occurrences in which said determined direction of said sound signal is mapped to said at least one of said plurality of sectors in each of said time segments; identifying the sector having the highest count value in each of said time segments over said time interval; and determining the number of sound sources based on the number of identified sectors.
In an embodiment of the invention the method further comprises combining the streams of output signal from each of said sound sensors to generate a combined stream of output signal represented in said plurality of time segments over said time interval. The step of determining the number of sound sources may comprise processing said combined stream of output signal and said identified sectors over said time interval.
Said processing may comprise assigning each of said identified sectors in each of said time segments to a corresponding time segment in said combined stream of output signal.
Said processing may further comprise determining a reference stream portion for each of said identified sectors, the reference stream portion comprising at least one of said time segments of said combined stream of output signal. Said processing may further comprise performing a statistical test on each of said time segments assigned with said identified sectors with said reference stream portion determined for the corresponding sector.
The step of performing said statistical test may include determining a Bayesian Information Criterion value of each of said time segments assigned with said identified sectors. Said processing may further comprise generating a speaker matrix and accumulating a value in position (i, j) of said matrix, where i corresponds to the sector of said assigned time segment, and j corresponds to the sector corresponding to said reference stream portion with said Bayesian Information Criterion value exceeding a predefined threshold.
The step of determining the number of sound sources may further comprise identifying the number of entries in said speaker matrix. In another embodiment of the invention the method further comprises detecting presence of sound signal in said combined stream of output signal. The step of detecting presence of sound signal may include performing voice or speech activity detection.
Said processing may further comprise processing said time segments of said combined stream of output signal in which sound signal is present and said identified sectors over said time interval.
The step of determining said direction of said sound signal may further comprise determining time difference of arrival, TDOA, values associated with said sound signal arriving at said at least some of said plurality of sound sensors.
The step of determining said direction of said sound signal may further comprise determining an angle of arrival of said sound signal relative to said plurality of sound sensors based on said determined TDOA values.
In a second aspect of the invention there is provided an apparatus for determining the number of sound sources in a targeted space having a sound sensor array, each of said sound sensors in said array being operable to detect sound signal from at least one of said sound sources and to provide a stream of output signal represented in a plurality of time segments defined over a time interval, wherein each of said time segments is defined over a predetermined time period shorter than said time interval, the apparatus comprising: a processor operable to process each of said time segments to determine a direction of said sound signal arriving at at least some of said plurality of sound sensors relative to said at least one of said sound sources in said time segments, and a database generator for generating a sector activity map representing a plurality of sectors corresponding to the geometry of said sound sensor array, wherein the processor is further operable to: map said determined direction of said sound signal to at least one of said plurality of sectors corresponding to said determined direction; accumulate a count value in each of said plurality of sectors, wherein said count value represents the number of occurrences in which said determined direction of said sound signal is mapped to said at least one of said plurality of sectors in each of said time segments; identify the sector having the highest count value in each of said time segments over said time interval; and determine the number of sound sources based on the number of identified sectors.
The processor may be further operable to combine the streams of output signal from each of said sound sensors to generate a combined stream of output signal represented in said plurality of time segments over said time interval. The processor may be further operable to process said combined stream of output signal and said identified sectors over said time interval.
The processor may be further operable to assign each of said identified sectors in each of said time segments to a corresponding time segment in said combined stream of output signal.
The processor may be further operable to determine a reference stream portion for each of said identified sectors, the reference stream portion comprising at least one of said time segments of said combined stream of output signal.
The processor may be further operable to perform a statistical test on each of said time segments assigned with said identified sectors with said reference stream portion determined for the corresponding sector.
The processor may be further operable to determine a Bayesian Information Criterion value of each of said time segments assigned with said identified sectors. The database generator may be further operable to generate a speaker matrix and accumulate a value in position (i, j) of said matrix, where i corresponds to the sector of said assigned time segment, and j corresponds to the sector corresponding to said reference stream portion with said Bayesian Information Criterion value exceeding a predefined threshold.
The processor may be operable to identify the number of entries in said speaker matrix.
In one embodiment of the invention the apparatus further comprises a detector for detecting presence of sound signal in said combined stream of output signal.
The detector may be operable to perform voice or speech activity detection.
The processor may be further operable to process said time segments of said combined stream of output signal in which sound signal is present and said identified sectors over said time interval.
The processor may be further operable to determine time difference of arrival, TDOA, values associated with said sound signal arriving at said at least some of said plurality of sound sensors.
The processor may be further operable to determine an angle of arrival of said sound signal relative to said plurality of sound sensors based on said determined TDOA values. An aspect of the invention provides a computer program product comprising computer executable instructions which, when executed by a computer, cause the computer to perform a method as set out above. The computer program product may be embodied in a carrier medium, which may be a storage medium or a signal medium. A storage medium may include optical storage means, or magnetic storage means, or electronic storage means.
The above aspects of the invention can be incorporated into a specific hardware device, a general purpose device configured by suitable software, or a combination of both. The invention can be embodied in a software product, either as a complete software implementation of the invention, or as an add-on component for modification or enhancement of existing software (such as a plug-in). Such a software product could be embodied in a carrier medium, such as a storage medium (e.g. an optical disk or a mass storage memory such as a FLASH memory) or a signal medium (such as a download).
Specific hardware devices suitable for the embodiment of the invention could include an application specific device such as an ASIC, an FPGA, a GPU, a CPU, or a DSP, or other dedicated functional hardware means. The reader will understand that none of the foregoing discussion of embodiment of the invention in software or hardware limits future implementation of the invention on yet to be discovered or defined means of execution.
Brief Description of the Drawings
In the following, embodiments of the invention will be explained in more detail with reference to the drawings, in which:
Figure 1 illustrates an arrangement of a sound sensing device deployed in a targeted space according to an embodiment of the invention;
Figure 2 is a top view of the sound sensing device of Figure 1, illustrating an array of sound sensors implemented as part of the sound sensing device according to an embodiment of the invention;
Figure 3 is a side view of the sound sensing device of Figure 1, illustrating an array of sound sensors implemented as part of the sound sensing device according to an embodiment of the invention;
Figure 4 illustrates a block diagram representation of a speaker diarisation device according to an embodiment of the invention;
Figure 5 illustrates a block diagram representation of a speaker diarisation device according to an embodiment of the invention;
Figure 6 illustrates a flow diagram of a method of determining the number of sound sources according to an embodiment of the invention;
Figure 7 illustrates a flow diagram of a process of performing a sector activity count according to an embodiment of the invention;
Figure 8 illustrates a flow diagram of a method of determining the number of sound sources according to another embodiment of the invention;
Figure 9 illustrates a flow diagram of a process of determining a speaker matrix according to an embodiment of the invention; and
Figure 10 illustrates a flow diagram of a method of determining the number of sound sources according to yet another embodiment of the invention.
Detailed Description
Specific embodiments of the present invention will be described in further detail on the basis of the attached diagrams. It will be appreciated that this is by way of example only, and should not be viewed as presenting any limitation on the scope of protection sought.
An overview of a deployment of a sound sensing device 10 for determining the number of sound sources in a targeted space, for example an indoor space (such as a meeting room 16), is illustrated in Figure 1. A skilled person in the art will appreciate that the sound sensing device 10 can also be deployed in an outdoor environment, although noise reduction and filtering techniques may need to be applied to reduce background noise.
As will be described in due course, the sound sensing device 10 is capable of transducing sound signals from a number of sound sources into electrical signals. In this example, the sound sources include a group of participants 12a, 12b, 12c, 12d, 12e gathered about a meeting table 14 in the meeting room 16. It is understood that the sound sources may also include a telephone being put on speaker mode and/or sounds from a television when a video conference is being held. It is further noted that the number of sound sources may be less than the number of participants; for example, there may be only two or three participants involved in a discussion.
As shown in Figure 1, the sound sensing device 10 is positioned around the centre of the meeting table 14. However, it is understood that the sound sensing device 10 can also be positioned anywhere on the table 14, for example on an end of the meeting table 14. Indeed, in an alternative arrangement, the sound sensing device 10 may be deployed by attaching it to the ceiling of the meeting room 16. Figures 2 and 3 illustrate a top view and a side view, respectively, of the sound sensing device 10.
As illustrated in Figure 2, the sound sensing device 10 comprises an array of sound sensors 20a-h, such as microphones, for detecting sound signals at a given time. It will be appreciated by the person skilled in the art that any suitable means for detecting sound signals may be employed.
The sound sensors 20a-h are equidistantly disposed around the circumference of the sound sensing device 10 to provide omni-directional coverage to receive sound signals from all directions. While a circular configuration is illustrated, it will be appreciated by the skilled person that other configurations are also possible. As illustrated in Figure 2, eight sound sensors are provided, but it is understood that the accuracy of sound source localisation increases with the number of sound sensors implemented. However, it is also noted that this does not prevent the present invention from being employed in a set-up where the sound sensors provided are more or fewer than eight.
Figure 4 illustrates schematically components of a speaker diarisation device 30 according to an embodiment of the invention. The speaker diarisation device 30 includes an input/output (I/O) interface 32, a working memory 34, a signal processor 36, and a mass storage unit 38.
The sound sensing device 10 comprising the sound sensor array 20a-h is in communication with the speaker diarisation device 30 to provide sound signals detected by the sound sensor array 20a-h to the speaker diarisation device 30. In an alternative configuration, the sound sensing device 10 may also be integrated with the speaker diarisation device 30.
As shown in Figure 4, the output of the sound sensing device 10 is connected to the signal processor 36 via the I/O interface 32. By this connection, the detected sound signals can be input to the signal processor 36. The I/O interface 32 also includes an analogue-to-digital converter (ADC) 40 which converts the analogue output signals from the sound sensor array 20a-h into digital input signals. It will be appreciated that the sound sensing device 10 may also include an ADC (not shown) to provide digital signals directly from its output. In operation, the sound sensing device 10 continuously monitors sound signals and provides the detected sound signals to the signal processor 36 and/or the mass storage unit 38. The received sound signals may be processed in real time by the signal processor 36 or stored as data in the mass storage unit 38 to be post-processed when required. In the context of speaker diarisation, it is noted that the sound signals generally comprise speech signals.
The output of each of the sound sensors is an output signal stream comprising presence and absence of speech signals over a time interval, for example the duration of a meeting. By means of a general purpose bus 42, external devices, such as the sound sensing device 10, user input devices (not shown), or audio/video hardware devices (not shown), are in communication with the signal processor 36 through the I/O interface 32.
The user operable input devices 42 may comprise, in this example, a keyboard and a mouse, though it will be appreciated that any other input devices could also or alternatively be provided, such as another type of pointing device, a writing tablet, speech recognition means, or any other means by which a user input action can be interpreted and converted into data signals.
Audio/video hardware devices can also be connected to the I/O interface for the output of information to a user. Audio/video output hardware devices can include a visual display unit, a speaker, or any other device capable of presenting information to a user.
The signal processor 36 is operable to execute machine code instructions stored in a working memory 34 and/or retrievable from a mass storage unit 38. The signal processor 36 processes the incoming signals according to the method described in the forthcoming paragraphs.
In another embodiment of the invention, the speaker diarisation device includes a communication unit configured to transmit stored data or to relay the received sound signals to a remote location for processing. A schematic diagram of such a speaker diarisation device 50 according to this embodiment is illustrated in Figure 5.
As shown in Figure 5, the speaker diarisation device 50 includes a communication unit 66 operable to establish communication with a remote station. It is noted that such an arrangement allows the received sound signals to be processed at the remote station. It is further noted that the received sound signals (and/or the stored data) may be transmitted in real time to the remote station, or they may be stored in the mass storage unit 58 and transmitted to the remote station when required.

In one embodiment of the present invention, there is provided a method of determining the number of sound sources in a targeted space accurately for a speaker diarisation system.
Figure 6 illustrates a flow diagram of a speaker diarisation method according to an embodiment of the invention. Referring to the flow diagram of Figure 6, the process begins with receiving data at step 100.
The received data comprises streams of output signals from each of the sound sensors received over a time interval. In this embodiment, the data corresponds to the received signals (when real-time processing is performed), or to the received signals previously stored in the mass storage unit (when post-processing is performed).
In step 102, noise reduction is performed on the data associated with each of the output signals received via sound sensors 20a-h of the sound sensor array 20 to reduce the amount of noise present in the output signals. In this example, Wiener filtering is applied to the received signals to remove any additive noise present in the signals. It will be appreciated that other noise reduction techniques or any suitable means of reducing noise of output signals can be applied. It will also be appreciated by the person skilled in the art that this is not an essential step for the purpose of the present invention. However, it is understood that the overall diarisation error rate (DER) may be reduced by performing this step.
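As a concrete illustration of this step, the following is a minimal sketch using SciPy's local-mean Wiener filter; the description names Wiener filtering but not a particular implementation, so the choice of scipy.signal.wiener and the window length are assumptions.

```python
import numpy as np
from scipy.signal import wiener

def denoise_channels(channels, window=129):
    # Apply a local-mean Wiener filter to each sensor's output stream.
    # The window length is illustrative; the description does not specify one.
    return [wiener(np.asarray(ch, dtype=float), mysize=window) for ch in channels]
```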
In an embodiment of the invention, one of the sound sensors 20a-h in the sound sensor array 20 is assigned as a reference. In this example, the sound sensor 20a is assigned as the reference. It is noted that the reference sound sensor may change during the time interval, or the same reference sound sensor may be used throughout the whole time interval. Each of the remaining sound sensors is paired with the reference sound sensor 20a, resulting in seven pairs of sound sensors as depicted in the following table.
Sound sensor pairs

  Pair   First sound sensor (reference)   Second sound sensor
  1      20a                              20b
  2      20a                              20c
  3      20a                              20d
  4      20a                              20e
  5      20a                              20f
  6      20a                              20g
  7      20a                              20h
Table 1
In step 104, time difference of arrival (TDOA) estimation is performed to identify the time difference between signals from a given sound source arriving at a pair of sound sensors.
Thus, the TDOA estimation produces seven outputs; each output corresponds to a respective pair of sound sensors. One example of performing TDOA estimation on the output signals is the Generalised Cross Correlation with Phase Transform (GCC-PHAT).
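By way of illustration, a minimal GCC-PHAT sketch in Python for one sensor pair follows; the function and variable names are illustrative rather than taken from the patent.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    # Estimate the delay (seconds) of sig relative to ref using GCC-PHAT.
    n = len(sig) + len(ref)                      # zero-pad to avoid circular wrap
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    R /= np.abs(R) + 1e-15                       # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs
```

Calling gcc_phat(signal_20b, signal_20a, fs) for each pair in Table 1 would yield the seven TDOA outputs described above.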
The resultant TDOA estimates are further improved by performing Viterbi smoothing in step 106. However, it will be understood that this is not an essential step for the purpose of the present invention. Each of the outputs of the TDOA estimation is provided to determine a sector activity map (in step 108).
The sector activity mapping is performed in step 108, and will be described in the forthcoming paragraphs with reference to Figure 7.
Referring to the flow diagram in Figure 7, the process begins with receiving the TDOA estimation values in step 200.
Since the relative location of a pair of sound sensors is known, given the TDOA values for the pair of sound sensors, the angle of arrival (AOA) of a sound signal in relation to the sound sensors can be determined. In fact, due to rotational symmetry, for a pair of sound sensors a single delay estimate results in two angles of arrival: the actual angle of arrival, and another angle of arrival reflected about the axis of the two sound sensors.
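Under a far-field assumption, the relationship between a delay tau and the angle theta for a pair spaced a distance d apart is tau = (d/c)sin(theta). A sketch of this conversion follows; the speed-of-sound value and the far-field model are assumptions, and both candidate angles are returned, as discussed above.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s at room temperature (assumed)

def angle_of_arrival(tau, pair_spacing):
    # Far-field model: tau = (pair_spacing / c) * sin(theta).
    s = np.clip(SPEED_OF_SOUND * tau / pair_spacing, -1.0, 1.0)
    theta = float(np.degrees(np.arcsin(s)))
    return theta, 180.0 - theta  # actual angle and its mirror about the pair axis
```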
In order to identify the angle of the sound source in relation to the sound sensor array, a sector activity (SA) map of N = 36 sectors is used, such that a main lobe of 10 degrees is provided in each sector. The sector map records the number of times a sound signal arrives in a particular sector. It will be appreciated that any number of sectors can be used, but the accuracy of sound source localisation can be increased with the number of sectors.
The estimated TDOA values of each microphone pair are provided, and the corresponding AOA values are determined every predetermined time interval (for example, 1 second) over a predetermined time window (for example, a 5-second window) until the end of the output signal stream.
The determined AOA values are mapped onto the SA map. Essentially, a count is incremented and accumulated, over the predetermined time interval, in a sector of the SA map (step 204) that corresponds to a determined AOA value (step 206). At the end of the predetermined time interval, the sector with the highest value is determined (step 208). The process (steps 202 to 210) is repeated in the next predetermined time interval until the end of the output signal stream (see steps 210 and 212).
For example, where the predetermined time interval is 1 second, the counts are accumulated in a 5-second time window, and the highest-scoring sector for each time window is recorded, such that for every second of the output signal stream the sector with the most activity over 5 seconds is determined.
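A sketch of this counting procedure is given below, assuming one AOA estimate per 1-second step and the 36-sector, 10-degree layout of the description; the data layout is an assumption.

```python
import numpy as np

N_SECTORS = 36  # 10-degree sectors, as in the description

def most_active_sectors(aoa_deg, window=5):
    # aoa_deg: one AOA estimate (0-360 degrees) per 1-second step of the stream.
    sectors = (np.asarray(aoa_deg) % 360 // (360 / N_SECTORS)).astype(int)
    active = []
    for t in range(len(sectors)):
        counts = np.bincount(sectors[t:t + window], minlength=N_SECTORS)
        active.append(int(np.argmax(counts)))  # highest-scoring sector per window
    return active
```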
Finally, the output of the SA map is provided in step 214. It is appreciated that the output of the SA map comprises a representation of the most active sector identified every 1 second over a 5-second time window until the end of the output signal stream. Referring back to Figure 6, the number of sound sources is determined in step 114 by determining the number of active sectors. Speaker diarisation is performed in step 116. It is noted that any suitable method of obtaining the diarisation output may be employed, and therefore details of obtaining the diarisation output will not be described.
In another embodiment of the invention, beamforming is applied using the TDOA estimates to combine the eight outputs from the sensors into a single stream of output signal. The single stream of output signal can be represented in a plurality of time segments over the time interval of the stream of output signal.
An example of the beamforming is delay-sum beamforming. As shown in Figure 8, the delay-sum beamforming is performed in step 310. It is noted that the sector activity (step 308) and the delay-sum beamforming (step 310) can be performed simultaneously or consecutively in any order.
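A minimal delay-sum sketch follows, assuming integer-sample delays derived from the TDOA estimates relative to the reference sensor; the helper name and data layout are illustrative.

```python
import numpy as np

def delay_sum(channels, tdoas, fs):
    # channels: list of 1-D arrays; tdoas: per-channel delay (s) vs. the reference.
    n = min(len(ch) for ch in channels)
    out = np.zeros(n)
    for ch, tau in zip(channels, tdoas):
        shift = int(round(tau * fs))
        # np.roll wraps at the edges; a production system would pad instead.
        out += np.roll(np.asarray(ch[:n], dtype=float), -shift)
    return out / len(channels)  # average the aligned channels into one stream
```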
In order to avoid the entire segment being incorrectly assigned to a single speaker during diarisation, the Bayesian Information Criterion (BIC) is employed to determine whether a segment of the combined output contains one or more speakers.
The Bayesian Information Criterion for an audio cluster, $C_k$, is defined as:

$$\mathrm{BIC}(C_k) = -\tfrac{1}{2}\, n_k \log \lvert \Sigma_k \rvert - \lambda P \qquad (1)$$

where $n_k$ is the number of samples in the cluster and $\Sigma_k$ is the sample covariance matrix.

The penalty, $P$, is defined as

$$P = \tfrac{1}{2} \left( d + \tfrac{1}{2}\, d(d+1) \right) \log N \qquad (2)$$

where $N$ is the total sample size and $d$ is the number of parameters per cluster. It is noted that the penalty weight, $\lambda$, is usually set to 1.

The Bayesian Information Criterion is then used to calculate whether a speech segment contains one or more different speakers and to determine whether two speech segments are from the same speaker. The increase in the BIC value for merging two segments $s_1$ and $s_2$ is defined as:

$$\Delta \mathrm{BIC} = n \log \lvert \Sigma \rvert - n_1 \log \lvert \Sigma_1 \rvert - n_2 \log \lvert \Sigma_2 \rvert - \lambda P \qquad (3)$$

where $n = n_1 + n_2$, and $\Sigma$, $\Sigma_1$ and $\Sigma_2$ are the sample covariance matrices of the merged segment and of the two individual segments respectively. If the $\Delta \mathrm{BIC}$ value is greater than zero, then the information content of the merged segments is higher than that of the individual segments; the two segments are likely to belong to the same speaker and should be merged. Similarly, a speaker change is indicated by a positive peak of the BIC value when calculating a series of BIC values for a sliding split point of a speech segment. It is noted that, for implementing the BIC, the input speech segment can be modelled as a Gaussian process in the cepstral space.
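Equations (1) to (3) can be evaluated directly from per-segment feature matrices. The sketch below implements equation (3) with lambda = 1; the small covariance regularisation is an added assumption for numerical stability, not part of the patent.

```python
import numpy as np

def _log_det_cov(x):
    # x: (n_samples, d) cepstral feature matrix; a tiny ridge keeps it invertible.
    cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(x.shape[1])
    return np.linalg.slogdet(cov)[1]

def delta_bic(x1, x2, lam=1.0):
    # Equation (3); the description treats a score above zero as evidence that
    # the two segments belong to the same speaker and should be merged.
    n1, n2 = len(x1), len(x2)
    n, d = n1 + n2, x1.shape[1]
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)  # equation (2)
    return (n * _log_det_cov(np.vstack((x1, x2)))
            - n1 * _log_det_cov(x1) - n2 * _log_det_cov(x2) - lam * penalty)
```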
In the present embodiment, the description of steps 300 to 308 in the flowchart of Figure 8 is similar to that of steps 100 to 108 described above with respect to the flowchart of Figure 6. For this reason, details of steps 300 to 308 will not be described.
In step 312 a speaker matrix is generated. The step of generating a speaker matrix will now be described in detail with reference to Figure 9.
Referring to Figure 9, the process commences with receiving the output of the beamformer and the output of the SA map in step 400.
In step 402, each of the time segments of the stream of output signal is assigned to the active sector corresponding to the time window. Once each of the time segments has been assigned to a sector, the most probable time segment assigned in each sector is selected as that sector's reference (step 404). Examples of the most probable time segment include the longest series of time segments in a sector, or the time segment(s) with the highest count value.
In step 406, the BIC score of each of the time segments is determined, using equation (3), with each of the reference segments. Accordingly, if the BIC score is greater than a threshold (e.g. 0), a count is incremented in an N x N speaker matrix (SM) at location (i, j),
where i corresponds to the sector to which the segment was originally assigned, and j corresponds to the sector of the reference segment. Ideally, this matrix would only contain entries on its diagonal, as the originally assigned sector would be the same as the sector with the highest BIC score, and the indices of the entries would then correspond to sectors with sound sources. However, it will be appreciated that in practice the SM entries tend to cluster around locations on the diagonal of the SM. In step 408, the output of the SM is provided.
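A sketch of the SM accumulation follows, reusing delta_bic from the previous sketch; the per-segment feature arrays and the sector-to-reference mapping are assumed data structures, not interfaces defined by the patent.

```python
import numpy as np

def speaker_matrix(segments, assigned, references, n_sectors=36, threshold=0.0):
    # segments[k]: feature matrix of time segment k; assigned[k]: its SA sector.
    # references: dict mapping sector j to its reference segment (step 404).
    sm = np.zeros((n_sectors, n_sectors), dtype=int)
    for feats, i in zip(segments, assigned):
        for j, ref in references.items():
            if delta_bic(feats, ref) > threshold:  # BIC test of step 406
                sm[i, j] += 1                      # count at location (i, j)
    return sm

def speaker_sectors(sm, min_count=1):
    # Sound-source sectors are read off the diagonal peaks of the SM.
    return [i for i in range(sm.shape[0]) if sm[i, i] >= min_count]
```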
Referring back to Figure 8, the number of sound sources is determined in step 314 based on the output of the SM. The sectors containing sound sources are determined based on the entries on the diagonal of the SM. It is noted that the indices of the peaks correspond to the sector numbers in which a sound source is predicted to be located, i.e. the speaker sectors, and accordingly the number of sound sources is determined.
Speaker diarisation is performed in step 316. Similarly, any suitable method of obtaining the diarisation output may be employed, and therefore details of obtaining the diarisation output will not be described.
In another embodiment, Voice Activity Detection (VAD) can be performed to detect the presence or absence of speech signal in the stream of output signal. It will be appreciated by the person skilled in the art that any suitable method of performing VAD may be employed.
As shown in Figure 10, this is performed after the beamforming in step 512. The remaining steps (steps 500 to 518) in Figure 10 are similar to those described in Figure 8 (steps 300 to 316) above. One of the advantages of performing VAD is that the SM is generated only for time segments of the stream of output signals that contain speech signals. This allows the SM to be generated without processing redundant data (such as time segments that do not contain speech signals), which results in a more efficient use of computing resources.
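The patent leaves the VAD method open; as one concrete possibility, a simple frame-energy detector could be used, with the frame length and threshold below being assumptions.

```python
import numpy as np

def energy_vad(x, fs, frame_ms=30, threshold_db=-40.0):
    # Mark frames of the beamformed stream as speech when energy exceeds a floor.
    frame = int(fs * frame_ms / 1000)
    n_frames = len(x) // frame
    frames = np.asarray(x[:n_frames * frame], dtype=float).reshape(n_frames, frame)
    energy_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    return energy_db > threshold_db  # boolean speech mask, one value per frame
```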
In the event that the speech segment overlaps two time windows, and the time windows contain different active sectors, the segment is split into two segments, and each is assigned to the corresponding active sector from the two time windows.
In another embodiment, speech segmentation can be performed to detect the presence of multiple speakers in a stream segment of the output signal. It will be appreciated by the person skilled in the art that any suitable method of performing segmentation may be employed.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel methods and apparatus described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the methods and systems described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.
GB201203810A 2012-03-05 2012-03-05 A speaker diarization system Withdrawn GB2501058A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB201203810A GB2501058A (en) 2012-03-05 2012-03-05 A speaker diarization system
PCT/GB2013/050271 WO2013132216A1 (en) 2012-03-05 2013-02-06 Method and apparatus for determining the number of sound sources in a targeted space

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB201203810A GB2501058A (en) 2012-03-05 2012-03-05 A speaker diarization system

Publications (2)

Publication Number Publication Date
GB201203810D0 GB201203810D0 (en) 2012-04-18
GB2501058A true GB2501058A (en) 2013-10-16

Family

ID=46003112

Family Applications (1)

Application Number Title Priority Date Filing Date
GB201203810A Withdrawn GB2501058A (en) 2012-03-05 2012-03-05 A speaker diarization system

Country Status (2)

Country Link
GB (1) GB2501058A (en)
WO (1) WO2013132216A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201406070D0 (en) 2014-04-04 2014-05-21 Eads Uk Ltd Method of capturing and structuring information from a meeting
US10424317B2 (en) * 2016-09-14 2019-09-24 Nuance Communications, Inc. Method for microphone selection and multi-talker segmentation with ambient automated speech recognition (ASR)
CN112185413B (en) * 2020-09-30 2024-04-12 北京搜狗科技发展有限公司 Voice processing method and device for voice processing
CN116030815B (en) * 2023-03-30 2023-06-20 北京建筑大学 Voice segmentation clustering method and device based on sound source position

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110119060A1 (en) * 2009-11-15 2011-05-19 International Business Machines Corporation Method and system for speaker diarization
US20110251843A1 (en) * 2010-04-08 2011-10-13 International Business Machines Corporation Compensation of intra-speaker variability in speaker diarization
US20110320197A1 (en) * 2010-06-23 2011-12-29 Telefonica S.A. Method for indexing multimedia information

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6687187B2 (en) * 2000-08-11 2004-02-03 Phonak Ag Method for directional location and locating system
FR2947931A1 (en) * 2009-07-10 2011-01-14 France Telecom LOCATION OF SOURCES
FR2954513B1 (en) * 2009-12-21 2014-08-15 Thales Sa METHOD AND SYSTEM FOR ESTIMATING THE NUMBER OF SOURCES INCIDENTED TO A SENSOR ARRAY BY ESTIMATING NOISE STATISTICS

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110119060A1 (en) * 2009-11-15 2011-05-19 International Business Machines Corporation Method and system for speaker diarization
US20110251843A1 (en) * 2010-04-08 2011-10-13 International Business Machines Corporation Compensation of intra-speaker variability in speaker diarization
US20110320197A1 (en) * 2010-06-23 2011-12-29 Telefonica S.A. Method for indexing multimedia information

Also Published As

Publication number Publication date
WO2013132216A1 (en) 2013-09-12
GB201203810D0 (en) 2012-04-18

Similar Documents

Publication Publication Date Title
JP4815661B2 (en) Signal processing apparatus and signal processing method
US9595259B2 (en) Sound source-separating device and sound source-separating method
US9837099B1 (en) Method and system for beam selection in microphone array beamformers
US9361907B2 (en) Sound signal processing apparatus, sound signal processing method, and program
CN110875060A (en) Voice signal processing method, device, system, equipment and storage medium
JP3812887B2 (en) Signal processing system and method
US20040252845A1 (en) System and process for sound source localization using microphone array beamsteering
JP6467736B2 (en) Sound source position estimating apparatus, sound source position estimating method, and sound source position estimating program
JP4565162B2 (en) Speech event separation method, speech event separation system, and speech event separation program
US11869481B2 (en) Speech signal recognition method and device
CN111429939B (en) Sound signal separation method of double sound sources and pickup
JP7194897B2 (en) Signal processing device and signal processing method
JP2008236077A (en) Target sound extracting apparatus, target sound extracting program
CN109741609B (en) Motor vehicle whistling monitoring method based on microphone array
GB2501058A (en) A speaker diarization system
Yella et al. Speaker diarization of overlapping speech based on silence distribution in meeting recordings
JP2022533300A (en) Speech enhancement using cue clustering
CN112394324A (en) Microphone array-based remote sound source positioning method and system
US10229686B2 (en) Methods and apparatus for speech segmentation using multiple metadata
Nakadai et al. Footstep detection and classification using distributed microphones
JP2006227328A (en) Sound processor
Salvati et al. A real-time system for multiple acoustic sources localization based on ISP comparison
CN110992972A (en) Sound source noise reduction method based on multi-microphone earphone, electronic equipment and computer readable storage medium
US20230116052A1 (en) Array geometry agnostic multi-channel personalized speech enhancement
Inoue et al. Speaker diarization using eye-gaze information in multi-party conversations

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)