GB2489489A - An integrated auto-diarization system which identifies a plurality of speakers in audio data and decodes the speech to create a transcript - Google Patents

Info

Publication number
GB2489489A
Authority
GB
United Kingdom
Prior art keywords
speaker
speech
segment
profile
decoding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB1105415.2A
Other versions
GB201105415D0 (en
GB2489489B (en
Inventor
Catherine Breslin
Mark John Francis Gales
Kean Kheong Chin
Katherine Mary Knill
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Europe Ltd
Original Assignee
Toshiba Research Europe Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Research Europe Ltd
Priority to GB1105415.2A (patent GB2489489B)
Publication of GB201105415D0
Priority to US13/215,711 (patent US8612224B2)
Publication of GB2489489A
Application granted
Publication of GB2489489B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/005

Abstract

The invention lies in integrated auto-diarization. A method for identifying a plurality of speakers in audio data and for decoding the speech spoken by the users is disclosed. The method comprises receiving speech, dividing the speech into segments as it is received, processing the received speech segment by segment in the order received to identify the speaker and to decode the speech, the processing comprising, performing primary decoding of the segment using an acoustic model and a language model, obtaining segment parameters indicating the differences between the speaker of the segment and a base speaker during the primary decoding, comparing the segment parameters with a plurality of stored speaker profiles to determine the identity of the speaker and selecting a speaker profile for the speaker, updating the selected speaker profile, performing a further decoding of the segment using a speaker independent acoustic model, adapted using the updated speaker profile and outputting the decoded speech for the identified speaker wherein the speaker profiles are updated as further segments of speech relating to a speaker profile are processed.

Description

A Speech Processing System and Method
FIELD
Embodiments of the present invention relate generally to speech processing systems and methods.
BACKGROUND
Diarisation is generally referred to as the task of finding and classifying homogeneous segments in an audio stream for the purpose of further processing. For example, in a meeting, diarisation can apply to the task of taking the audio stream for that meeting and identifying who spoke when. Speech recognition is the task of decoding the received audio into text, and speaker adaptation is, in very general terms, the ability of a speech recognition system to adapt to, and hence decode speech from, different speakers.
An integrated auto-diarisation system is one which combines diarisation and recognition with adaptation to give a transcript of who said what and when.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention will now be described with reference to the following embodiments in which: Figure 1 is a schematic of a system in accordance with an embodiment of the present invention in use in a meeting; Figure 2 is a schematic of the system of figure 1 in more detail; and Figure 3 is a flow diagram illustrating a method in accordance with an embodiment of the present invention.
DETAILED DESCRIPTION
According to one embodiment, a method is provided for identifying a plurality of speakers in audio data and for decoding the speech spoken by said speakers; the method comprising: receiving speech; dividing the speech into segments as it is received; processing the received speech segment by segment in the order received to identify the speaker and to decode the speech, processing comprising: performing primary decoding of the segment using an acoustic model and a language model; obtaining segment parameters indicating the differences between the speaker of the segment and a base speaker during the primary decoding; comparing the segment parameters with a plurality of stored speaker profiles to determine the identity of the speaker, and selecting a speaker profile for said speaker; updating the selected speaker profile; performing a further decoding of the segment using a speaker independent acoustic model, adapted using the updated speaker profile; outputting the decoded speech for the identified speaker, wherein the speaker profiles are updated as further segments of speech relating to a speaker profile are processed.
Embodiments of the present invention are concerned with a so-called integrated auto-diarisation system which can identify the speakers at a meeting and decode the speech from those speakers. This allows an automatic transcript of the meeting, identifying who said what and when, to be produced on-line as the audio progresses.
If the segment parameters do not match closely enough with a stored speaker profile, in an embodiment, a new speaker profile is initialised based on the statistics obtained for the current segment and any relevant prior knowledge.
In an embodiment, obtaining segment parameters comprises obtaining the parameters which allow a speaker transform to be obtained, said speaker transform adapting the speaker independent or canonical model to the new speaker.
The speaker transform may be an adaptation transform. In a further embodiment, the speaker transform is an MLLR or CMLLR transform. The speaker transform may comprise both adaptive and prior statistics.
The primary decoding uses a language model or grammar model. In an embodiment, this is a 4-gram model, but lower or higher order models may also be used. In a further embodiment, no language model is used in the secondary decoding, which comprises rescoring a lattice of possible text corresponding to the segment. In a yet further embodiment, a language model is used in the secondary decoding.
Dividing the input speech into segments may comprise detecting where there is silence in the input speech or where the speaker changes.
In an embodiment, the base speaker is a canonical speaker and a speaker independent acoustic model is used. In a further embodiment, the base speaker is a previous speaker and the acoustic model used in the primary decoding is adapted for the previous speaker.
In a further embodiment, a system is provided for identifying a plurality of speakers in audio data and for decoding the speech spoken by said speakers; the system comprising: a receiver for audio containing speech; and a processor, said processor being adapted to: divide the speech into segments as it is received; process the received speech segment by segment in the order received to identify the speaker and to decode the speech, processing comprising: perform primary decoding of the segment using an acoustic model and a language model; obtain segment parameters indicating the differences between the speaker of the segment and a base speaker during the primary decoding; compare the segment parameters with a plurality of stored speaker profiles to determine the identity of the speaker, and selecting a speaker profile for said speaker; update the selected speaker profile; and perform a further decoding of the segment using a speaker independent acoustic model, adapted using the updated speaker profile, the system further comprising an output for outputting the decoded speech for the identified speaker, wherein the speaker profiles are updated as further segments of speech relating to a speaker profile are processed.
Methods and systems in accordance with embodiments can be implemented either in hardware or on software in a general purpose computer. Further embodiments can be implemented in a combination of hardware and software. Embodiments may also be implemented by a single processing apparatus or a distributed network of processing apparatuses.
Since methods and systems in accordance with embodiments can be implemented by software, systems and methods in accordance with embodiments may be implemented using computer code provided to a general purpose computer on any suitable carrier medium. The carrier medium can comprise any storage medium such as a floppy disk, a CD ROM, a magnetic device or a programmable memory device, or any transient medium such as any signal, e.g. an electrical, optical or microwave signal.
Figure 1 is a schematic of a system in accordance with an embodiment of the present invention.
The integrated auto-diarisation and speech recognition system 1 can be used for an audio stream where there are a number of speakers. In the arrangement of figure 1, there is speaker A and speaker B. However, it will be appreciated that the system could be extended to a large number of speakers. The system 1 is provided on a table in figure 1 between speaker A and B. However, it could be placed anywhere in the room providing that the system was capable of detecting speech signals from the speakers at the meeting.
The system 1 in figure 1 is shown on a table taking speech inputs from two nearby speakers. However, the system could also be integrated into a speaker for example of a telephone conferencing system where it was not physically located close to every speaker but instead received speech signals over a telephone line, internet connection etc. The system 1 comprises a voice input module 11. Voice input module 11 can be a microphone or configured to receive speech inputs from telephone lines, over the internet etc. The voice input module 11 will then output speech signals into identification and decoding section 13. Identification and decoding section 13 will be described in more detail with reference to figure 2. The output from identification and decoding module 13 is then outputted to output module 15. Output module 15 in one embodiment will output a transcription of the meeting with the speech spoken by each speaker listed. The transcript may identify the speaker with the speech spoken by that speaker in chronological order.
Figure 2 is a schematic of the diarisation system 1 of figure 1. The system 1 comprises an identification and decoding section 13, a voice input 11 and an output 15.
The identification and decoding section 13 comprises processor 53 which runs voice conversion application 55. The section 13 is also provided with memory 57 which communicates with the application as directed by the processor 53. There is also provided a front end unit 61 and an output preparation module 63.
Front end unit 61 receives a speech input from voice input 11. Voice input 11 may be a microphone or a microphone array, or the speech may be received from a storage medium, streamed online etc. The front end unit 61 digitises the received speech signal and splits it into frames of equal lengths. The speech signals are then subjected to a spectral analysis to determine various parameters which are plotted in an "acoustic space" or feature space.
The parameters which are derived will be discussed in more detail later.
The front end unit 61 also removes signals which are believed not to be speech signals and other irrelevant information. Popular front end units comprise apparatus which use filter bank (FBANK) parameters, Mel Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Predictive (PLP) parameters. The output of the front end unit 61 is in the form of an input vector which is in n-dimensional acoustic space.
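The framing step performed by the front end unit can be sketched as follows. This is a minimal illustration rather than the implementation described in the patent: the function name `split_into_frames`, the `hop` parameter (which allows overlapping frames) and the choice to drop a trailing partial frame are all assumptions made for the example.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """Split digitised samples into frames of equal length.

    Hypothetical helper: `hop` is the step between frame starts, so
    hop == frame_len gives non-overlapping frames. A trailing partial
    frame is dropped so that every frame has exactly frame_len samples.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    frames: List[List[float]] = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(list(samples[start:start + frame_len]))
        start += hop
    return frames
```

Each frame would then be passed to a spectral analysis (for example FBANK, MFCC or PLP extraction) to produce one n-dimensional feature vector per frame.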
Front end unit 61 then communicates the input data to the processor 53 running application 55. Application 55 outputs data corresponding to the text of the speech input via module 11 and also identifies the speaker of the text.
The application 55 comprises a decoder, an acoustic model and a language model. The acoustic model section will generally operate using Hidden Markov Models. However, it is also possible to use acoustic models based on for example, connectionist models and hybrid models.
The acoustic model derives the likelihood of a sequence of observations corresponding to a word or part thereof on the basis of the acoustic input alone.
The language model contains information concerning probabilities of a certain sequence of words or parts of words following each other in a given language. Generally a static model is used. The most popular method is the N-gram model.
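As an illustration of an N-gram model, the sketch below estimates bigram (N = 2) probabilities by simple relative frequency. It is an assumption-laden toy: the function name `train_bigram`, the `<s>` and `</s>` boundary markers and the absence of smoothing are choices made for the example, and a practical system would use a higher order model with smoothing.

```python
from collections import Counter
from typing import Dict, List, Tuple


def train_bigram(sentences: List[List[str]]) -> Dict[Tuple[str, str], float]:
    """Estimate P(word | previous word) by relative frequency (no smoothing)."""
    history_counts: Counter = Counter()
    bigram_counts: Counter = Counter()
    for words in sentences:
        # Hypothetical boundary markers so that sentence starts and ends are modelled.
        padded = ["<s>"] + words + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return {bg: count / history_counts[bg[0]] for bg, count in bigram_counts.items()}
```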
The decoder then traditionally uses a dynamic programming (DP) approach to find the best transcription for a given speech segment using the results from the acoustic model and the language model.
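A common dynamic programming approach of this kind is the Viterbi algorithm. The sketch below finds the single best state sequence for a small discrete HMM; it is a simplified stand-in for a real decoder, which would search over word sequences and combine acoustic and language model scores. The names `viterbi` and `logs` and the table layout are assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple

LogTable = Dict[str, Dict[str, float]]


def logs(probs: Dict[str, float]) -> Dict[str, float]:
    """Convert a table of probabilities to log probabilities."""
    return {key: math.log(value) for key, value in probs.items()}


def viterbi(obs: Sequence[str], states: Sequence[str], log_init: Dict[str, float],
            log_trans: LogTable, log_emit: LogTable) -> Tuple[List[str], float]:
    """Return the most likely state path and its log score for a non-empty obs."""
    if not obs:
        raise ValueError("obs must contain at least one observation")
    # best[t][s]: best log score of any path that ends in state s at time t
    best = [{s: log_init[s] + log_emit[s][obs[0]] for s in states}]
    back: List[Dict[str, str]] = []
    for o in obs[1:]:
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][o]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]
```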
This data is then output to output preparation module 63, which converts the data into a form to be output by output 15. Output 15 may be a display on a screen or a printout, or a file may be directed towards a storage medium, streamed over the Internet or directed towards a further program as required. In one embodiment, the text output is translated into a further language.
The acoustic model which is to be used for speech recognition will need to cope under different conditions such as for different speakers and/or under different noise conditions.
The above overview of speech recognition has not considered the case of multiple different speakers. Systems and methods in accordance with embodiments of the present invention handle speech recognition from a plurality of different speakers.
Systems and methods in accordance with embodiments of the present invention adapt the acoustic model to the different speakers.
Speaker adaptation methods such as Maximum Likelihood Linear Regression (MLLR) and Constrained MLLR (CMLLR) can be used.
In the adaptation technique of CMLLR, each speech observation vector $o_t$ is transformed by a linear transform $W^{(r)} = [A^{(r)}\ b^{(r)}]$:

$$\hat{o}_t = A^{(r)} o_t + b^{(r)} \tag{1}$$

where $o_t$ is the feature vector or speech vector to be transformed and $\hat{o}_t$ is the feature vector after transformation.
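The feature transformation of equation (1) is an affine map applied to every frame. A minimal sketch in plain Python follows; the function name `cmllr_transform` and the list-of-lists matrix representation are assumptions for illustration only.

```python
from typing import List, Sequence


def cmllr_transform(A: Sequence[Sequence[float]], b: Sequence[float],
                    o: Sequence[float]) -> List[float]:
    """Apply the affine feature transform A o + b of equation (1) to one frame."""
    return [sum(a_ij * o_j for a_ij, o_j in zip(row, o)) + b_i
            for row, b_i in zip(A, b)]
```

With the identity matrix for A and a zero vector for b, the frame is returned unchanged, which corresponds to the identity transform mentioned later in the description.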
Many voice recognition systems use Hidden Markov Models (HMMs) where the HMMs have states with output distributions. The output distributions are often based on Gaussian Mixture Models (GMMs) with each component m of a GMM being a Gaussian defined by its mean and variance.
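For such a system, equation (1) yields the likelihood

$$p(o_t \mid m) = |A^{(r)}|\, \mathcal{N}\!\left(A^{(r)} o_t + b^{(r)};\ \mu^{(m)}, \Sigma^{(m)}\right) \tag{2}$$

for component $m$ in regression class $r$ that has mean and variance $\mu^{(m)}$ and $\Sigma^{(m)}$.

To estimate the CMLLR transform parameters $W$ using maximum likelihood, the auxiliary function $Q(W, \hat{W})$ is used:

$$Q(W, \hat{W}) = \sum_{t} \sum_{m} \gamma_t^{(m)} \log\!\left( |A|\, \mathcal{N}\!\left(A o_t + b;\ \mu^{(m)}, \Sigma^{(m)}\right) \right) \tag{3}$$

The $i$th row of the transform can be estimated iteratively using

$$w_i^{(r)} = \left(\alpha p_i + k_i^{(r)}\right) \left(G_i^{(r)}\right)^{-1} \tag{4}$$

where $p_i$ is the extended co-factor row vector of $A$ and $\alpha$ satisfies the quadratic equation

$$\alpha^2\, p_i \left(G_i^{(r)}\right)^{-1} p_i^{T} + \alpha\, p_i \left(G_i^{(r)}\right)^{-1} k_i^{(r)T} - \beta^{(r)} = 0 \tag{5}$$

CMLLR transforms are normally estimated from a set of statistics $G_i^{(r)}$, $k_i^{(r)}$ and $\beta^{(r)}$ accumulated from data:

$$G_i^{(r)} = \sum_{m \in r} \frac{1}{\sigma_i^{(m)2}} \sum_{t=1}^{T} \gamma_t^{(m)} \begin{bmatrix} 1 & o_t^{T} \\ o_t & o_t o_t^{T} \end{bmatrix} \tag{6}$$

$$k_i^{(r)} = \sum_{m \in r} \frac{\mu_i^{(m)}}{\sigma_i^{(m)2}} \sum_{t=1}^{T} \gamma_t^{(m)} \begin{bmatrix} 1 & o_t^{T} \end{bmatrix} \tag{7}$$

$$\beta^{(r)} = \sum_{m \in r} \sum_{t=1}^{T} \gamma_t^{(m)} \tag{8}$$

where $\gamma_t^{(m)}$ is the posterior probability that frame $o_t$ is aligned with component $m$. For convenience, the dependence on the regression class $r$ is dropped below.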
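The accumulation of the statistics in equations (6) to (8) can be sketched in plain Python for a single regression class with diagonal covariances. The function name `accumulate_stats` and the argument layout are assumptions; in a real system the posteriors would come from a forward-backward pass rather than being supplied directly.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def accumulate_stats(frames: Sequence[Sequence[float]],
                     posteriors: Sequence[Sequence[float]],
                     means: Sequence[Sequence[float]],
                     variances: Sequence[Sequence[float]]) -> Tuple[List[Matrix], Matrix, float]:
    """Accumulate G_i, k_i and beta for one regression class.

    frames:     T frames of dimension n
    posteriors: T rows, one posterior per Gaussian component (gamma_t^(m))
    means, variances: one row per component, diagonal covariance assumed
    """
    n = len(frames[0])
    G = [[[0.0] * (n + 1) for _ in range(n + 1)] for _ in range(n)]
    k = [[0.0] * (n + 1) for _ in range(n)]
    beta = 0.0
    for o, gammas in zip(frames, posteriors):
        zeta = [1.0] + list(o)  # extended observation [1, o_t]
        for m, gamma in enumerate(gammas):
            beta += gamma
            for i in range(n):
                weight = gamma / variances[m][i]
                for p in range(n + 1):
                    k[i][p] += weight * means[m][i] * zeta[p]
                    for q in range(n + 1):
                        G[i][p][q] += weight * zeta[p] * zeta[q]
    return G, k, beta
```

Because the statistics are plain sums over frames, statistics from different segments of the same speaker can simply be added together.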
By using an adaptation framework of the above type, it is possible to adapt the input vectors for a speaker to that of a canonical speaker for which the acoustic model has been trained.
Multiple speakers can be handled in such a system by establishing a set of adaptation transforms for each speaker. In the above CMLLR example, a set of statistics $G_i$, $k_i$ and $\beta$ can be accumulated for each speaker and, from these statistics, a transform is estimated.
In a method in accordance with this embodiment, speaker clustering and adaptation is performed where individual speakers are represented by CMLLR transforms, and those transforms are used for both speaker clustering and adaptation.
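In a method in accordance with an embodiment, speaker $s$ is represented by a speaker profile $p^{(s)}$ containing a set of statistics $G_i^{(s)}$, $k_i^{(s)}$ and $\beta^{(s)}$, and the adaptation transform $W^{(s)}$ estimated from those statistics. At any time, the set of profiles $P = [p^{(1)} \dots p^{(S)}]$ represents all the speakers that have been seen thus far. A further set of transforms represents 'generic' speakers: $[W^{(g_1)} \dots W^{(g_N)}]$. For example, this set could contain an identity transform, training data transforms, male and female speaker transforms etc. The complete set of transforms is given by

$$\mathcal{W} = [W^{(1)} \dots W^{(S)}\ W^{(g_1)} \dots W^{(g_N)}]$$

For each new segment, a first-pass hypothesis is obtained. In a multi-pass decoding framework, speaker identity is not normally used in the first pass. Next, the segment specific statistics $G_i$, $k_i$ and $\beta$ are accumulated using the forward-backward algorithm and this first-pass hypothesis:

$$G_i = \sum_{m} \frac{1}{\sigma_i^{(m)2}} \sum_{t=1}^{T} \gamma_t^{(m)} \begin{bmatrix} 1 & o_t^{T} \\ o_t & o_t o_t^{T} \end{bmatrix} \tag{9}$$

$$k_i = \sum_{m} \frac{\mu_i^{(m)}}{\sigma_i^{(m)2}} \sum_{t=1}^{T} \gamma_t^{(m)} \begin{bmatrix} 1 & o_t^{T} \end{bmatrix} \tag{10}$$

$$\beta = \sum_{m} \sum_{t=1}^{T} \gamma_t^{(m)} \tag{11}$$

The auxiliary function that approximates the segment likelihood, $Q(W)$, is given by:

$$Q(W) = \sum_{t} \sum_{m} \gamma_t^{(m)} \log\!\left( |A|\, \mathcal{N}\!\left(A o_t + b;\ \mu^{(m)}, \Sigma^{(m)}\right) \right) = T \log |A| - \frac{1}{2} \sum_{i} \left( w_i G_i w_i^{T} - 2 k_i w_i^{T} \right) + \text{const} \tag{12}$$

where const collects the terms that do not depend on $W$. For diagonal covariances, the set of transforms $\mathcal{W}$ described above can be used directly to select the transform $\hat{W}$ which yields the highest segment likelihood:

$$\hat{W} = \arg\max_{W \in \mathcal{W}} Q(W) \tag{13}$$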
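The selection of equation (13) can be sketched as below, evaluating the auxiliary function of equation (12) from the accumulated statistics only, without revisiting the audio frames. Several points are assumptions of this sketch: the names `log_abs_det`, `auxiliary_q` and `select_transform`; the ordering of each extended row as [b_i, a_i], chosen to match the [1, o_t] ordering of the statistics; and the omission of the constant term, which does not affect the arg max.

```python
import math
from typing import Dict, List, Sequence, Tuple

Matrix = List[List[float]]
Transform = Tuple[Matrix, List[float]]  # (A, b)


def log_abs_det(A: Sequence[Sequence[float]]) -> float:
    """log|det A| by Gaussian elimination with partial pivoting."""
    M = [list(row) for row in A]
    n = len(M)
    total = 0.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        if M[pivot][c] == 0.0:
            return float("-inf")  # singular matrix
        M[c], M[pivot] = M[pivot], M[c]
        total += math.log(abs(M[c][c]))
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for j in range(c, n):
                M[r][j] -= factor * M[c][j]
    return total


def auxiliary_q(A: Matrix, b: List[float], G: List[Matrix], k: Matrix,
                num_frames: float) -> float:
    """Q(W) of equation (12), dropping terms that do not depend on W."""
    n = len(A)
    total = num_frames * log_abs_det(A)
    for i in range(n):
        w = [b[i]] + list(A[i])  # extended row, ordered to match [1, o_t]
        quad = sum(w[p] * G[i][p][c] * w[c] for p in range(n + 1) for c in range(n + 1))
        lin = sum(w[p] * k[i][p] for p in range(n + 1))
        total -= 0.5 * (quad - 2.0 * lin)
    return total


def select_transform(candidates: Dict[str, Transform], G: List[Matrix], k: Matrix,
                     num_frames: float) -> str:
    """Return the name of the candidate transform with the highest Q(W)."""
    return max(candidates, key=lambda name: auxiliary_q(*candidates[name], G, k, num_frames))
```

The candidate dictionary would hold both the speaker-specific transforms from the stored profiles and the generic transforms, so a single call covers speaker identification and new-speaker detection.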
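If one of the speaker-specific transforms $[W^{(1)} \dots W^{(S)}]$ is selected, then the corresponding speaker profile is updated by adding the current segment statistics to those stored in the profile

$$G_i^{(s)} \leftarrow G_i^{(s)} + G_i \tag{14}$$

$$k_i^{(s)} \leftarrow k_i^{(s)} + k_i \tag{15}$$

$$\beta^{(s)} \leftarrow \beta^{(s)} + \beta \tag{16}$$

and the transform $W^{(s)}$ recomputed. If one of the 'generic' speaker transforms $[W^{(g_1)} \dots W^{(g_N)}]$ is selected, then a new speaker profile $p^{(S+1)}$ is initialised. To obtain a robust transform estimate in the first instance, the statistics for this profile may be initialised with the selected generic transform as prior. That is, the statistics for estimating the CMLLR transform, $G_i^{(s)}$ and $k_i^{(s)}$, are based on interpolating segment-based and prior statistics:

$$G_i^{(s)} = G_i + \tau\, \frac{G_i^{(pr)}}{\beta^{(pr)}} \tag{17}$$

$$k_i^{(s)} = k_i + \tau\, \frac{k_i^{(pr)}}{\beta^{(pr)}} \tag{18}$$

$$\beta^{(s)} = \beta + \tau \tag{19}$$

The prior statistics are normalised so that they effectively contribute $\tau$ frames to the final statistics.

The prior statistics are given by:

$$G_i^{(pr)} = \sum_{m} \frac{\gamma^{(m)}}{\sigma_i^{(m)2}} \begin{bmatrix} 1 & \mathcal{E}\{o^{T} \mid m\} \\ \mathcal{E}\{o \mid m\} & \mathcal{E}\{o o^{T} \mid m\} \end{bmatrix} \tag{20}$$

$$k_i^{(pr)} = \sum_{m} \frac{\gamma^{(m)} \mu_i^{(m)}}{\sigma_i^{(m)2}} \begin{bmatrix} 1 & \mathcal{E}\{o^{T} \mid m\} \end{bmatrix} \tag{21}$$

$$\beta^{(pr)} = \sum_{m} \gamma^{(m)} \tag{22}$$

where $\mathcal{E}\{o \mid m\}$ and $\mathcal{E}\{o o^{T} \mid m\}$ are estimated from a target distribution for each component and $\gamma^{(m)}$ is the component occupancy from training data. This can be implemented as a set of statistics accumulated offline from training data, modified by the inverse of the generic transform at runtime.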
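The profile update by adding segment statistics, and the initialisation of a new profile from prior statistics scaled to contribute a fixed number of frames, can be sketched as follows. The `Stats` container and the function names are hypothetical, and the scaling by `tau / prior.beta` is this sketch's reading of the statement that the prior statistics are normalised to contribute τ frames.

```python
from dataclasses import dataclass
from typing import List

Matrix = List[List[float]]


@dataclass
class Stats:
    """Hypothetical container for CMLLR statistics."""
    G: List[Matrix]  # one (n+1) x (n+1) matrix per feature dimension
    k: Matrix        # one (n+1)-vector per feature dimension
    beta: float      # total occupancy in frames


def add_stats(a: Stats, b: Stats) -> Stats:
    """Element-wise sum, used to add segment statistics to a profile."""
    G = [[[x + y for x, y in zip(ra, rb)] for ra, rb in zip(Ga, Gb)]
         for Ga, Gb in zip(a.G, b.G)]
    k = [[x + y for x, y in zip(ka, kb)] for ka, kb in zip(a.k, b.k)]
    return Stats(G, k, a.beta + b.beta)


def scale_stats(s: Stats, factor: float) -> Stats:
    """Multiply every statistic by a constant factor."""
    G = [[[factor * x for x in row] for row in Gi] for Gi in s.G]
    k = [[factor * x for x in ki] for ki in s.k]
    return Stats(G, k, factor * s.beta)


def init_profile_with_prior(segment: Stats, prior: Stats, tau: float) -> Stats:
    """New profile: segment statistics plus prior scaled to count as tau frames."""
    return add_stats(segment, scale_stats(prior, tau / prior.beta))
```

As later segments are added with `add_stats`, the fixed contribution of the prior becomes a smaller share of the total, so the speaker's own data gradually dominates.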
In the above described embodiment, a current segment is either assigned to an existing speaker profile, or a new speaker profile is created. The adaptation transform from that profile can be used immediately to perform adaptation and to re-decode the current segment in a multi-pass decoding framework.
Figure 3 is a flow diagram showing a method in accordance with an embodiment of the present invention. The method may be used with the type of system described with reference to Figure 1. Using the terminology of figure 1, and for simplicity assuming that there are just two speakers, speaker A and speaker B, when the first speaker, speaker A starts to speak, the system will switch on.
In step S101, the speech will be broken into segments. The segments will contain one or more words. In one embodiment, the segments are selected so that a whole segment corresponds to the speech of just one speaker. This may be achieved for example by using a detector which can determine the difference between speech and silence. The silent portions segregate the speech into segments. For example, a GMM (Gaussian mixture model) speech/silence detector may be used.
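Segmentation at silent portions can be sketched as below. Note that this example swaps the GMM speech/silence detector for a plain energy threshold, purely to keep the illustration short; the function name, the threshold and the `min_silence_frames` parameter are assumptions.

```python
from typing import List, Sequence, Tuple


def segment_by_silence(frame_energies: Sequence[float], threshold: float,
                       min_silence_frames: int) -> List[Tuple[int, int]]:
    """Return (start, end) frame index pairs, end exclusive, for speech segments.

    A frame counts as speech when its energy reaches `threshold`. A segment is
    closed once `min_silence_frames` consecutive silent frames have been seen,
    so short pauses inside a segment do not split it.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    silence_run = 0
    for idx, energy in enumerate(frame_energies):
        if energy >= threshold:
            if start is None:
                start = idx
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= min_silence_frames:
                # The segment ends just before the run of silent frames.
                segments.append((start, idx - silence_run + 1))
                start = None
                silence_run = 0
    if start is not None:
        segments.append((start, len(frame_energies) - silence_run))
    return segments
```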
In a further embodiment, the output of the speech recognition stage (described below) can be used. For example, if the speech recognition stage indicates silence with high confidence, a boundary between segments can be set. This avoids the need for a separate module.
The system may be provided with a push to talk interface where a user pushes a button on the system to indicate that they are starting to speak. The input from this can be used to segment the speech.
Once the first segment has been determined, it is then passed into the primary transcription stage in step S103. In this embodiment, speech is represented by a set of speaker independent or canonical hidden Markov models (HMMs 105). These speaker independent HMMs are stored in database 105. In order to perform recognition step S103, speaker independent HMMs 105 are used. In an embodiment, the primary transcription stage will also use a language model. However, as this is a primary transcription stage and a further transcription stage will take place later, usually a lower order language model such as a bigram can be used.
A hypothesis for the input speech will be first determined in step S103 and from this, a forward/backward algorithm using HMMs is carried out to determine statistics for the segment. In this particular embodiment, the statistics for CMLLR are used.
However, other types of speaker adaptation could be used, for example MLLR, transforms estimated with a prior, noise robustness techniques, or any combination thereof.
Confidence measures can also be used during the forward-backward pass to weight the resulting statistics.
In this embodiment, in step S107, it is then determined whether or not the speaker is a new speaker. Speakers who have already been introduced to the system will have a speaker profile stored in speaker profile database 109. A speaker identified as a new speaker in step S107 will not have a pre-stored speaker profile in the speaker profile database 109. There are many different ways of determining whether or not a speaker is a new speaker in step S107. For example:
1) The identity transform could be used. If the identity transform gives a higher likelihood or value of auxiliary function Q than those obtained from the other transforms from the set of speaker profiles, this indicates that there could be a new speaker; or
2) Similarly, a generic speaker transform could be used. If this gives a higher likelihood or value of auxiliary function than those obtained from the other transforms from the set of speaker profiles, this indicates that there could be a new speaker; or
3) Another approach would be to look at the likelihoods or values of auxiliary functions derived from adaptation transforms from the set of speaker profiles. If these are below a threshold, it may indicate that there is a new speaker.
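The three tests listed above can be combined into one decision function, sketched here. The description presents them as alternatives, so applying them together is a choice made for this example; the function name `is_new_speaker` and its arguments are likewise hypothetical. Scores are values of the auxiliary function Q (or likelihoods) for the current segment.

```python
from typing import Dict, Optional


def is_new_speaker(profile_scores: Dict[str, float],
                   generic_scores: Dict[str, float],
                   threshold: Optional[float] = None) -> bool:
    """Decide whether the current segment comes from an unseen speaker.

    profile_scores: score of each stored speaker profile's transform
    generic_scores: score of the identity and/or other generic transforms
    threshold:      optional minimum score for accepting an existing profile
    """
    if not profile_scores:
        return True  # no profiles stored yet
    best_profile = max(profile_scores.values())
    # Tests 1 and 2: an identity or generic transform beats every profile.
    if generic_scores and max(generic_scores.values()) > best_profile:
        return True
    # Test 3: even the best profile scores below the threshold.
    if threshold is not None and best_profile < threshold:
        return True
    return False
```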
If it is determined from step S107 that there is a new speaker, a new speaker profile is created and initialised in step S111 and the new speaker profile is saved in speaker profiles 109. A speaker profile will be generated in step S111 using the statistics derived for that speaker in primary transcription and any appropriate prior information.
If the segment is not from a new speaker, it is assigned to one of the existing speaker profiles by selecting the profile with the derived adaptation transform that maximises the auxiliary function Q given the statistics derived in step S103. This is similar to the procedure used in some implementations of Vocal Tract Length Normalisation (VTLN) which select a warping transform from among pre-computed sets to maximise the auxiliary function Q. The speaker ID is allocated in step S113. This will either be a new speaker or one of the existing speakers whose profile has been created in a database in step S109.
In step S115, if the speaker is recognised it is then determined whether or not the profile needs to be updated. In one embodiment, speed or computational resources may be considered when deciding whether or not to update a profile.
If it is determined that the speaker profile should be updated, the speaker profile is updated in step S117 which updates the relevant profile in the speaker profile database 109. The updating is done by adding the current segment statistics to those of the best profile and re-estimating the transform.
In step S119, a second pass transcription is performed. The second pass transcription decodes the data using the updated speaker profile and the speaker independent HMMs 105. In one embodiment, a higher order language model is used in the first pass, and only the lattice is rescored in this second pass in step S119.
The system then determines if it has processed the last segment. If it has not, then it returns to step S101 and the next segment is analysed in the same way. The speaker profiles are being continually updated and are used both for recognising the speaker and performing the final transcription. By sharing information between the speaker ID and speech recognition stages, the speaker identification and the speech recognition stages leverage off one another. Word level information and confidence scores can be used in the speaker ID stage and efficiency savings exist in the speech recognition stages. As the relevant statistics have already been accumulated in the speaker clustering stage, there is no need to compute them again for adaptation. The profiles can be stored for future sessions.
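The segment-by-segment loop of figure 3 can be summarised in the following sketch. It shows only the control flow: the decoding, scoring and profile update steps are passed in as callables, and every name here (`process_stream`, `first_pass`, `score`, `generic_score`, `update`, `second_pass`, the `speaker_N` labels) is a placeholder invented for the example rather than part of the described system.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple


def process_stream(segments: Iterable[Any],
                   first_pass: Callable[[Any], Any],
                   score: Callable[[Any, Any], float],
                   generic_score: Callable[[Any], float],
                   update: Callable[[Optional[Any], Any], Any],
                   second_pass: Callable[[Any, Any], str]
                   ) -> Tuple[List[Tuple[str, str]], Dict[str, Any]]:
    """Process segments in order, returning (speaker, text) pairs and the profiles."""
    profiles: Dict[str, Any] = {}
    transcript: List[Tuple[str, str]] = []
    for segment in segments:
        stats = first_pass(segment)                               # S103
        scores = {sid: score(p, stats) for sid, p in profiles.items()}
        best = max(scores, key=scores.get) if scores else None
        if best is None or generic_score(stats) > scores[best]:   # S107
            best = f"speaker_{len(profiles) + 1}"                 # S111
            profiles[best] = update(None, stats)
        else:
            profiles[best] = update(profiles[best], stats)        # S117
        transcript.append((best, second_pass(segment, profiles[best])))  # S119
    return transcript, profiles
```

The same `stats` object is used for scoring the profiles and for updating the chosen profile, which mirrors the sharing of statistics between the speaker ID and recognition stages.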
Therefore, the above embodiment of integrated diarisation and automatic speech recognition leads to fast online recognition of continuous audio streams, with efficient use of memory as data and data structures can be shared between speaker identification and speech recognition subtasks. Further, in the above embodiment the use of the auxiliary function to approximate likelihood allows for very efficient speaker identification as segment statistics need only be computed once for the speaker independent or canonical HMMs rather than multiple times for each existing speaker profile.
The use of shared statistics allows confidence scores to be easily integrated into both the speaker ID and recognition stages.
In the above embodiment, the updated speaker profile is used to derive an adaptation transform for the current segment. This adaptation transform is then used to adapt the HMMs, re-decode the current segment and improve the hypothesis. This process can optionally be repeated.
However, in a further embodiment, it is assumed that the speaker doesn't change between segments. The adaptation transform derived from the speaker profile for the previous segment is used to obtain the segment hypothesis and/or statistics, before proceeding with the speaker identification step. If the speaker has changed, then the segment can be re-recognised using the transform derived from the correct speaker profile.
In the embodiment described with reference to figure 3, a new speaker profile is created when a new speaker is seen. In practice, a speaker profile could represent a cluster of speakers where their data is pooled. In a further embodiment, multiple speaker profiles represent a single speaker in, for example, different environmental conditions. In these cases, the creation of a new speaker profile and the speaker identification stage can be modified to reflect these definitions.
In an embodiment, speaker profiles (i.e. statistics and transforms) can be cached over multiple sessions.
Any form of transform can be used, such as transforms estimated with a prior, discriminatively estimated transforms, or transforms from noise robustness techniques. Methods and systems in accordance with the above embodiment have used adaptation based on adaptive statistics. However, it is also possible to use prior statistics, where the PCMLLR transform parameters are obtained by minimising the KL divergence between a CMLLR adapted distribution and a target distribution. This is a powerful technique when the target distribution is complex, e.g. full covariance, and the PCMLLR transform provides a computationally efficient approximation.
Incorporation of prior statistics has been explained above with reference to equations (17) through to (22). This will allow robust transforms to be obtained with little data, but as the profile is updated with more data, the CMLLR statistics will begin to dominate.
In one embodiment, this form of prior statistics can be used to initialise new speaker profiles. In another embodiment, the structure of the adaptation transform derived from a speaker profile could depend on the amount of data available. For example, if a small number of frames are used then a diagonal transform might be appropriate. If a large number of frames are assigned to a speaker profile, then a full transform is expected to give larger performance gains.
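Choosing the transform structure from the amount of data in a profile can be sketched as a simple rule. The function name and, in particular, the default threshold of 1000 frames are hypothetical; the description gives no specific value.

```python
def choose_transform_structure(num_frames: float, full_threshold: float = 1000.0) -> str:
    """Pick a transform structure from the number of frames in a speaker profile.

    `full_threshold` is a hypothetical tuning parameter: below it a diagonal
    transform is used, at or above it a full transform.
    """
    return "full" if num_frames >= full_threshold else "diagonal"
```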
A method in accordance with the above embodiments uses the same accumulated statistics for speaker clustering and for adaptation, making it efficient to implement and easy to integrate with speaker adaptation. Furthermore, selecting the most likely speaker profile for each segment maintains a direct link between speaker clustering and the likelihood criterion.
There are several possible modifications to this scheme. In one implementation, it is assumed that the speaker doesn't change between segments and so statistics for the current segment are accumulated using the previous best speaker transform. Then, at the speaker clustering stage, it is decided whether the speaker has changed or not.
An example of a method in accordance with an embodiment of the present invention will now be described.
The AMI dev set consists of audio data from two sets of four meetings (ES09a-d and IS09a-d). Each meeting has four speakers and the same speakers appear in each set of meetings. The first meeting of each set (ES09a and IS09a) was used to train a CMLLR transform for each speaker. The resulting CMLLR transforms were then used to identify the current speaker for all segments in all meetings. For each segment, the auxiliary function Q(W) was used to select the best candidate from the set of well-trained transforms. The selected transform was then used in lattice rescoring to improve the baseline hypothesis. This proposed approach was compared to using the correct speaker transform identified using the oracle segment speaker labels.
Using the proposed speaker ID approach, roughly 18% of the segments were assigned to the wrong speaker. However, most of these segments were short, and so corresponded to only 2.7% of speech frames being wrongly labelled. The subsequent word error rates obtained using these transforms are shown in the table below.
System                                             WER (%)
Baseline (no transform)                            29.9
Proposed approach (selecting most likely CMLLR)    29.4
Oracle speaker CMLLR                               29.1

While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the methods and systems described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.

Claims (14)

  1. A method for identifying a plurality of speakers in audio data and for decoding the speech spoken by said speakers; the method comprising: receiving speech; dividing the speech into segments as it is received; processing the received speech segment by segment in the order received to identify the speaker and to decode the speech, processing comprising: performing primary decoding of the segment using an acoustic model and a language model; obtaining segment parameters indicating the differences between the speaker of the segment and a base speaker during the primary decoding; comparing the segment parameters with a plurality of stored speaker profiles to determine the identity of the speaker, and selecting a speaker profile for said speaker; updating the selected speaker profile; performing a further decoding of the segment using a speaker independent acoustic model, adapted using the updated speaker profile; outputting the decoded speech for the identified speaker, wherein the speaker profiles are updated as further segments of speech relating to a speaker profile are processed.
  2. A method according to claim 1, wherein if the segment parameters do not match closely enough with a stored speaker profile, a new speaker profile is initialised based on the parameters obtained for the segment.
  3. A method according to claim 2, wherein a new speaker profile is initialised if the likelihood or value of auxiliary function is greater for a unity transform or generic transform than for one of the stored speaker profiles.
  4. A method according to claim 2, wherein a new speaker profile is initialised if the likelihood or value of auxiliary function is less than a predetermined threshold.
  5. A method according to claim 1, wherein obtaining segment parameters comprises obtaining the parameters which allow a speaker transform to be estimated, said speaker transform adapting the speech of the new speaker to that of the independent speaker of the acoustic model.
  6. A method according to claim 5, wherein the speaker transform is an MLLR or CMLLR transform.
  7. A method according to claim 5, wherein the speaker profile comprises both adaptive and prior statistics.
  8. A method according to claim 1, wherein the primary decoding uses a language model.
  9. A method according to claim 1, wherein secondary decoding comprises rescoring a lattice of possible text corresponding to the segment.
  10. A method according to claim 1, wherein dividing the input speech into segments comprises detecting where there is silence in the input speech.
  11. A method according to claim 1, wherein the base speaker is a canonical speaker.
  12. A method according to claim 1, wherein the base speaker is the speaker of a previous segment.
  13. A carrier medium carrying computer readable instructions for controlling the computer to carry out the method of claim 1.
  14. A system for identifying a plurality of speakers in audio data and for decoding the speech spoken by said speakers; the system comprising: a receiver for audio containing speech; and a processor, said processor being adapted to: divide the speech into segments as it is received; process the received speech segment by segment in the order received to identify the speaker and to decode the speech, processing comprising: perform primary decoding of the segment using an acoustic model and a language model; obtain segment parameters indicating the differences between the speaker of the segment and a base speaker during the primary decoding; compare the segment parameters with a plurality of stored speaker profiles to determine the identity of the speaker, and selecting a speaker profile for said speaker; update the selected speaker profile; and perform a further decoding of the segment using a speaker independent acoustic model, adapted using the updated speaker profile, the system further comprising an output for outputting the decoded speech for the identified speaker, wherein the speaker profiles are updated as further segments of speech relating to a speaker profile are processed.
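By way of illustration only, the per-segment processing recited in claims 1, 2 and 4 can be sketched as below. This is a non-limiting outline under stated assumptions: `process_segment` and the callables `primary_decode`, `match_score`, `update_profile` and `second_decode` are hypothetical placeholders for the decoding, comparison and adaptation steps, and the naming scheme for new profiles is invented for the example.

```python
def process_segment(segment, profiles, primary_decode, match_score,
                    update_profile, second_decode, new_profile_threshold):
    """Identify the speaker of one segment and decode it a second time.

    All callables are hypothetical stand-ins:
      primary_decode(segment)          -> (lattice, segment_parameters)
      match_score(parameters, profile) -> float (higher is a better match)
      update_profile(profile, params)  -> updated profile
      second_decode(lattice, profile)  -> decoded text
    `profiles` is a dict that is modified in place.
    """
    # Primary decoding also yields the segment parameters.
    lattice, params = primary_decode(segment)

    # Compare the segment parameters with every stored speaker profile.
    scores = {name: match_score(params, prof) for name, prof in profiles.items()}
    best = max(scores, key=scores.get) if scores else None

    if best is None or scores[best] < new_profile_threshold:
        # No profile matches closely enough: initialise a new one
        # from the parameters obtained for this segment.
        best = f"speaker{len(profiles) + 1}"
        profiles[best] = params
    else:
        # Update the selected profile with the new segment parameters.
        profiles[best] = update_profile(profiles[best], params)

    # Further decoding with the model adapted by the updated profile.
    return best, second_decode(lattice, profiles[best])


if __name__ == "__main__":
    store = {}
    toy = dict(
        primary_decode=lambda seg: (seg, sum(seg) / len(seg)),
        match_score=lambda params, prof: -abs(params - prof),
        update_profile=lambda prof, params: 0.5 * (prof + params),
        second_decode=lambda lattice, prof: f"{len(lattice)} frames",
        new_profile_threshold=-1.0,
    )
    for seg in ([0.0, 0.2], [0.1, 0.3], [5.0, 5.2]):
        print(process_segment(seg, store, **toy))
```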
GB1105415.2A 2011-03-30 2011-03-30 A speech processing system and method Expired - Fee Related GB2489489B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB1105415.2A GB2489489B (en) 2011-03-30 2011-03-30 A speech processing system and method
US13/215,711 US8612224B2 (en) 2011-03-30 2011-08-23 Speech processing system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1105415.2A GB2489489B (en) 2011-03-30 2011-03-30 A speech processing system and method

Publications (3)

Publication Number Publication Date
GB201105415D0 GB201105415D0 (en) 2011-05-11
GB2489489A true GB2489489A (en) 2012-10-03
GB2489489B GB2489489B (en) 2013-08-21

Family

ID=44067678

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1105415.2A Expired - Fee Related GB2489489B (en) 2011-03-30 2011-03-30 A speech processing system and method

Country Status (2)

Country Link
US (1) US8612224B2 (en)
GB (1) GB2489489B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3107089A1 (en) * 2015-06-18 2016-12-21 Airbus Operations GmbH Speech recognition on board of an aircraft
US10468031B2 (en) 2017-11-21 2019-11-05 International Business Machines Corporation Diarization driven by meta-information identified in discussion content
EP3584786A4 (en) * 2017-02-15 2019-12-25 Tencent Technology (Shenzhen) Company Limited Voice recognition method, electronic device, and computer storage medium
US11120802B2 (en) 2017-11-21 2021-09-14 International Business Machines Corporation Diarization driven by the ASR based segmentation
US11557288B2 (en) 2020-04-10 2023-01-17 International Business Machines Corporation Hindrance speech portion detection using time stamps

Families Citing this family (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9009040B2 (en) * 2010-05-05 2015-04-14 Cisco Technology, Inc. Training a transcription system
US8706499B2 (en) * 2011-08-16 2014-04-22 Facebook, Inc. Periodic ambient waveform analysis for enhanced social functions
US20130144414A1 (en) * 2011-12-06 2013-06-06 Cisco Technology, Inc. Method and apparatus for discovering and labeling speakers in a large and growing collection of videos with minimal user effort
US8515750B1 (en) * 2012-06-05 2013-08-20 Google Inc. Realtime acoustic adaptation using stability measures
US8744995B1 (en) 2012-07-30 2014-06-03 Google Inc. Alias disambiguation
US8583750B1 (en) 2012-08-10 2013-11-12 Google Inc. Inferring identity of intended communication recipient
US8520807B1 (en) 2012-08-10 2013-08-27 Google Inc. Phonetically unique communication identifiers
US8571865B1 (en) * 2012-08-10 2013-10-29 Google Inc. Inference-aided speaker recognition
US9401140B1 (en) * 2012-08-22 2016-07-26 Amazon Technologies, Inc. Unsupervised acoustic model training
US9058806B2 (en) * 2012-09-10 2015-06-16 Cisco Technology, Inc. Speaker segmentation and recognition based on list of speakers
US9530417B2 (en) * 2013-01-04 2016-12-27 Stmicroelectronics Asia Pacific Pte Ltd. Methods, systems, and circuits for text independent speaker recognition with automatic learning features
US9263030B2 (en) 2013-01-23 2016-02-16 Microsoft Technology Licensing, Llc Adaptive online feature normalization for speech recognition
US9368109B2 (en) * 2013-05-31 2016-06-14 Nuance Communications, Inc. Method and apparatus for automatic speaker-based speech clustering
US9953630B1 (en) * 2013-05-31 2018-04-24 Amazon Technologies, Inc. Language recognition for device settings
KR20150081981A (en) * 2014-01-07 2015-07-15 삼성전자주식회사 Apparatus and Method for structuring contents of meeting
US9786297B2 (en) 2014-04-09 2017-10-10 Empire Technology Development Llc Identification by sound data
KR102225404B1 (en) * 2014-05-23 2021-03-09 삼성전자주식회사 Method and Apparatus of Speech Recognition Using Device Information
RU2704746C2 (en) 2015-08-24 2019-10-30 ФОРД ГЛОУБАЛ ТЕКНОЛОДЖИЗ, ЭлЭлСи Dynamic acoustic model for vehicle
US9858923B2 (en) * 2015-09-24 2018-01-02 Intel Corporation Dynamic adaptation of language models and semantic tracking for automatic speech recognition
US10026405B2 (en) 2016-05-03 2018-07-17 SESTEK Ses velletisim Bilgisayar Tekn. San. Ve Tic A.S. Method for speaker diarization
US10614797B2 (en) * 2016-12-01 2020-04-07 International Business Machines Corporation Prefix methods for diarization in streaming mode
US10431225B2 (en) 2017-03-31 2019-10-01 International Business Machines Corporation Speaker identification assisted by categorical cues
US10637898B2 (en) * 2017-05-24 2020-04-28 AffectLayer, Inc. Automatic speaker identification in calls
US11417343B2 (en) * 2017-05-24 2022-08-16 Zoominfo Converse Llc Automatic speaker identification in calls using multiple speaker-identification parameters
US20190051375A1 (en) 2017-08-10 2019-02-14 Nuance Communications, Inc. Automated clinical documentation system and method
US11316865B2 (en) 2017-08-10 2022-04-26 Nuance Communications, Inc. Ambient cooperative intelligence system and method
WO2019048062A1 (en) * 2017-09-11 2019-03-14 Telefonaktiebolaget Lm Ericsson (Publ) Voice-controlled management of user profiles
EP3682444A1 (en) 2017-09-11 2020-07-22 Telefonaktiebolaget LM Ericsson (PUBL) Voice-controlled management of user profiles
WO2019077013A1 (en) 2017-10-18 2019-04-25 Soapbox Labs Ltd. Methods and systems for processing audio signals containing speech data
US11250382B2 (en) 2018-03-05 2022-02-15 Nuance Communications, Inc. Automated clinical documentation system and method
WO2019173333A1 (en) 2018-03-05 2019-09-12 Nuance Communications, Inc. Automated clinical documentation system and method
WO2019173349A1 (en) 2018-03-05 2019-09-12 Nuance Communications, Inc. System and method for review of automated clinical documentation
US11227679B2 (en) 2019-06-14 2022-01-18 Nuance Communications, Inc. Ambient clinical intelligence system and method
US11043207B2 (en) 2019-06-14 2021-06-22 Nuance Communications, Inc. System and method for array data simulation and customized acoustic modeling for ambient ASR
US11216480B2 (en) 2019-06-14 2022-01-04 Nuance Communications, Inc. System and method for querying data points from graph data structures
US11531807B2 (en) 2019-06-28 2022-12-20 Nuance Communications, Inc. System and method for customized text macros
CN110610720B (en) * 2019-09-19 2022-02-25 北京搜狗科技发展有限公司 Data processing method and device and data processing device
US11670408B2 (en) 2019-09-30 2023-06-06 Nuance Communications, Inc. System and method for review of automated clinical documentation
US11222103B1 (en) 2020-10-29 2022-01-11 Nuance Communications, Inc. Ambient cooperative intelligence system and method
US11626104B2 (en) * 2020-12-08 2023-04-11 Qualcomm Incorporated User speech profile management

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6424946B1 (en) * 1999-04-09 2002-07-23 International Business Machines Corporation Methods and apparatus for unknown speaker labeling using concurrent speech recognition, segmentation, classification and clustering
US6748356B1 (en) * 2000-06-07 2004-06-08 International Business Machines Corporation Methods and apparatus for identifying unknown speakers using a hierarchical tree structure
WO2006126216A1 (en) * 2005-05-24 2006-11-30 Loquendo S.P.A. Automatic text-independent, language-independent speaker voice-print creation and speaker recognition
US20090319269A1 (en) * 2008-06-24 2009-12-24 Hagai Aronowitz Method of Trainable Speaker Diarization
US20100179811A1 (en) * 2009-01-13 2010-07-15 Crim Identifying keyword occurrences in audio data
US20110119060A1 (en) * 2009-11-15 2011-05-19 International Business Machines Corporation Method and system for speaker diarization

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4780906A (en) * 1984-02-17 1988-10-25 Texas Instruments Incorporated Speaker-independent word recognition method and system based upon zero-crossing rate and energy measurement of analog speech signal
US5687287A (en) * 1995-05-22 1997-11-11 Lucent Technologies Inc. Speaker verification method and apparatus using mixture decomposition discrimination
US5963903A (en) * 1996-06-28 1999-10-05 Microsoft Corporation Method and system for dynamically adjusted training for speech recognition
US5865626A (en) * 1996-08-30 1999-02-02 Gte Internetworking Incorporated Multi-dialect speech recognition method and apparatus
US6263308B1 (en) * 2000-03-20 2001-07-17 Microsoft Corporation Methods and apparatus for performing speech recognition using acoustic models which are improved through an interactive process
US6629073B1 (en) * 2000-04-27 2003-09-30 Microsoft Corporation Speech recognition method and apparatus utilizing multi-unit models
US7668718B2 (en) * 2001-07-17 2010-02-23 Custom Speech Usa, Inc. Synchronized pattern recognition source data processed by manual or automatic means for creation of shared speaker-dependent speech user profile
KR20030070179A (en) * 2002-02-21 2003-08-29 엘지전자 주식회사 Method of the audio stream segmantation
US7295970B1 (en) * 2002-08-29 2007-11-13 At&T Corp Unsupervised speaker segmentation of multi-speaker speech data
US20040138894A1 (en) * 2002-10-17 2004-07-15 Daniel Kiecza Speech transcription tool for efficient speech transcription
US7676363B2 (en) * 2006-06-29 2010-03-09 General Motors Llc Automated speech recognition using normalized in-vehicle speech
TWI342010B (en) * 2006-12-13 2011-05-11 Delta Electronics Inc Speech recognition method and system with intelligent classification and adjustment
US7881930B2 (en) * 2007-06-25 2011-02-01 Nuance Communications, Inc. ASR-aided transcription with segmented feedback training
US9058818B2 (en) * 2009-10-22 2015-06-16 Broadcom Corporation User attribute derivation and update for network/peer assisted speech coding
US8639508B2 (en) * 2011-02-14 2014-01-28 General Motors Llc User-specific confidence thresholds for speech recognition
US8685548B2 (en) * 2011-03-31 2014-04-01 Seagate Technology Llc Lubricant compositions
US9053750B2 (en) * 2011-06-17 2015-06-09 At&T Intellectual Property I, L.P. Speaker association with a visual representation of spoken content

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3107089A1 (en) * 2015-06-18 2016-12-21 Airbus Operations GmbH Speech recognition on board of an aircraft
US10056085B2 (en) 2015-06-18 2018-08-21 Airbus Operations Gmbh Speech recognition on board of an aircraft
EP3584786A4 (en) * 2017-02-15 2019-12-25 Tencent Technology (Shenzhen) Company Limited Voice recognition method, electronic device, and computer storage medium
US10468031B2 (en) 2017-11-21 2019-11-05 International Business Machines Corporation Diarization driven by meta-information identified in discussion content
US11120802B2 (en) 2017-11-21 2021-09-14 International Business Machines Corporation Diarization driven by the ASR based segmentation
US11557288B2 (en) 2020-04-10 2023-01-17 International Business Machines Corporation Hindrance speech portion detection using time stamps

Also Published As

Publication number Publication date
US20120253811A1 (en) 2012-10-04
US8612224B2 (en) 2013-12-17
GB201105415D0 (en) 2011-05-11
GB2489489B (en) 2013-08-21

Similar Documents

Publication Publication Date Title
US8612224B2 (en) Speech processing system and method
JP6350148B2 (en) SPEAKER INDEXING DEVICE, SPEAKER INDEXING METHOD, AND SPEAKER INDEXING COMPUTER PROGRAM
US8249867B2 (en) Microphone array based speech recognition system and target speech extracting method of the system
Hain et al. New features in the CU-HTK system for transcription of conversational telephone speech
KR100612840B1 (en) Speaker clustering method and speaker adaptation method based on model transformation, and apparatus using the same
CN111816165A (en) Voice recognition method and device and electronic equipment
KR101237799B1 (en) Improving the robustness to environmental changes of a context dependent speech recognizer
WO2014025682A2 (en) Method and system for acoustic data selection for training the parameters of an acoustic model
CN106847259B (en) Method for screening and optimizing audio keyword template
JP5149107B2 (en) Sound processing apparatus and program
US9037463B2 (en) Efficient exploitation of model complementariness by low confidence re-scoring in automatic speech recognition
He et al. Target-speaker voice activity detection with improved i-vector estimation for unknown number of speaker
WO2008137616A1 (en) Multi-class constrained maximum likelihood linear regression
US20030144837A1 (en) Collaboration of multiple automatic speech recognition (ASR) systems
KR101618512B1 (en) Gaussian mixture model based speaker recognition system and the selection method of additional training utterance
JP5180928B2 (en) Speech recognition apparatus and mask generation method for speech recognition apparatus
US11929058B2 (en) Systems and methods for adapting human speaker embeddings in speech synthesis
KR101023211B1 (en) Microphone array based speech recognition system and target speech extraction method of the system
US8768695B2 (en) Channel normalization using recognition feedback
Michel et al. Frame-level MMI as a sequence discriminative training criterion for LVCSR
GB2546325A (en) Speaker-adaptive speech recognition
Audhkhasi et al. Empirical link between hypothesis diversity and fusion performance in an ensemble of automatic speech recognition systems.
Takaki et al. Unsupervised speaker adaptation for DNN-based speech synthesis using input codes
Vogt et al. Bayes factor scoring of GMMs for speaker verification
Tachibana et al. Frame-level AnyBoost for LVCSR with the MMI criterion

Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee

Effective date: 20230330