GB2564478A - Speech processing systems - Google Patents

Speech processing systems

Info

Publication number
GB2564478A
GB2564478A GB1711344.0A GB201711344A
Authority
GB
United Kingdom
Prior art keywords
production
target
speech
speech recognition
received
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1711344.0A
Other versions
GB201711344D0 (en)
Inventor
Cunningham Stuart
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Sheffield
Original Assignee
University of Sheffield
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Sheffield filed Critical University of Sheffield
Priority to GB1711344.0A priority Critical patent/GB2564478A/en
Publication of GB201711344D0 publication Critical patent/GB201711344D0/en
Publication of GB2564478A publication Critical patent/GB2564478A/en
Withdrawn legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/08: Speech classification or search
    • G10L15/10: Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/66: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition

Abstract

An Automatic Speech Recognition (ASR) system 102 for speech disorders such as dysarthria (i.e. inconsistent articulation, which confounds trained ASR models such as Hidden Markov models) receives audible utterances (productions) at a Speech Engine 104, characterises them according to acoustic features (LPC, MFCC etc.) and correlates them 116 with both target (i.e. desired, 202, e.g. sell) and off-target (i.e. undesired, 204, e.g. shell) speech templates 112 from a database 110, with the correlation score being displayed at a graphical interface 120 (e.g. thumbs up for correct pronunciation, fig. 4). The target likelihood (SR) for an utterance U may be based on a scoring window (fig. 6) of width ε having left and right edges bounding penalising zones.

Description

SPEECH PROCESSING SYSTEMS
BACKGROUND
[0001] Electronic Assistive Technology (EAT) is increasingly used for interfacing with technology. EAT has taken a number of paths, presenting various user interfaces (UI). One such user interface uses automatic speech recognition (ASR).
[0002] Dysarthria is a family of neurologically-based speech disorders characterized by loss of control of the speech articulators. Speech produced by dysarthric speakers can be very difficult to understand for listeners unfamiliar with the speaker. Providing a generic ASR-based UI is challenging in general. However, providing such an ASR UI for dysarthric speakers can be even more challenging due to the variability of their articulatory output.
[0003] ASR technology uses a trained recognition model such as, for example, a trained Hidden Markov Model (HMM). However, the performance of such a trained model depends on consistently articulated speech data. Similarly, the performance of an ASR UI also depends on consistently articulated speech commands. Developing a corpus of such consistently articulated speech data is a challenge, at least in part due to the variability of the articulatory output of dysarthric speakers.
BRIEF INTRODUCTION OF THE DRAWINGS
[0004] Figure 1 shows a speech recognition system according to an example,
[0005] Figure 2 illustrates a template database according to an example,
[0006] Figure 3 shows target and off-target templates according to an example,
[0007] Figure 4 shows a graphical output according to an example,
[0008] Figure 5 depicts a first flow chart according to an example,
[0009] Figure 6 shows a scoring function according to an example,
[0010] Figure 7 depicts a second flow chart according to an example, and
[0011] Figure 8 shows a processor and machine executable instructions or circuitry according to an example.
DETAILED DESCRIPTION
[0012] Figure 1 shows a view 100 of a speech recognition system 102. The speech recognition system 102 comprises a speech engine 104. The speech engine 104 processes speech data received from an audio input device such as, for example, a microphone 106, via an audio interface 108. The speech data is synonymously referred to as an utterance or a production. The speech processed by the speech engine 104 is digitized speech. The digitized speech can be received by the speech engine 104 in digital form. Alternatively, or additionally, the audio interface 108 can digitize received analogue speech.
[0013] The speech recognition system 102 can additionally comprise, or have access to, a template database 110. The template database 110 can store one or more than one speech template 112. The speech template 112 can comprise a speech model (not shown) associated with a respective utterance. The speech models can comprise, for example, generative stochastic models. The speech model can comprise data representing or otherwise associated with the utterance. For example, the data representing, or otherwise associated with, one or more than one utterance can comprise Linear Predictive Coding (LPC), Perceptual Linear Prediction (PLP), Mel Frequency Cepstral Coefficients (MFCC), PLP Relative Spectra (PLP-RASTA) and the like.
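For illustration only, the following is a minimal Python sketch of how a record in the template database 110 might be organised; the class name, its fields and the use of NumPy arrays are assumptions of this sketch, not features recited in the description.

```python
from dataclasses import dataclass, field
from typing import List

import numpy as np


@dataclass
class SpeechTemplate:
    """Hypothetical record for one entry of the template database 110."""
    word: str                         # the utterance the template represents, e.g. "sell"
    feature_type: str = "MFCC"        # LPC, PLP, MFCC, PLP-RASTA, ...
    exemplars: List[np.ndarray] = field(default_factory=list)  # (frames x dims) arrays

    def add_exemplar(self, features: np.ndarray) -> None:
        self.exemplars.append(np.asarray(features))


# Example: a template for the word "sell" holding two parameterised productions,
# each a sequence of 12-dimensional feature vectors at a 10 ms frame rate.
template = SpeechTemplate(word="sell")
template.add_exemplar(np.random.randn(80, 12))
template.add_exemplar(np.random.randn(95, 12))
print(template.word, len(template.exemplars), "exemplars")
```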
[0014] Although the view 100 in figure 1 shows a single template 112, example implementations can use one or more than one template. Example implementations can use a number of templates. The templates can be associated with different respective utterances. Furthermore, the utterances can be linked. An example of linked utterances, or an example of a set of one or more than one utterance, can be utterances associated with data representing or associated with linked commands such as, for example, “TV”, “Channel”, “Up” and “Down”.
[0015] Prior to receiving a speech production, the template database 110 is accessed to retrieve data associated with a target, or desired, production to be practised. A word representing the speech to be produced or practised can be output 113 via a UI such as, for example, a graphical user interface (GUI) 120. Such a GUI can comprise a screen. The template or templates 112 are associated with predetermined words that have a predetermined characteristic. The predetermined characteristic can be that the template or templates are associated with one or more than one sound that the speaker can produce, but only inconsistently, that is, an utterance that needs practising. Therefore, the desired utterance can be selected to be something that the speaker is known to be able to produce, but which the speaker may have difficulty articulating well or articulating consistently.
[0016] In response to receiving speech data via the audio interface 108, the speech engine 104 can process that speech data to at least one of recognize and categorise any utterance associated with the speech. Various techniques can be used for such classification and recognition such as, for example, Hidden Markov Models (HMM), Vector Quantisation (VQ), Dynamic Time Warping (DTW), Support Vector Machine (SVM) and the like. The speech engine 104 produces an output 114. Optionally, the output 114 is used to index the template database 110 to access an utterance template 112 associated with received speech data. Alternatively, the utterance template 112 is selected in advance of receiving an utterance.
[0017] The accessed template 112 is output to a correlator 116. The correlator 116 is arranged to form a comparison between the output 114 and at least the accessed template 112 with a view to producing visual output 118 for display on the GUI 120. The GUI 120 can be a screen (not shown). The screen can be, for example, a screen of an iPad or other tablet device, a laptop, a mobile telephone, a computer or other display device.
[0018] The visual output 118 can be a score that represents the closeness of a processed speech production to a target defined by the template 112. The score is intended to give a speaker feedback on their speech production that is as close as possible to the judgment of an expert listener such as, for example, a speech and language expert. The aim of the practice is to help the speaker train their articulation so that they can produce a target articulation with at least one of greater accuracy or greater consistency.
[0019] Advantageously, example implementations have a potential benefit over an expert human listener in that the scoring should, over time, be more consistent than a human listener. This follows because humans can quickly adapt to accommodate the speech productions of any person, including those of a dysarthric speaker.
[0020] As a dysarthric speaker improves their articulation of a word, the database can be adapted. For example, an utterance that is no longer a challenge can be removed from the database or offered for practice less frequently as compared to other words that are perceived as being a greater challenge.
[0021] For example, suppose a speaker is attempting to make the sound at the beginning of the word “sell”, but because their tongue is insufficiently far forward in their mouth, they generate a sound that is more similar to the sound one would make when uttering the word “shell”. Therefore, in the assumed situation, when trying to make the sounds in the word “sell” the speaker may sometimes produce them correctly, relative to their template, and other times they may produce the word “shell”.
[0022] Therefore, to generate feedback to the speaker, example implementations have two sets of templates, which correspond to a desired template or target and an undesirable template or target.
[0023] Referring to figure 2, there is shown, in greater detail, a view 200 of a further example of the template database 110. The template database 110 comprises a target or desired template 202 and one or more than one undesirable template 204. In the illustrated example, a single desired template 202 is shown together with a single undesirable template 204. However, example implementations are not limited to such arrangements. Example implementations can be realised in which a number of target or desired templates are used in conjunction with one or more than one respective undesirable template. Example implementations can be realised in which one or more than one target or desired template has one or more than one associated undesirable template. For example, a further desired or target template 206 is shown in association with a number of undesirable templates 208 and 210.
[0024] The desired templates 202 and 206 can form a set of desired templates 212. The undesirable templates 204, 208 and 210 can form a set of undesirable templates 214. A desired or target template and one or more than one associated undesirable template can form a group of associated templates. It can be appreciated that two such groups 216 and 218 are shown in figure 2.
[0025] The groups of associated templates are produced by the speaker or an expert in conjunction with, or by processing the utterances of, a speaker. Desirable productions can be categorized using a first categorisation. Undesirable utterances can be categorised using a second categorisation. For example, the first and second categorisations can be as simple as “good” or “bad”. Example implementations can be realised that use other categorisations that are more sophisticated or more graduated. For instance, an acceptable categorisation may include utterances that exhibit varying degrees of desirability. For example, a scale of desirability of 1 to 5 can be used to categorise acceptable utterances to form part of a desired or target template or set 212. Conversely, an unacceptable categorisation may include utterances that exhibit varying degrees of undesirability. For example, a scale of undesirability of 15 to 20 can be used to categorise unacceptable utterances to form part of an undesirable template or set 214. Although example implementations have been described with reference to the above ranges of 1 to 5 and 15 to 20, examples can alternatively or additionally be realised in which other ranges are used. Furthermore, each template may have a respective range or a common range.
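As a concrete, non-authoritative sketch of the categorisation just described, the following Python snippet groups a handful of invented recordings into the two sets using the 1 to 5 and 15 to 20 scales mentioned above; the recording identifiers and ratings are illustrative only.

```python
# Group a speaker's recordings into target and off-target sets using the graded
# categorisations described above (1-5 acceptable, 15-20 unacceptable).
recordings = [
    # (recording id, expert rating) -- invented values for the example
    ("sell_take01", 2), ("sell_take02", 5), ("sell_take03", 17),
    ("sell_take04", 1), ("sell_take05", 19),
]

ACCEPTABLE = range(1, 6)       # desirability scale 1 to 5   -> desired set 212
UNACCEPTABLE = range(15, 21)   # undesirability scale 15 to 20 -> undesirable set 214

target_set = [rid for rid, rating in recordings if rating in ACCEPTABLE]
off_target_set = [rid for rid, rating in recordings if rating in UNACCEPTABLE]

print("target set:", target_set)
print("off-target set:", off_target_set)
```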
[0026] Continuing with the above example, working with an expert, or other listener, a speaker can make recordings of their speech. Assuming they are trying to produce their clearest examples of the sounds in the word “sell”, during the recording sessions, the expert can provide feedback that helps them to produce the desired target production, especially if they produce sounds that are closer to “shell”. For each of the recordings, the expert will make a judgment and categorise the productions as either good or bad. Once a sufficient number of productions have been recorded, the expert can train speech engines for the productions captured.
[0027] Therefore, the recorded speech and associated categorisations are used to train two or more than two recognition models or speech engines.
[0028] Figure 3 shows a view 300 of a number of speech engines 302 and 304 that have been trained using the above productions. The first engine 302 has been trained using productions associated with one or more than one utterance of the good, or desirable, utterances 212. The second engine 304 has been trained using productions associated with one or more than one utterance of the undesirable utterances 214.
[0029] The one or more than one engine or model 302 and 304 can be used in at least one of two ways, which are, firstly, that the engines 302 and 304 can be used to discriminate between desirable and undesirable productions, otherwise also known as target or off-target productions respectively, and, secondly, to provide an indication, or score, regarding how close a production is to a target production.
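The description does not tie the two engines to any particular toolkit. Purely as a sketch, the following trains two Gaussian-emission HMMs (using the third-party hmmlearn library, an assumption of this example) on dummy feature sequences standing in for the good and bad productions, and then uses their likelihoods to discriminate a new production.

```python
import numpy as np
from hmmlearn import hmm  # third-party library chosen for this sketch


def train_word_model(feature_seqs, n_states=5):
    """Train one generative model on the (frames x dims) sequences of one category."""
    X = np.vstack(feature_seqs)
    lengths = [len(seq) for seq in feature_seqs]
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=25)
    model.fit(X, lengths)
    return model


# Dummy 12-dimensional feature sequences standing in for recorded productions.
rng = np.random.default_rng(0)
good_productions = [rng.standard_normal((rng.integers(60, 100), 12)) for _ in range(8)]
bad_productions = [rng.standard_normal((rng.integers(60, 100), 12)) for _ in range(8)]

target_model = train_word_model(good_productions)      # engine 302
off_target_model = train_word_model(bad_productions)   # engine 304

# First use: discriminate between target and off-target for a new production.
new_production = rng.standard_normal((75, 12))
closer_to_target = target_model.score(new_production) > off_target_model.score(new_production)
print("closer to target model:", closer_to_target)
```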
[0030] Continuing with the above example, suppose a speaker is practising their productions. Assume that, in a first attempt, they produce the word “sell” well relative to their best, but imperfect, template for “sell”. The GUI can then output a visual indication to that effect, as shown in the view 400 of figure 4: a graphical thumbs up 402, which provides an indication that the corresponding production was “good”.
[0031] Also shown in figure 4 are a number of controls. A first control 404 is actuated by the speaker to step through possible utterances, which allows a speaker to select an utterance from the template database to be practiced. Once an utterance of interest has been located, it is selected using a second actuator or button 406, which prompts the speaker to speak the word displayed (not shown) for assessment. The system captures and processes their production and outputs the thumbs up graphic 402 or a thumbs down graphic (not shown) according to the production being good or bad respectively. A still further control button 408 is used to repeat a practice session in response to which the system repeats capturing and processing the speaker’s production of the word corresponding to that session.
[0032] Example implementations calculate a score that is used as the basis for categorising a production. It will be appreciated that a speech recognition model for a word captures a variety of different ways a speaker will articulate that word. Even though not all of those differences are discernible to a human ear, every time a word is articulated it will, nevertheless, have small acoustic differences.
[0033] Referring to figure 5, there is shown a view 500 of a flowchart 502 of an example implementation for training a speech model and for processing the training data.
[0034] At step 504, a body of productions or recordings, i, relating to a word, U, is accessed or made available. At step 506, the body of productions is categorised into acceptable productions and unacceptable productions, which are otherwise known as target, or desirable, productions and off-target, or undesirable, productions, associated with the word U. A target, or desirable, speech model is produced for the acceptable productions at step 508. An off-target, or undesirable, speech model is produced for the unacceptable productions at step 510. The speech models can be generative stochastic models such as, for example, HMMs, VQ, DTW, SVM and the like.
[0035] At step 512, the productions in the training set can be subjected to forced alignment, which aligns the speech with corresponding text. In the example implementation, the corresponding text is the word U. Each of the recordings, i, in the set of recordings used to train the target, or desirable, model is passed to the speech engine to be recognised. The recognition process can be used to produce a likelihood of the model generating (or accepting) the recording. The likelihood derived from the recognition process for each recording, i, is stored together with the alignment of which frames of the input were recognised as being part of the word, U.
[0036] For each production of a set of productions, i, for a corresponding word, U, forced alignment is used to generate a likelihood and a frame alignment, from which a per-frame alignment likelihood can be derived for that utterance as follows: [0037] S(Ui) = P(Ui) / fi, where P(Ui) is the likelihood obtained by forced alignment of the ith item in the training set of productions and fi is the number of frames consumed by the ith item in the training set of productions.
[0038] Within a training set, an utterance is selected that is the closest to the typical way that the speaker pronounces the word U. The selection can be realised in a number of ways. A first way to select the utterance that is closest to the average way the speaker pronounces the word U comprises selecting the utterance with the highest likelihood, that is, max(S(Ui)). A second way to select the utterance that is closest to the average way the speaker pronounces the word U comprises selecting the utterance with the median likelihood, that is, med(S(Ui)). Either method can be used to define a quantity SR, which is known as the target likelihood, as follows: [0039] SR = max(S(Ui)) ∀i or SR = med(S(Ui)) ∀i. The target likelihood can be used to define a reference point of a scoring range or scoring window. Example implementations can be realised in which the reference point is the middle of a scoring window. Although the example implementations can use either of the above methods, they are not limited to such arrangements. Example implementations can be realised in which an expert selects productions to form the basis of acceptable and/or unacceptable productions to be used in the target and off-target templates.
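Expressed as a short Python sketch, the per-frame alignment likelihood S(Ui) and the two ways of defining the target likelihood SR might be computed as follows; the numerical values are invented, and the use of log-likelihoods (as is usual in practice) is an assumption of this example.

```python
import statistics


def per_frame_score(likelihood, n_frames):
    """S(Ui) = P(Ui) / fi: forced-alignment likelihood divided by frames consumed."""
    return likelihood / n_frames


# (P(Ui), fi) pairs for the training recordings; values invented for illustration.
alignments = [(-410.0, 82), (-395.0, 78), (-520.0, 90), (-430.0, 85)]

scores = [per_frame_score(p, f) for p, f in alignments]

S_R_max = max(scores)                # option 1: the most likely production
S_R_med = statistics.median(scores)  # option 2: the median production
print(S_R_max, S_R_med)
```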
[0040] Optionally, a further quantity, ε, can be defined as the width of the scoring window. Referring to figure 6, there is shown a view 600 of the scoring window 602. The scoring window 602 has a width defined by ε, which gives left 604 and right 606 edges of the scoring window 602 centred around SR 608.
[0041] Also shown in figure 6 are two penalising zones 610 and 612. The penalising zones 610 and 612 are used to scale a production's score falling within those zones according to its distance from, or proximity to, the best scoring window 602. In the example implementation depicted in figure 6, it can be appreciated that a reciprocal function is used to express or define the profile of the penalising zones 610 and 612. Example implementations can be realised in which a value falling within the left penalising zone 610 is scaled according to a predetermined scaling function. An example of such a predetermined scaling function is 100*f(β), where 0 < f(β) < 1. Example implementations can be realised in which a value falling within the right penalising zone 612 is scaled according to a predetermined scaling function. An example of such a predetermined scaling function is 100*f(α), where 0 < f(α) < 1. An example implementation of the above is given in the pseudo-code below: [0042] [0043] [0044] It can be appreciated that the further away a value is from the scoring window, the more the score output is scaled according to the profile of the penalising zones 610 and 612.
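The pseudo-code of paragraphs [0042] and [0043] is not reproduced in the text above. Purely as a sketch consistent with the description, one possible scoring function with a reciprocal roll-off on either side of the window is shown below; the exact form of f(α) and f(β) is an assumption of this example.

```python
def window_score(s, s_ref, epsilon):
    """Score a per-frame likelihood s against the window of width epsilon
    centred on the target likelihood s_ref (figure 6). Returns a value in 0-100."""
    left_edge = s_ref - epsilon / 2.0
    right_edge = s_ref + epsilon / 2.0

    if left_edge <= s <= right_edge:
        return 100.0                              # inside the best scoring window

    if s < left_edge:                             # left penalising zone 610
        beta = left_edge - s                      # distance from the window
        return 100.0 / (1.0 + beta)               # reciprocal roll-off: 0 < f(beta) < 1
    alpha = s - right_edge                        # right penalising zone 612
    return 100.0 / (1.0 + alpha)                  # reciprocal roll-off: 0 < f(alpha) < 1


print(window_score(-5.0, s_ref=-5.1, epsilon=0.4))  # inside the window -> 100.0
print(window_score(-7.5, s_ref=-5.1, epsilon=0.4))  # far into the left zone -> heavily scaled
```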
[0045] Although the example implementations use the above reciprocal functions as compressive functions, example implementations can be realised that use other compressive functions. Other functions, compressive or otherwise, can be arranged to scale SR according to the distance SR is from the scoring window.
[0046] It can be appreciated that having a scoring window 602 with a width defined by ε assists in ensuring that new productions added to the training data that are close to the target are not penalised. Therefore, the smaller the scoring window 602, the more challenging the target, that is, the closer the production would have to be to a target, or desirable, speech model to be classed or scored as acceptable. Conversely, the wider the scoring window 602, the easier the target, that is, the production can be further from a target, or desirable, speech model but still be classified or scored as acceptable. Therefore, it can be appreciated that the value of ε can be set according to the degree of difficulty to be applied in assessing a production.
[0047] Example implementations can be realised in which each word U has a respective value of ε. Setting the value of ε can be done by an expert who will be able to make an initial or ongoing assessment of a speaker's performance relative to their productions. Alternatively, or additionally, the value of ε can be calculated or set automatically according to one or more than one criterion. For example, as articulation might improve with practice, a speaker's values of SR for an utterance or word U might start to progressively cluster or progressively converge towards the central value of SR 608, in which case a value of ε that narrows the scoring window 602 can be calculated. Conversely, the width of the scoring window 602 could be widened if the values of SR for an utterance or word U are determined to be progressively diverging or otherwise moving away from the central value of SR 608. For example, a statistical measure of variation can be used to assess the progress, or otherwise, of productions relative to at least one of the desired or undesired templates.
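As one illustrative way of automating the adjustment just described, the window width could be tied to a statistical measure of how far recent productions sit from the central value; the rule below, the scaling constant k and the floor min_eps are assumptions of this sketch, not part of the described implementation.

```python
def adapt_epsilon(recent_scores, s_ref, k=2.0, min_eps=0.05):
    """Narrow the window as scores cluster around the central value; widen it as they diverge."""
    mean_deviation = sum(abs(s - s_ref) for s in recent_scores) / len(recent_scores)
    return max(min_eps, k * mean_deviation)


print(adapt_epsilon([-5.00, -5.05, -5.10, -4.95], s_ref=-5.0))  # clustered -> narrower window
print(adapt_epsilon([-5.00, -6.20, -4.10, -7.00], s_ref=-5.0))  # diverging -> wider window
```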
[0048] For example, embodiments can be realised in which the similarity between two words can be estimated by calculating the similarity between their constituent phonemes. That similarity can be used to influence at least one of the width of the scoring window or the compression function. For any two words, using a standard pronunciation, it is possible to estimate how similar the words may appear to a listener by comparing each of the constituent sounds in each of the words in turn. Two words with many sounds in common would have a higher probability of being confused by a listener. Likewise, words with few sounds in common are much less likely to be confused. Such estimates could be informed by the probability that a listener will confuse two distinct speech sounds. Such probabilities can be, and have been, estimated from published data on the confusions made by listeners between different speech sounds. Dynamic programming could be used to determine the probability of confusion between words of different lengths. Hence, using such a method, it is possible to estimate the potential to confuse several words by making comparisons between all of the words to produce a matrix of confusions. From this matrix it would be possible to obtain the probability of confusion between any pair of words. Suitably, embodiments can be realised in which the off-target production represents a further articulation. For example, the further articulation can be selected to have a degree of correlation with a target articulation. Confusable words can be arranged to have respective scoring window profiles that vary with the confusability of the words. For example, two highly confusable words, one being the target and one being off-target, can be arranged to have a very narrow scoring window and steep roll-off compression or penalty functions. Conversely, less confusable words can have a relatively wider scoring window and a relatively shallower roll-off function or penalty window.
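The phoneme-level comparison described above might be sketched as a dynamic-programming alignment over a table of sound-confusion probabilities. In the sketch below, the tiny confusion table, the skip penalty and the phoneme spellings are illustrative assumptions rather than published listener-confusion data.

```python
# P(listener hears column phoneme | speaker intended row phoneme), partial table.
CONFUSION = {
    ("s", "s"): 0.90, ("s", "sh"): 0.45, ("sh", "sh"): 0.90, ("sh", "s"): 0.45,
    ("eh", "eh"): 0.95, ("l", "l"): 0.95,
}
SKIP = 0.05  # assumed probability assigned to inserting or deleting a phoneme


def confusability(word_a, word_b):
    """Return the probability of the best-scoring alignment of two phoneme sequences."""
    n, m = len(word_a), len(word_b)
    # dp[i][j] = highest-probability alignment of word_a[:i] with word_b[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i < n and j < m:
                p = CONFUSION.get((word_a[i], word_b[j]), 0.01)
                dp[i + 1][j + 1] = max(dp[i + 1][j + 1], dp[i][j] * p)
            if i < n:
                dp[i + 1][j] = max(dp[i + 1][j], dp[i][j] * SKIP)   # delete a phoneme
            if j < m:
                dp[i][j + 1] = max(dp[i][j + 1], dp[i][j] * SKIP)   # insert a phoneme
    return dp[n][m]


sell = ["s", "eh", "l"]
shell = ["sh", "eh", "l"]
print(confusability(sell, shell))                    # high -> narrow window, steep roll-off
print(confusability(sell, ["sh", "eh", "l", "z"]))   # different lengths handled by the DP
```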
[0049] Referring to figure 7, there is shown a view 700 of a flowchart of an example implementation. At 702, a production or utterance is accessed. The access can result from recording an utterance or in any other way. The utterance can be recorded using a microphone such as the above microphone 106. The utterance is processed at 704 to extract features characteristic of the utterance or to parameterise the utterance, that is, to determine one or more than one parameter associated with the utterance. Any such feature extraction can comprise, for example, determining one or more than one acoustic vector that characterises, or is otherwise indicative of or related to, the utterance. Example implementations determine a number of acoustic vectors associated with an utterance. Each acoustic vector can be associated with a respective frame associated with the utterance. An utterance can have multiple associated frames. The frames may or may not be temporally overlapping. Example implementations of feature extraction can comprise using, or determining cepstral coefficients from, one or more than one of Linear Predictive Coding (LPC), Perceptual Linear Prediction (PLP), Mel Frequency Cepstral Coefficients (MFCC), PLP Relative Spectra (PLP-RASTA) and the like. Example implementations use MFCC. Example implementations use 12-dimensional MFCC features, normalised using cepstral mean normalisation. The MFCC features are produced using a 26-channel filterbank and sampled using a 25 ms analysis window and a 10 ms frame rate. Subsequently, energy normalisation and cepstral mean normalisation are applied.
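A sketch of that feature-extraction recipe is shown below, using the third-party python_speech_features library as one possible implementation; the library choice and the dropping of the zeroth (energy) coefficient to obtain 12 dimensions are assumptions of this example.

```python
import numpy as np
from python_speech_features import mfcc  # third-party library chosen for this sketch


def extract_features(signal, sample_rate=16000):
    feats = mfcc(signal, samplerate=sample_rate,
                 winlen=0.025, winstep=0.01,    # 25 ms analysis window, 10 ms frame rate
                 numcep=13, nfilt=26)           # 26-channel filterbank
    feats = feats[:, 1:13]                      # keep 12 cepstral coefficients (drop c0/energy)
    feats = feats - feats.mean(axis=0)          # cepstral mean normalisation
    return feats                                # shape: (n_frames, 12)


# One second of dummy audio at 16 kHz standing in for a recorded production.
signal = np.random.randn(16000)
print(extract_features(signal).shape)
```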
[0050] The one or more than one acoustic vector is output from 704 to 706 where it is processed by the speech engine or speech decoder. The speech engine 104 has access to the target model 508 and off target model 510. The speech decoder or speech engine 104 produces the output 114 that is passed to the correlator 116. Therefore, at 708, a determination is made regarding whether or not the output 114 is indicative of a closer match with the target model 508 or the off target model 510. If it is determined at 708 that a closer match has been established between the utterance and the off-target model 510, a respective feedback indication is give at 710. Example implementations can output the feedback in the form of at least one of a graphical output or a score. In the example implementation depicted, a feedback score of zero is output. If it is determined at 708 that a closer match has been established between the utterance and the target model 508, a respective utterance score is determined at 712 using figure 6, then a respective feedback is output at 714. Again, example implementations can be realised that provide at least one of a graphical output or a score. In the example depicted, the output is a score. It can be appreciated that when the score falls within the best scoring range, a score of 100% is returned. However, alternative implementations can be realised in which the score or feedback is output differently. For example, the score can be a range such as, one to five, where five represents utterances falling within the best scoring window and four to one represent scores that are progressively distant from the best scoring window, that is, they are scores that are associated with productions that have progressively poorer matches with the template for that utterance by the speaker.
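Putting the decision at 708 and the scoring at 712 together, a sketch of the flow of figure 7 might look as follows. It assumes models exposing an hmmlearn-style score() method and a window_score function like the one sketched earlier; the stub model and the numbers are invented for the demonstration.

```python
def assess_production(features, target_model, off_target_model,
                      s_ref, epsilon, window_score):
    """Return 0 for an off-target match, otherwise a 0-100 window-based score."""
    target_ll = target_model.score(features)           # decision at 708
    off_target_ll = off_target_model.score(features)

    if off_target_ll >= target_ll:                      # closer to the off-target model
        return 0                                        # step 710: zero feedback

    per_frame = target_ll / len(features)               # per-frame likelihood of the utterance
    return window_score(per_frame, s_ref, epsilon)      # steps 712 and 714


class _StubModel:
    """Stand-in for a trained model with an hmmlearn-style .score() method."""
    def __init__(self, log_likelihood):
        self._ll = log_likelihood

    def score(self, features):
        return self._ll


demo = assess_production(
    features=[[0.0] * 12] * 80,                         # 80 frames of dummy features
    target_model=_StubModel(-400.0),
    off_target_model=_StubModel(-450.0),
    s_ref=-5.1, epsilon=0.4,
    window_score=lambda s, ref, eps: 100.0 if abs(s - ref) <= eps / 2 else 50.0,
)
print(demo)  # matched the target model and fell inside the window -> 100.0
```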
[0051] It can, therefore, be appreciated that a score greater than zero means that the production of the speaker was recognised as the target sound, that is, it has a match with the template for that sound as opposed to having a match with the off-target or undesirable template. The closer the score is to 100, the closer the production has been recognised as being to the template. A score of 100 means that the production has been recognised as being very close to the template. Although the example implementation has been described with reference to a score or percentage of 100, as indicated above, a range of scores can alternatively be used such as, for example, the ranges 0 to 5 and 15 to 20.
[0052] After one of 710 and 714, control returns to 702 where the exercise can be repeated. Alternatively, the exercise of recording a production, processing that production, and outputting an indication associated with that production can be terminated.
[0053] Referring to figure 8, there is shown a block diagram illustrating components, according to some examples, able to read instructions from a machine-readable storage or computer-readable medium (e.g., a machine-readable storage medium) and perform any one or more of the methodologies discussed herein. The machine-readable storage or computer-readable medium can be non-transitory. Specifically, figure 8 shows a diagrammatic representation of hardware resources 800 including one or more processors (or processor cores) 810, one or more memory/storage devices 820, and, optionally, one or more communication resources 830, each of which are communicatively coupled via a bus 840.
[0054] The processors 810 (e.g., a central processing unit (CPU), a reduced instruction set computing (RISC) processor, a complex instruction set computing (CISC) processor, a graphics processing unit (GPU), a digital signal processor (DSP) such as a baseband processor, an application specific integrated circuit (ASIC), another type of processor, or any suitable combination thereof) may include, for example, a processor 812 and a processor 814, which may be implementations of processors 124 and 126. The memory/storage devices 820 may include main memory, disk storage, or any suitable combination thereof.
[0055] The communication resources 830 may include interconnection and/or network interface components or other suitable devices to communicate with one or more peripheral devices 804 and/or one or more databases 806 via a network 808. For example, the communication resources 830 may include wired communication components (e.g., for coupling via a Universal Serial Bus (USB)), cellular communication components, Near Field Communication (NFC) components, Bluetooth® components (e.g., Bluetooth® Low Energy), Wi-Fi® components, and other communication components.
[0056] Instructions 850 may comprise machine executable instructions such as, for example, software, a program, an application, an applet, an app, or other executable code for causing at least any of the processors 810 to perform any one or more of the methodologies discussed herein. The instructions 850 may reside, completely or partially, within at least one of the processors 810 (e.g., within the processor’s cache memory), the memory/storage devices 820, or any suitable combination thereof. Furthermore, any portion of the instructions 850 may be transferred to the hardware resources 800 from any combination of the peripheral devices 804 and/or the databases 806. Accordingly, the memory of processors 810, the memory/storage devices 820, the peripheral devices 804, and the databases 806 are examples of computer-readable and machine-readable media.
[0057] As used herein, the term "circuitry" may refer to, be part of, or include, an Application Specific Integrated Circuit (ASIC), an integrated circuit, an electronic circuit, one or more than one processor (shared, dedicated, or group), and/or memory (shared, dedicated, or group), that execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable hardware components that provide the described functionality. In some examples, the circuitry may be implemented in, or functions associated with the circuitry may be implemented by, one or more software or firmware modules. In some examples, circuitry may include logic, at least partially operable in hardware. Similarly, executable instructions may comprise instructions executable by a processor or instructions implemented in at least one of hardware or software such as, for example, instructions implemented using an ASIC or other logic.
[0058] Discussions herein utilising terms such as, for example, “processing”, “computing”, “calculating”, “determining”, “establishing”, “analyzing”, “checking”, or the like, may refer to operation(s) and/or process(es) of a processor, circuitry, logic, a computer, a computing platform, a computing system, or other electronic computing device, that manipulate and/or transform data represented as physical (e.g., electronic) quantities within the computer’s registers and/or memories into other data similarly represented as physical quantities within the computer’s registers and/or memories or other information storage medium that may store instructions to perform operations and/or processes.
[0059] Throughout the description and claims of this specification, the words “comprise” and “contain” and variations of them mean “including but not limited to”, and they are not intended to (and do not) exclude other moieties, additives, components, integers or steps. Throughout the description and claims of this specification, the singular encompasses the plural unless the context otherwise requires. In particular, where the indefinite article is used, the specification is to be understood as contemplating the plural as well as the singular, unless the context requires otherwise.
[0060] Features, integers, characteristics, compounds, chemical moieties or groups described in conjunction with a particular aspect, embodiment or example of the invention are to be understood to be applicable to any other aspect, embodiment or example described herein unless incompatible therewith. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and/or all of the steps of any method or process so disclosed, may be combined in any combination, except combinations where at least some of such features and/or steps are mutually exclusive. The invention is not restricted to the details of any foregoing embodiments. The invention extends to any novel one, or any novel combination, of the features disclosed in this specification (including any accompanying claims, abstract and drawings), or to any novel one, or any novel combination, of the steps of any method or process so disclosed. [0061] Although the above example implementations have been described with reference to assistive technology as particularly pertaining to a dysarthric speaker or other speech-impaired speaker, example implementations are not limited to such arrangements. Example implementations can be used in other contexts with other speakers. For instance, example implementations can be used to realise foreign language training systems in which a target or desired utterance is grouped with a corresponding or respective undesirable utterance.

Claims (31)

1. A speech recognition system, comprising a. a speech interface for receiving a production, b. a speech engine for processing the received production, and c. a template interface for accessing i. a speech template representing a target production and ii. a speech template representing an off target production; d. the speech engine being arranged to characterise the received production; e. the speech recognition system further comprising a correlator to determine, from the characterised production, a degree of correlation between the characterised production and the target and off target productions.
2. The speech recognition system of claim 1, in which the target production represents a target articulation derived from at least one speech articulation, by a respective speaker, of a given word.
3. The speech recognition system of claim 2, in which the target production represents a target articulation derived from a plurality of speech articulations, by a respective speaker, of the given word.
4. The speech recognition system of any preceding claim, in which the off target production represents a further articulation.
5. The speech recognition system of claim 4, in which the further articulation has a degree of similarity with the at least one speech articulation.
6. The speech recognition system of any preceding claim, further comprising an output interface for outputting an indication of at least one of: a. a degree of correlation between the received production and the template representing the target production, or b. a degree of correlation between the received production and the template representing the off-target production.
7. The speech recognition system of claim 6, in which the indication is at least one of a graphical indication or a numerical indication.
8. The speech recognition system of claim 7, in which the indication comprises a score.
9. The speech recognition system of claim 8, in which the score is calculated relative to a predetermined scoring window.
10. The speech recognition system of claim 9, in which the predetermined scoring window comprises a predetermined width.
11. The speech recognition system of claim 10, in which the predetermined width is associated with a degree of correlation between the received production and the target production required for determining a match between the received production and the target production.
12. The speech recognition system of either of claims 10 and 11, in which a narrower predetermined width corresponds to a higher degree of correlation between the received production and the target production required for determining a match between the received production and the target production.
13. The speech recognition system of any of claims 10 to 12, in which a wider predetermined width corresponds to a lower degree of correlation between the received production and the target production required for determining a match between the received production and the target production.
14. The speech recognition system of any of claims 9 to 13, in which the predetermined scoring window is flanked by a function for varying an indication of a degree of correlation between the received production and the target production with varying distance from a range associated with the predetermined scoring window.
15. A speech recognition method comprising a. receiving a production representing an utterance, b. accessing i. a speech template representing a target production and ii. a speech template representing an off target production; c. characterising the received production; d. determining, from the characterised production, a degree of correlation between the characterised production and the target and off target productions.
16. The speech recognition method of claim 15, in which the target production represents a target articulation derived from at least one speech articulation, by a respective speaker, of a given word.
17. The speech recognition method of claim 16, in which the target production represents a target articulation derived from a plurality of speech articulations, by a respective speaker, of the given word.
18. The speech recognition method of any of claims 15 to 17, in which the off target production represents a further articulation.
19. The speech recognition method of claim 18, in which the further articulation has a degree of similarity with the at least one speech articulation.
20. The speech recognition method of any of claims 15 to 17, further comprising outputting an indication of at least one of: a. a degree of correlation between the received production and the template representing the target production, or b. a degree of correlation between the received production and the template representing the off-target production.
21. The speech recognition method of claim 20, in which the indication is at least one of a graphical indication or a numerical indication.
22. The speech recognition method of claim 21, in which the indication comprises a score.
23. The speech recognition method of claim 22, comprising calculating the score relative to a predetermined scoring window.
24. The speech recognition method of claim 23, in which the predetermined scoring window comprises a predetermined width.
25. The speech recognition method of claim 24, in which the predetermined width is associated with a degree of correlation between the received production and the target production required for determining a match between the received production and the target production.
26. The speech recognition method of either of claims 24 and 25, in which a narrower predetermined width corresponds to a higher degree of correlation between the received production and the target production required for determining a match between the received production and the target production.
27. The speech recognition method of any of claims 24 to 26, in which a wider predetermined width corresponds to a lower degree of correlation between the received production and the target production required for determining a match between the received production and the target production.
28. The speech recognition method of any of claims 23 to 27, in which the predetermined scoring window is flanked by a function for varying an indication of a degree of correlation between the received production and the target production with varying distance from a range associated with the predetermined scoring window.
29. Machine executable instructions arranged, when executed, to realise a system of any of claims 1 to 14 or a method of any of claims 15 to 28.
30. Machine readable storage storing machine executable instructions of claim 29.
31. A speech processing system comprising circuitry arranged to implement a method of any of claims 15 to 28.
GB1711344.0A 2017-07-14 2017-07-14 Speech processing systems Withdrawn GB2564478A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1711344.0A GB2564478A (en) 2017-07-14 2017-07-14 Speech processing systems

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1711344.0A GB2564478A (en) 2017-07-14 2017-07-14 Speech processing systems

Publications (2)

Publication Number Publication Date
GB201711344D0 GB201711344D0 (en) 2017-08-30
GB2564478A true GB2564478A (en) 2019-01-16

Family

ID=59713571

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1711344.0A Withdrawn GB2564478A (en) 2017-07-14 2017-07-14 Speech processing systems

Country Status (1)

Country Link
GB (1) GB2564478A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4802231A (en) * 1987-11-24 1989-01-31 Elliot Davis Pattern recognition error reduction system
US4896358A (en) * 1987-03-17 1990-01-23 Itt Corporation Method and apparatus of rejecting false hypotheses in automatic speech recognizer systems
US20070225975A1 (en) * 2006-03-27 2007-09-27 Kabushiki Kaisha Toshiba Apparatus, method, and computer program product for processing voice in speech
US20090264789A1 (en) * 2007-09-26 2009-10-22 Medtronic, Inc. Therapy program selection


Also Published As

Publication number Publication date
GB201711344D0 (en) 2017-08-30

Similar Documents

Publication Publication Date Title
CN108305615B (en) Object identification method and device, storage medium and terminal thereof
US9536525B2 (en) Speaker indexing device and speaker indexing method
US9514747B1 (en) Reducing speech recognition latency
WO2017187712A1 (en) Information processing device
US10553240B2 (en) Conversation evaluation device and method
CN106228988A (en) A kind of habits information matching process based on voiceprint and device
US10748544B2 (en) Voice processing device, voice processing method, and program
CN112786052A (en) Speech recognition method, electronic device and storage device
JPWO2018051945A1 (en) Voice processing apparatus, voice processing method, and program
US20120078625A1 (en) Waveform analysis of speech
KR20230150377A (en) Instant learning from text-to-speech during conversations
CN110570842B (en) Speech recognition method and system based on phoneme approximation degree and pronunciation standard degree
CN109545196B (en) Speech recognition method, device and computer readable storage medium
CN113112575B (en) Mouth shape generating method and device, computer equipment and storage medium
Kim et al. Combination of Multiple Speech Dimensions for Automatic Assessment of Dysarthric Speech Intelligibility.
US11929058B2 (en) Systems and methods for adapting human speaker embeddings in speech synthesis
Rajpal et al. Quality assessment of voice converted speech using articulatory features
TWI578307B (en) Acoustic mode learning device, acoustic mode learning method, sound recognition device, and sound recognition method
Tao et al. Realistic visual speech synthesis based on hybrid concatenation method
Borgstrom et al. A low-complexity parabolic lip contour model with speaker normalization for high-level feature extraction in noise-robust audiovisual speech recognition
JP4839970B2 (en) Prosody identification apparatus and method, and speech recognition apparatus and method
GB2564478A (en) Speech processing systems
Mannem et al. Acoustic and Articulatory Feature Based Speech Rate Estimation Using a Convolutional Dense Neural Network.
JP4864783B2 (en) Pattern matching device, pattern matching program, and pattern matching method
Huang et al. A review of automated intelligibility assessment for dysarthric speakers

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)