GB2572222A - A speech recognition method and apparatus - Google Patents

A speech recognition method and apparatus

Info

Publication number
GB2572222A
GB2572222A
Authority
GB
United Kingdom
Prior art keywords
speech signal
speech
map
signal
speech recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB1804708.4A
Other versions
GB2572222B (en)
GB201804708D0 (en)
Inventor
Thanh Do Cong
Stylianou Ioannis
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Corp
Original Assignee
Toshiba Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Corp filed Critical Toshiba Corp
Priority to GB1804708.4A priority Critical patent/GB2572222B/en
Publication of GB201804708D0 publication Critical patent/GB201804708D0/en
Publication of GB2572222A publication Critical patent/GB2572222A/en
Application granted granted Critical
Publication of GB2572222B publication Critical patent/GB2572222B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/20 - Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Quality & Reliability (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

An automatic speech recognition (ASR) system for noisy environments processes received speech S101 to produce a conspicuity or human-auditory saliency map 105 of the speech. The map is used to enhance salient words or phrases by weighting them before acoustic parameters (FBANK features) are extracted and converted to graphemes or phonemes by a Convolutional Neural Network (CNN) acoustic model S113. The saliency map may be constructed according to speech intensity or contrast via a log-spectral (307) Gabor filter (S311), and the determined words may be output to a dialogue system for query response or command recognition. The system may first be trained (fig. 2, 'A' path) via a Kaldi training set (fig. 6).

Description

A Speech Recognition Method and Apparatus
FIELD
Embodiments described herein relate to methods and apparatus for speech recognition.
BACKGROUND
Automatic speech recognition (ASR) is increasingly a feature of many devices. Many users now control their mobile phones by voice commands. In order for a mobile phone to respond to a vocal command, it must first understand the command. There are also many help systems provided with dialogue intelligence, and for users to interact with these systems, the system must be provided with some form of ASR.
However, the user will not always be using ASR in a quiet environment. For example, a user may be using their mobile phone outdoors where the ambient noise level is high, and help systems provided in stations and similar locations will be subject to a lot of background noise. This background noise can degrade the performance of the ASR.
Robust ASR aims at making ASR performance less dependent on the working environment of the ASR system. Spectral masking is one robust ASR approach which attempts to enhance the speech spectrum prior to acoustic feature extraction for ASR. In the spectral masking approach, a mask weight is computed for a time-frequency representation of speech, such as the magnitude spectrum or spectrogram of the speech signal. Each time-frequency unit is multiplied by a mask weight in order to emphasize the regions that are dominated by speech and attenuate the regions that are dominated by other sources, for instance noise. Values of the mask are either binary or continuous. The computation of the values of the mask is often based on the ratio of the target energy to the mixture energy, or on the probability that a specific time-frequency unit belongs to the target speech.
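By way of illustration only, the following Python sketch shows the spectral masking operation described above: a mask value per time-frequency unit is applied to the magnitude spectrogram by element-wise multiplication. The function names and the ideal-ratio-mask formula are assumptions made for this sketch, not features taken from the embodiments described below.

```python
# Minimal sketch of the spectral masking approach described above; the function
# names and the ideal-ratio-mask formula are assumptions for illustration only.
import numpy as np

def apply_spectral_mask(magnitude_spec: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Weight each time-frequency unit of the magnitude spectrum by a mask value."""
    assert magnitude_spec.shape == mask.shape
    return magnitude_spec * mask

def ideal_ratio_mask(speech_power: np.ndarray, noise_power: np.ndarray) -> np.ndarray:
    """Continuous mask based on the ratio of target energy to mixture energy."""
    return speech_power / (speech_power + noise_power + 1e-12)
```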
BRIEF DESCRIPTION OF THE FIGURES
Embodiments will now be described with reference to the following figures:
Figure 1(a) shows a schematic of a speech recognition apparatus in accordance with an embodiment;
Figure 1(b) shows a speech recognition apparatus in accordance with an embodiment in use in external surroundings;
Figure 1(c) shows a speech recognition apparatus in accordance with an embodiment in use in a train station;
Figure 2 is a flowchart showing a method in accordance with an embodiment;
Figure 3 is a flowchart showing a stage in constructing a saliency map;
Figure 4 is a flowchart showing the construction of a conspicuity map;
Figures 5 (a) to (c) are examples of Gabor transforms;
Figure 6 is a flowchart showing a method of training an ASR system in accordance with an embodiment;
Figure 7 is a flowchart showing a multistream ASR system in accordance with an embodiment; and
Figure 8 is a flowchart showing a multistream ASR system in accordance with a further embodiment.
DETAILED DESCRIPTION OF THE FIGURES
In an embodiment, a speech recognition apparatus is provided, said apparatus comprising:
a processor, said processor being adapted to:
receive an audio signal comprising a speech signal;
process said speech signal to produce a map of said speech signal indicating salient parts of said signal;
enhance said received speech signal using said map;
determine words present in said speech signal; and
provide an output in response to said determined words.
The above apparatus exploits the way in which humans recognise speech to enhance speech recognition in noisy environments. Human auditory attention is one of the mechanisms which helps humans to recognise speech better in noisy and adverse environments. The working principle of the human auditory attention mechanism is to some extent similar to the spectral masking approach: information which attracts interest will receive more focus, or weighting, and information which is not attended to will receive less focus.
An auditory saliency map models the mechanism for allocating auditory attention. This model has been applied in a number of speech processing applications, for instance prominence detection or emotion recognition.
As the speech signal is enhanced using salient features, the speech recognition apparatus is robust to ambient noise. However, unlike other systems that optimise the speech signal for ambient noise, the system does not need to measure the ambient noise. Hence the apparatus requires little more processing power than a speech recognition system in which no signal enhancement is used. The system can operate in real time.
In an embodiment, a speech spectrum enhancement method is provided using the auditory saliency map. The proposed method enhances a time-frequency representation of speech by putting a weight on each time-frequency unit, similar to a spectral masking approach.
The output may take many forms. For example, the output may be a text output of said determined words for a dictation system or the like. Alternatively, the determined words may be provided to a dialogue system and said output is a response from said dialogue system. In a further embodiment, said output is a control signal activated by said determined words, for example a command to a music player on a mobile phone or another voice-activated application.
In an embodiment, said processor is adapted to determine words present in said speech signal by extracting acoustic features from said speech signal and converting said acoustic features to acoustic units using an acoustic model, wherein said acoustic units comprise phonemes and/or graphemes.
The speech signal may be enhanced with said map prior to the extraction of acoustic features from said speech signal. In some embodiments, the enhancement takes the form of weighting the speech signal. Element-wise multiplication may be used to weight the signal.
In an embodiment, the processor is adapted to produce a map of said speech signal indicating salient parts by filtering said speech signal using a filter to approximate the function of the auditory receptive fields of a brain. Further, the processor is adapted to produce a map of said speech signal indicating salient parts by filtering said speech signal, wherein the speech signal is converted to the time/frequency domain prior to filtering.
In an embodiment, said processor is adapted to produce a map of said speech signal indicating said salient parts by taking a logarithm of the time/frequency domain signal and filtering this signal in parallel at a plurality of different scales to produce a plurality of filtered spectra, the filtered spectra being combined to produce the map. The filter may be a Gabor filter. Such a filter can be adapted to approximate the function of the auditory receptive fields in order to extract features relating to intensity, frequency contrast or temporal contrast.
The processor may be adapted to combine maps determined from filters relating to intensity, frequency contrast and temporal contrast into a single map for enhancing said received speech signal.
In a further embodiment, a multi-stream speech recognition apparatus is provided comprising a plurality of single stream speech recognition apparatus, wherein each single stream speech recognition apparatus is an apparatus as described above, each single stream apparatus comprising a filter adapted for a different property or combination of properties from the filters of the other single streams. The outputs from the acoustic models of each single stream may be combined prior to a common decoder for all streams that determines said words. In a further embodiment, each stream comprises a decoder that receives the output of the respective acoustic model and outputs a lattice, wherein the lattices produced for each stream are combined into a common lattice and said words are determined from said common lattice.
As explained above, the processing of the signal to enhance the signal requires little extra processing beyond a standard ASR system and therefore the above apparatus is well suited to being provided in a mobile telephone.
Also, as the above apparatus provides robust speech recognition, it is useful in a device combined with a dialogue management system, said dialogue management system being adapted to provide a response to a query derived from the said speech signal. Such a system can be located in a station or other public place.
In a further embodiment, the apparatus further comprises a control system, said control system being adapted to provide a control signal in response to a query derived from the said speech signal. For example, the control signal can cause a telephone to play a particular song, display a webpage, call a number in a contacts list etc.
In a further aspect, a speech recognition method is provided comprising:
receiving an audio signal comprising a speech signal;
processing said speech signal to produce a map of said speech signal indicating salient parts of said signal;
enhancing said received speech signal using said map;
determining words present in said speech signal; and
providing an output in response to said determined words.
In a yet further aspect, a method of training a speech recognition apparatus is provided, said method comprising:
receiving training data, said training data comprising a speech signal with corresponding text;
processing said speech signal to produce a map of said speech signal indicating salient parts of said signal;
enhancing said received speech signal using said map;
training an acoustic model that converts an input speech signal to text using said enhanced speech signal and corresponding text.
There is also provided a carrier medium comprising computer readable code configured to cause a computer to perform any of the above methods.
Since some methods in accordance with embodiments can be implemented by software, some embodiments encompass computer code provided to a general purpose computer on any suitable carrier medium. The carrier medium can comprise a storage medium such as a floppy disk, a CD ROM, a magnetic device or a programmable memory device, or any transient medium such as any signal e.g. an electrical, optical or microwave signal. The carrier medium may comprise a non-transitory computer readable storage medium.
Figure 1(a) is a schematic illustration of an ASR system 1. The system 1 comprises a processor 3, and takes input speech signals. The system may output text signals. A computer program 5 is stored in non-volatile memory. The non-volatile memory is accessed by the processor 3 and the stored computer program code is retrieved and executed by the processor 3. The storage 7 stores data that is used by the program 5.
The system 1 further comprises an input module 11. The input module 11 is connected to an input 15 for receiving data relating to a speech signal. The input 15 may be an interface that allows a user to directly input data, for example a microphone. Alternatively, the input 15 may be a receiver for receiving data from an external storage medium or a network.
The system 1 may further comprise an output module 13. Connected to the output module 13 may be an output 17. The output 17 may be an interface that displays data to the user, for example a screen. Alternatively, the output 17 may be a transmitter for transmitting data to an external storage medium or a network.
Alternatively, the ASR system may be simply part of a system, for example, the ASR system may be part of a spoken dialogue system, in which the output of the ASR is used to generate a system action, e.g. a response to the input speech signal. In this case, the ASR does not output to an interface, but provides output information to a further functional part of the system.
The system may be provided such that it receives speech from a location with a high level of ambient noise, for example, on a mobile phone, linked to a chatbot in a public place such as a station etc. For the reasons described later, the ASR method is of particular use in situations where there is background noise.
Figure 1(b) is a schematic of such a situation. The user 53 has a mobile telephone 51. The mobile telephone 51 comprises an ASR system as described with reference to figure 1(a). The user 53 is using the mobile phone outside, as can be seen from the presence of the sun 59. The user is trying to provide commands to the telephone 51 in the presence of background noise, for example from cars 57.
The mobile phone 51 is connected to a remote server 55 and is able to action the user's commands either using its internal processor or by sending requests to the server 55.
The mobile phone 51 may be provided with a dictation application that allows the user to record notes. Alternatively, the mobile phone 51 may be provided with many types of voice-activated functions, for example functions that allow a user to make a telephone call by just saying the recipient's name, applications that allow a user to select a song to be played, a video clip to be played or a particular webpage to be displayed.
The mobile phone 51 is able to understand the user’s commands even in the presence of high ambient noise due to the use of signal enhancing that will be described with reference to figure 2 onwards.
Figure 1(c) schematically shows a further embodiment. Here, chatbot 63 is used as part of a help system in a railway station. The chatbot 63 receives a query from a user 61 and can output a response to that query. For example, the chatbot can provide response to enquiries concerning the times of trains, location of facilities in the station etc.
A railway station has considerable ambient noise. For example, there is the noise of trains 65 and there are also continual announcements from speakers 67. Therefore, the chatbot 63 receives the user's query in the presence of very high ambient noise. The chatbot uses the system that will be described with reference to figure 2 onwards in order to enhance the user's voice input and better recognise the query made by the user 61.
In use, the system 1 receives speech signals through the input 15. The program 5 is executed on processor 3 in the manner which will be described with reference to the following figures. It may output a text signal at the output 17. The system 1 may be configured and trained in the manner which will be described with reference to the following figures.
Figure 2 is a flow chart showing a method in accordance with an embodiment. In step S101, input speech is subjected to a one-dimensional Fourier transform to obtain a two-dimensional time-frequency representation shown as magnitude spectrum 103.
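By way of illustration, a minimal sketch of step S101 is given below, assuming a short-time Fourier transform is used to obtain the magnitude spectrum 103; the frame length, hop size and window are illustrative choices rather than values disclosed in the embodiment.

```python
# Sketch of step S101: a short-time Fourier transform producing the magnitude
# spectrum 103. Frame length, hop size and window are illustrative assumptions.
import numpy as np

def magnitude_spectrum(x: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Return |X| with shape (num_frames, frame_len // 2 + 1) for 16 kHz speech."""
    window = np.hamming(frame_len)
    num_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(num_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))
```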
The flow chart of figure 2 actually shows two independent processes for ASR in accordance with different embodiments of the invention. First, the method will be described using path "A" of the figure. In path A, an auditory saliency map is calculated in step S105. The purpose of the auditory saliency map is to represent the conspicuity, or saliency, at every location in a time-frequency representation of speech, namely the magnitude spectrum of speech or speech spectrogram, by a scalar quantity to guide the selection of attended locations based on the spatial distribution of saliency.
How the auditory saliency map is calculated will now be explained with reference to Figure 3. The auditory saliency map model was developed using natural acoustic scenarios and attempts to reproduce human judgments of auditory saliency and to predict the detectability of salient sounds embedded in noisy backgrounds.
In an embodiment, three conspicuity maps 201, 203, 205 computed from intensity (I), frequency contrast (F), and temporal contrast (T) features are linearly combined in step S207 to produce saliency map 209.
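As an illustration of step S207, the linear combination can be sketched as follows; the equal weights are an assumption made only for this sketch, since the embodiment does not fix the combination weights.

```python
# Sketch of step S207: the saliency map 209 as a linear combination of the
# intensity, frequency-contrast and temporal-contrast conspicuity maps.
# Equal weights are an assumption made only for this sketch.
def combine_conspicuity_maps(cm_i, cm_f, cm_t, weights=(1.0 / 3, 1.0 / 3, 1.0 / 3)):
    return weights[0] * cm_i + weights[1] * cm_f + weights[2] * cm_t
```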
Figure 4 schematically shows a method for computing a conspicuity map of the types shown in figure 3.
In step S301, a speech signal x[n] is input. The magnitude spectrum X of x[n], 305, is obtained through a Fourier transform of x[n] in step S303.
Next, in step S307, the log magnitude spectrum X is computed as X = log(X). X is then decimated by factors of 2^k, k = 0, ..., N-1, to create N log magnitude spectra X0, X1, ..., XN-1 in parallel steps S309_0 to S309_(N-1). These log magnitude spectra are then filtered by a 2D Gabor filter G in steps S311_0 to S311_(N-1) to create N maps M0, M1, ..., MN-1, where Mk = Xk * G, k = 0, ..., N-1.
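By way of illustration, steps S307 to S311 can be sketched in Python as follows; the use of scipy for the 2D convolution and plain subsampling on both axes for the decimation are assumptions of this sketch, and a possible construction of the Gabor filter itself is sketched further below.

```python
# Sketch of steps S307 to S311: log magnitude spectrum, decimation by 2**k and
# 2D Gabor filtering at each scale. Decimation by simple subsampling on both
# axes and the use of scipy for the convolution are assumptions of this sketch.
import numpy as np
from scipy.signal import convolve2d

def multiscale_gabor_maps(mag_spec, gabor, num_scales=4):
    """Return maps M_k = X_k * G for k = 0..N-1, each at its own resolution."""
    log_spec = np.log(mag_spec + 1e-12)               # step S307
    maps = []
    for k in range(num_scales):                        # steps S309_0 .. S309_(N-1)
        decimated = log_spec[::2 ** k, ::2 ** k]       # decimate by a factor of 2**k
        maps.append(convolve2d(decimated, gabor, mode='same'))  # steps S311_0 .. S311_(N-1)
    return maps
```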
The 2D Gabor filter, which is the product of a 2D sinusoidal plane wave and a 2D Gaussian envelope, is used to approximate the function of the auditory receptive fields.
Figure 5 shows plots of three 2D Gabor filters. In this embodiment, three different 2D Gabor filters are used to compute the three different conspicuity maps.
A 2D Gabor filter is the product of a Gaussian envelope and a sinusoid:
G(x,y) = S(x,y)Wr(x,y)
The sinusoid:
S(x,y) = cos(2π(u0x + v0y))
The parameters u0 and v0 define the spatial frequency of the sinusoid in Cartesian coordinates. In an embodiment u0 = 1 and v0 = 1.
The Gaussian envelope:
Wr(x,y) = exp(-π(a²((x-x0)r)² + b²((y-y0)r)²))
where (x0,y0) is the peak of the function, a and b are scaling parameters of the Gaussian, and the subscript r stands for a rotation operation such that:
(x-x0)r = (x-x0)cosθ + (y-y0)sinθ and (y-y0)r = -(x-x0)sinθ + (y-y0)cosθ
In an embodiment, θ = 0 for the intensity filter (GI), θ = 0 for the frequency contrast filter (GF) and θ = 90° for the temporal contrast filter (GT).
In the above configuration, the Gaussian gets smaller in the space domain as a and b get larger. In an embodiment, possible values are a = 1/√(50π) and b = 1/√(0.07π).
Sidebands in the frequency direction and a possible post-inhibition can also be added to the Gabor filters. Among the three filters GI, GF and GT, sidebands are added to GF and a post-inhibition is added to GT. In an embodiment, the differences between the three filters are in the sidebands and post-inhibition.
Figures 5(a), 5(b) and 5(c) show these three filters GI, GF and GT, which are used to extract features related to sound intensity, frequency contrast and temporal contrast, respectively, from the speech spectrum. The conspicuity maps computed by using GI, GF and GT are denoted as CMI, CMF and CMT, respectively.
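The Gabor filter equations above can be transcribed into code as the following sketch; the grid size, the sampling of x and y over [-1, 1] and the default parameter values are assumptions, and the sidebands and post-inhibition mentioned above are not modelled here. Under these assumptions, approximations of GI, GF and GT would be obtained with theta_deg set to 0, 0 and 90 respectively.

```python
# Transcription of the Gabor filter equations above into code. The grid size,
# the sampling of x and y over [-1, 1] and the default parameter values are
# assumptions; the sidebands and post-inhibition mentioned above are not modelled.
import numpy as np

def gabor_filter_2d(size=9, u0=1.0, v0=1.0, a=1.0, b=1.0,
                    theta_deg=0.0, x0=0.0, y0=0.0):
    """G(x, y) = S(x, y) * Wr(x, y) sampled on a size x size grid."""
    theta = np.deg2rad(theta_deg)
    coords = np.linspace(-1.0, 1.0, size)
    x, y = np.meshgrid(coords, coords)
    # Rotated coordinates (x - x0)_r and (y - y0)_r
    xr = (x - x0) * np.cos(theta) + (y - y0) * np.sin(theta)
    yr = -(x - x0) * np.sin(theta) + (y - y0) * np.cos(theta)
    envelope = np.exp(-np.pi * (a ** 2 * xr ** 2 + b ** 2 * yr ** 2))  # Wr(x, y)
    sinusoid = np.cos(2 * np.pi * (u0 * x + v0 * y))                    # S(x, y)
    return sinusoid * envelope
```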
The differences between the maps Mk, k = 0, ..., N-1, at different scales are then computed through a center-surround difference computation in step S313. The across-scale difference between two maps is obtained by interpolating the coarser map to the finer scale and performing point-by-point subtraction; this computation mimics the properties of local cortical inhibition. Given a scale k, k = 0, ..., N-1, the difference is computed between the map at scale k and the maps at scales k+1 and k+2 to create two feature maps Fk,k+1 and Fk,k+2 as follows: Fk,k+1 = Mk - M'k+1 and Fk,k+2 = Mk - M'k+2, where M'k+1 and M'k+2 are the interpolated versions of Mk+1 and Mk+2 respectively, having the same size as Mk. A threshold is then applied to the resulting feature maps to keep only their positive values. The feature maps 315 are then normalized by a normalization algorithm in step S317 which coarsely replicates the cortical lateral inhibition mechanism. This algorithm first normalizes the values in a feature map to a fixed range [0..Φ], where Φ is the global maximum of the map. The feature map is then globally multiplied by (Φ - φ), where φ is the average of the local maxima of the map. After being normalized, the feature maps are combined to create the conspicuity map in step S319.
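By way of illustration, steps S313 to S319 can be sketched as follows; the interpolation routine, the local-maximum window size and the final across-scale summation into the conspicuity map are implementation assumptions of this sketch.

```python
# Sketch of steps S313 to S319: across-scale (center-surround) differences,
# thresholding to positive values, the (Phi - phi) normalization and the final
# combination into a conspicuity map. The interpolation routine, local-maximum
# window size and across-scale summation are implementation assumptions.
import numpy as np
from scipy.ndimage import zoom, maximum_filter

def center_surround(maps):
    """F(k, k+d) = M_k - interp(M_(k+d)) for d in {1, 2}, thresholded at zero."""
    feature_maps = []
    for k in range(len(maps) - 2):
        for d in (1, 2):
            coarse = maps[k + d]
            factors = (maps[k].shape[0] / coarse.shape[0],
                       maps[k].shape[1] / coarse.shape[1])
            diff = maps[k] - zoom(coarse, factors, order=1)  # interpolate to the finer scale
            feature_maps.append(np.maximum(diff, 0.0))       # keep positive values only
    return feature_maps

def normalize(fmap, local_size=7):
    """Values lie in [0, Phi]; multiply globally by (Phi - phi), phi = mean of local maxima."""
    phi = fmap.max()
    if phi <= 0:
        return fmap
    local_max = maximum_filter(fmap, size=local_size)
    peaks = fmap[(fmap == local_max) & (fmap > 0)]
    phi_bar = peaks.mean() if peaks.size else 0.0
    return fmap * (phi - phi_bar)

def conspicuity_map(maps, out_shape):
    fmaps = [normalize(f) for f in center_surround(maps)]
    resized = [zoom(f, (out_shape[0] / f.shape[0], out_shape[1] / f.shape[1]), order=1)
               for f in fmaps]
    return np.sum(resized, axis=0)
```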
Returning to figure 3, a conspicuity map for intensity is produced in step S201. In parallel, a conspicuity map for frequency contrast is produced in step S203 and in parallel a conspicuity map for temporal contrast is produced in step S205. The method described in relation to figure 4 is used to produce each conspicuity map.
Although three conspicuity maps are combined in figure 3, it is possible for the saliency map to include just one conspicuity map or two out of the three. Further conspicuity maps could also be derived and combined to produce the saliency map. In some of the following embodiments, a saliency map that is a linear combination of two or more of the conspicuity maps is used. Further conspicuity maps based on audio parameters that allow the brain to extract conspicuous auditory features, to eventually improve speech recognition performance, can also be used.
Returning now to the flow chart of figure 2, the saliency map is produced as discussed above in relation to figures 3 to 5. The conspicuity maps and saliency maps are applied to enhance the magnitude spectrum prior to acoustic feature extraction for ASR.
In an embodiment, these maps are computed from one speech utterance and then used to weight the magnitude spectrum matrix, computed from the same utterance, by element-wise multiplication in step S109. In an embodiment, a speech utterance is a segment of speech which begins when the speaker starts speaking and ends when the speaker stops speaking to the automatic speech recognition system. The typical length of a speech utterance is between 0 and 10 seconds.
Acoustic features for ASR are then extracted from the weighted magnitude spectrum matrix using the Mel filter-bank (FBANK) in step S111.
In this embodiment, weighting is performed using element-wise multiplication between the conspicuity/saliency maps and the linear-scale magnitude spectrum, because there is a log function within the FBANK feature extraction which is applied to the sums of the element-wise multiplication between the magnitude spectrum and the Mel-scale filter-bank.
Further, in this embodiment, an exponential function is applied to the conspicuity and saliency maps in step S106 prior to multiplying them with the linear-scale magnitude spectrum, because these maps are computed from the log magnitude spectrum. The application of an exponential function to the conspicuity and saliency maps, whose values are in the range [0..1], transforms the values of the maps to values greater than or equal to 1. This means the magnitude spectrum is not reduced but enhanced at all time-frequency units. The degree of enhancement at each time-frequency unit depends on the spatial distribution of saliency throughout the maps.
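A minimal sketch of steps S106 and S109 is given below, assuming the map has already been normalised to the range [0..1]; resizing the map to the shape of the magnitude spectrum before weighting is an assumption of this sketch.

```python
# Sketch of steps S106 and S109: exponentiate the map (values assumed to be in
# [0..1]) so every weight is >= 1, then weight the linear-scale magnitude
# spectrum by element-wise multiplication. Resizing the map to the spectrum's
# shape is an assumption of this sketch.
import numpy as np
from scipy.ndimage import zoom

def enhance_spectrum(mag_spec, saliency_map):
    if saliency_map.shape != mag_spec.shape:
        factors = (mag_spec.shape[0] / saliency_map.shape[0],
                   mag_spec.shape[1] / saliency_map.shape[1])
        saliency_map = zoom(saliency_map, factors, order=1)
    weights = np.exp(saliency_map)    # step S106: map values become >= 1
    return mag_spec * weights         # step S109: element-wise weighting
```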
In this embodiment, Mel filter-bank (FBANK) features, which are created by skipping the discrete cosine transform (DCT) in the Mel frequency cepstral coefficients (MFCCs) computation, are used. Convolutional neural network (CNN) acoustic models are used in hybrid HMM/CNN ASR systems in step S113.
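By way of illustration, FBANK feature extraction as described (a Mel filter-bank applied to the weighted magnitude spectrum followed by a log, i.e. MFCC computation with the DCT skipped) can be sketched as follows; the triangular Mel filter construction is a standard textbook form and an assumption, not a definition taken from the embodiment.

```python
# Sketch of FBANK feature extraction as described: Mel filter-bank applied to the
# (weighted) magnitude spectrum followed by a log, i.e. MFCC computation with the
# DCT skipped. The triangular Mel filter construction is a standard textbook form
# and an assumption rather than a definition taken from the embodiment.
import numpy as np

def mel_filterbank(num_filters, n_fft_bins, sample_rate):
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(0.0, mel(sample_rate / 2.0), num_filters + 2)
    bins = np.floor(inv_mel(mel_points) / (sample_rate / 2.0) * (n_fft_bins - 1)).astype(int)
    fbank = np.zeros((num_filters, n_fft_bins))
    for i in range(num_filters):                       # triangular filters
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c + 1] = np.linspace(0.0, 1.0, c - l + 1)
        fbank[i, c:r + 1] = np.linspace(1.0, 0.0, r - c + 1)
    return fbank

def fbank_features(weighted_mag_spec, sample_rate=16000, num_filters=40):
    """40-dimensional log Mel filter-bank features per frame (step S111)."""
    H = mel_filterbank(num_filters, weighted_mag_spec.shape[1], sample_rate)
    return np.log(weighted_mag_spec @ H.T + 1e-12)
```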
The posterior probabilities at the output of the CNN acoustic models at step S113 are then passed into the weighted finite state transducer (WFST) decoding framework S115 to create lattices 117 that are used to produce the output text. The CNNs can be trained with FBANK features extracted from the non-enhanced magnitude spectrum or from the enhanced magnitude spectrum in the scenarios where maps are applied, which will be described later.
The embodiments described herein relate to HMM/CNN based ASR systems. However, the above technique can be applied to any acoustic model from GMM-based to neural network-based systems, for example HMM/GMM or HMM/DNN.
In figure 2, two ASR methods are shown. The ASR method shown in path B is similar to that of path A. To avoid unnecessary repetition, like reference numerals will be used to denote like features.
In path B, in step S104, the magnitude spectrum 103 is weighted using a conspicuity map. This is a conspicuity map derived exactly as described in relation to figures 4 and 5. The conspicuity map can be related to frequency contrast, temporal contrast or intensity.
All other features are exactly the same as those described in relation to the saliency map shown in path A.
Figure 6 relates to a method for training an acoustic model to use in the ASR method of figure 2.
As explained above, the ASR method may use an acoustic model which has been trained using conventional methods. However, in one embodiment, the training method of figure 6 is used. Here, training data in the form of speech is provided in step S401. Again, as described in relation to the recognition of figure 2, a Fourier transform is performed in step S403.
The arrangement of figure 6 is similar to that of figure 2, where path A and path B are shown. Path A relates to the use of the saliency map whereas path B relates to the use of a conspicuity map.
The magnitude spectrum 405 is obtained from the Fourier transform step S403. Looking at path A, the saliency map 407 is determined in the same way as described in relation to figures 3 to 5. The magnitude spectrum 405 is then weighted using element-wise multiplication derived from the saliency map in step S409. In the same way as described in figure 2, the features are then extracted in step S411 using an FBANK.
The training of the model is then performed in step S413. During the training stage, as the training data includes both speech and the corresponding text, the state alignment is performed using known methods. In the example of figure 6, state-level alignments are performed using the Kaldi training set. However, other training tools could be used. The outputted model 415 is then a trained model which has been optimised for the saliency map.
In the same manner as described in path A, in path B the data is weighted using a single conspicuity map as opposed to a saliency map. All other features are identical.
In a further embodiment, multi-stream speech recognition is used. Here, information from different speech recognition streams is used to improve ASR performance. In the examples that will be discussed below, the multiple streams come from different combinations of conspicuity and saliency maps.
The combination of different ASR streams can exploit the particular strengths of each technique, for instance the features used in each stream. The combination can happen at different levels of the ASR system, from features to posterior probabilities or lattices.
Figure 7 shows a flow diagram of a multi-stream ASR system where the different streams are combined at the posterior probabilities level.
To avoid any unnecessary repetition, like reference numerals will be used to denote like features. Figure 7 shows four streams 100₀, 100₁, 100₂, 100₃. These are three streams, one for each of the conspicuity maps, and a further stream with a saliency map. The streams are processed in the same manner as described with reference to figure 2.
The output of the parallel acoustic models in step S113 is a set of words together with their posterior probabilities for the given utterance, for each acoustic model. In this embodiment, the posterior probabilities are combined at this stage in step S151.
Posterior probabilities can be combined using a number of methods. In this embodiment, an inverse entropy combination is used, which computes the weights for the posterior probabilities produced by the different CNN acoustic models. In the combination, each stream is attributed a weight λi, i = 1, ..., 4, with the λi summing to 1. The combined posterior probabilities are then passed to a single decoding framework in step S153 to produce lattice 155.
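A sketch of an inverse entropy combination of the posteriors is given below; the per-frame inverse-entropy weighting shown here is a common form and an assumption, since the embodiment only states that the weights are computed by inverse entropy and that the stream weights sum to 1.

```python
# Sketch of an inverse entropy combination of the posterior probabilities from
# several acoustic models (step S151). The per-frame inverse-entropy weighting
# shown here is a common form and an assumption; the embodiment only states that
# the weights are computed by inverse entropy and that the stream weights sum to 1.
import numpy as np

def inverse_entropy_combination(posterior_streams):
    """posterior_streams: list of (num_frames, num_states) arrays with rows summing to 1."""
    eps = 1e-12
    entropies = [-np.sum(p * np.log(p + eps), axis=1, keepdims=True) for p in posterior_streams]
    inv = [1.0 / (h + eps) for h in entropies]
    total = np.sum(inv, axis=0)
    weights = [w / total for w in inv]                      # per-frame weights summing to 1
    combined = np.sum([w * p for w, p in zip(weights, posterior_streams)], axis=0)
    return combined / combined.sum(axis=1, keepdims=True)   # renormalise rows
```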
Figure 8 shows a further embodiment for multi-stream combination. Here, the combination of the results from the different streams happens at the lattice level.
To avoid any unnecessary repetition, like reference numerals will be used to denote like features. Figure 8 shows four streams 100₀, 100₁, 100₂, 100₃. These are three streams, one for each of the conspicuity maps, and a further stream with a saliency map. The streams are processed in the same manner as described with reference to figure 2.
The output from each acoustic model in step S113 is then passed to an individual decoding framework for each model. From this, lattices 117 are produced. However, in step S131, the lattices are then combined to produce a combined lattice 133.
The lattices 117 are generated from four ASR streams which use the three conspicuity maps CMI, CMF and CMT and a saliency map. In this embodiment, lattices are combined based on Bayes risk minimization, which is an efficient method for lattice combination. The weights λi, i = 1, ..., 4, with the λi summing to 1, are attributed to the individual streams during the combination.
To test the above, the ASR systems were trained and evaluated using the Aurora-4 corpus. Aurora-4 is a medium vocabulary task based on the Wall Street Journal (WSJ0) corpus. The multi-condition training set consists of 7137 utterances from 83 speakers. The speech utterances in the multi-condition training set are both clean and noisy. The noisy utterances were created by corrupting clean speech utterances with six different noises (airport, babble, car, restaurant, street, and train) at 10-20 dB signal to noise ratio (SNR). The evaluation set was derived from the WSJ0 5K-word closed-vocabulary test set, which consists of 330 utterances spoken by 8 speakers. This test set was recorded by a primary Sennheiser microphone and a secondary microphone. 14 test sets were created by corrupting these two sets with the same six noises used in the training set at 5-15 dB SNR. These 14 test sets can be grouped into 4 subsets: clean, noisy, clean with channel distortion, and noisy with channel distortion, which will be referred to as A, B, C, and D, respectively. All the data used for the experiments are sampled at 16 kHz.
For these tests, the acoustic models were CNNs consisting of 7 hidden layers, with the first two layers being convolutional layers followed by five fully-connected layers. The convolutional layers used 1-dimensional filters, applying convolutions and max-pooling on the frequency axis. Acoustic features for training the CNNs are 40-dimensional FBANK features together with their delta and acceleration coefficients. The features are spliced with 5 frames on each side of the current frame. Utterance-level mean normalization is performed on the static features. Multi-condition training data were used. The first convolutional layer has 128 filters of size 8. It is followed by a max-pooling layer with a pooling size of 3 and a pooling step of 3, and then a second convolutional layer. The second convolutional layer has 256 filters of size 4. The output of the second convolutional layer is passed to 5 fully-connected hidden layers. Each fully-connected hidden layer has 2048 nodes. The output layer consists of 2298 nodes, which is the number of tied context-dependent acoustic states. The state-level alignments for training all the CNN acoustic models are obtained from an HMM-GMM system trained on multi-condition training data using MFCC features. The task-standard WSJ0 bigram language model is used. Decoding is performed within the WFST framework. The training and decoding are performed using the Kaldi speech recognition toolkit.
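By way of illustration, the CNN acoustic model described above can be sketched as follows. PyTorch is used here as an assumption (the embodiment uses Kaldi for training and decoding but does not prescribe a toolkit for this sketch), and laying out the spliced static, delta and acceleration FBANK coefficients as 33 input channels over the 40 Mel bins is likewise an assumption about the implementation.

```python
# Sketch of the CNN acoustic model described above, written with PyTorch as an
# assumption. The layout of the spliced static/delta/acceleration FBANK
# coefficients as 33 input channels over 40 Mel bins is also an assumption.
import torch
import torch.nn as nn

class CnnAcousticModel(nn.Module):
    def __init__(self, num_states=2298):
        super().__init__()
        in_channels = 3 * 11                       # static/delta/accel x 11 spliced frames
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 128, kernel_size=8),   # 128 filters of size 8 (frequency axis)
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=3, stride=3),        # pooling size 3, pooling step 3
            nn.Conv1d(128, 256, kernel_size=4),           # 256 filters of size 4
            nn.ReLU(),
        )
        conv_out = 256 * 8                          # 40 -> 33 -> 11 -> 8 frequency bins
        layers, dims = [], [conv_out] + [2048] * 5  # five fully-connected layers of 2048 nodes
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU()]
        self.fc = nn.Sequential(*layers)
        self.out = nn.Linear(2048, num_states)      # 2298 tied context-dependent states

    def forward(self, x):                           # x: (batch, 33, 40)
        h = self.conv(x).flatten(1)
        return torch.log_softmax(self.out(self.fc(h)), dim=-1)
```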
Table 1 below shows word error rates (WERs) of single-stream ASR systems.
Method | A | B | C | D | Avg.
Baseline (no enhancement) | 4.09 | 7.43 | 7.94 | 18.61 | 12.02
Enhance with CMI | 4.04 | 7.37 | 7.70 | 18.88 | 12.09
Enhance with CMF | 4.00 | 7.60 | 8.28 | 19.03 | 12.29
Enhance with CMT | 3.94 | 7.22 | … | 18.51 | 11.82
Enhance with saliency map | 4.11 | 7.39 | 7.98 | 18.62 | 12.01
In single-stream ASR, the conspicuity map CMT computed from temporal contrast features provides a relative WER reduction of 2.4% compared to the baseline system where no spectrum enhancement is applied. This reduction is better than the reduction, if any, provided by the other maps.
Table 2: WERs of multi-stream ASR systems using posterior probabilities combination.
Method | A | B | C | D | Avg.
… | 4.00 | 7.18 | 7.90 | 18.56 | …
… | … | 7.21 | 7.45 | 18.56 | 11.76
CMI + CMT | 3.94 | 7.10 | 7.44 | 18.28 | …
CMT + CMI + CMF | … | 7.05 | 7.64 | 18.18 | …
CMT + CMI + CMF + Sal. | … | 6.99 | 7.53 | 18.06 | …
Table 3: WERs of multi-stream ASR systems using lattices combination.
Method | A | B | C | D | Avg.
CMF + … | … | 7.04 | 7.62 | 18.48 | 11.77
… | 4.02 | 7.05 | … | 18.28 | 11.66
CMI + CMT | … | 6.99 | 7.21 | … | 11.58
CMT + CMI + CMF | 3.89 | 6.95 | 7.19 | 18.16 | 11.54
CMT + CMI + CMF + Sal. | 3.83 | 6.87 | … | 18.04 | 11.47
In multi-stream ASR, combinations of two, three and four streams are performed.
The weights of the streams are varied, subject to their sum being equal to 1, to find the best weight for each stream. The best combination results and the weight of each stream used to obtain these results are shown in Tables 2 and 3.
In both posterior probabilities and lattices combination, the 2-stream combination using CMI and CMT provides better results than the other combinations using two conspicuity maps. The combination of three streams provides further gains in both posterior probabilities and lattices combination. The combination of the three conspicuity-map streams and the saliency map enhanced features gives the best results.
Relative WER reductions of 3.0% and 4.0% are obtained with 3-stream posterior probabilities and lattices combination, respectively. In both posterior probabilities and lattices combination, the WERs obtained with three streams are close to those obtained with two streams using CMT and CMI. The results in Tables 1, 2, and 3 show that, among the three features used to create the conspicuity maps, temporal contrast features are more relevant and frequency contrast features are less relevant for ASR.
In multi-stream ASR, relative WER reductions of 3.0% and 4.0% were obtained with the above method when posterior probabilities and lattices were combined, respectively, compared to the baseline single-stream ASR system in which no spectrum enhancement was applied. Experimental results showed that, among the three features used to create the conspicuity maps, temporal contrast features were more relevant and frequency contrast features were less relevant for ASR.
While certain arrangements have been described, these arrangements have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed the methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of methods and apparatus described herein may be made.

Claims (20)

1. A speech recognition apparatus, said apparatus comprising:
a processor, said processor being adapted to:
receive an audio signal comprising a speech signal;
process said speech signal to produce a map of said speech signal indicating salient parts of said signal;
enhance said received speech signal using said map;
determine words present in said speech signal; and
provide an output in response to said determined words.
2. An apparatus according to claim 1, wherein said processor is adapted to determine words present in said speech signal by extracting acoustic features from said speech signal and converting said acoustic features to acoustic units using an acoustic model, wherein said acoustic units comprise phonemes and/or graphemes.
3. An apparatus according to claim 2, wherein the processor is adapted to enhance said speech signal with said map prior to the extraction of acoustic features from said speech signal.
4. An apparatus according to either of claims 2 or 3, wherein said processor is adapted to produce a map of said speech signal indicating salient parts by filtering said speech signal using a filter to approximate the function of the auditory receptive fields of a brain.
5. An apparatus according to claim 4, wherein said processor is adapted to produce a map of said speech signal indicating salient parts by filtering said speech signal, wherein the speech signal is converted to the time/frequency domain prior to filtering.
6. An apparatus according to claim 5, wherein said processor is adapted to produce a map of said speech signal indicating said salient parts by taking a logarithm of the time/frequency domain signal and filtering this signal in parallel at a plurality of different scales to produce a plurality of filtered spectra, the filtered spectra being combined to produce the map.
7. An apparatus according to any of claims 4 to 6, wherein the filter is a Gabor filter.
8. An apparatus according to any of claims 4 to 6, wherein the filter is adapted to approximate the function of the auditory receptive fields in order to extract features relating to intensity, frequency contrast or temporal contrast.
9. An apparatus according to claim 8, wherein the processor is adapted to combine maps determined from filters relating to intensity, frequency contrast and temporal contrast into a single map for enhancing said received speech signal.
10. An apparatus according to any preceding claim, wherein said output is a text output of said determined words or wherein said determined words are provided to a dialogue system and said output is a response from said dialogue system or wherein said output is a control signal activated by said determined words.
11. A multi-stream speech recognition apparatus comprising a plurality of single stream speech recognition apparatus, wherein each single stream speech recognition apparatus is an apparatus according to either of claims 8 or 9, each single stream apparatus comprising a filter adapted for a different property or combination of properties to the filters for the other single streams.
12. A multi-stream speech recognition apparatus according to claim 11, wherein the outputs from the acoustic models from each single stream are combined prior to a common decoder for all streams that determines said words.
13. A multi-stream speech recognition apparatus according to claim 11, wherein each stream comprises a decoder that receives the output of the respective acoustic model and outputs a lattice, wherein the lattices produced for each stream are combined into a common lattice and said words are determined from said common lattice.
14. A mobile telephone having a speech recognition apparatus, said speech recognition apparatus being an apparatus according to any preceding claim.
15. A speech recognition apparatus according to any preceding claim, further comprising a dialogue management system said dialogue management system being adapted to provide a response to a query derived from the said speech signal.
16. A speech recognition apparatus according to claim 15, located in a station or other public place.
17. A speech recognition apparatus according to any preceding claim, further comprising a control system, said control system being adapted to provide a control signal in response to a query derived from the said speech signal.
18. A speech recognition method comprising:
receiving an audio signal comprising a speech signal;
processing said speech signal to produce a map of said speech signal indicating salient parts of said signal;
enhancing said received speech signal using said map;
determining words present in said speech signal; and
providing an output in response to said determined words.
19. A method of training a speech recognition apparatus, said method comprising:
receiving training data, said training data comprising a speech signal with corresponding text;
processing said speech signal to produce a map of said speech signal indicating salient parts of said signal;
enhancing said received speech signal using said map;
training an acoustic model that converts an input speech signal to text using said enhanced speech signal and corresponding text.
20. A carrier medium comprising computer readable code configured to cause a computer to perform the method of either of claims 18 or 19.
GB1804708.4A 2018-03-23 2018-03-23 A speech recognition method and apparatus Active GB2572222B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1804708.4A GB2572222B (en) 2018-03-23 2018-03-23 A speech recognition method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1804708.4A GB2572222B (en) 2018-03-23 2018-03-23 A speech recognition method and apparatus

Publications (3)

Publication Number Publication Date
GB201804708D0 GB201804708D0 (en) 2018-05-09
GB2572222A true GB2572222A (en) 2019-09-25
GB2572222B GB2572222B (en) 2021-04-28

Family

ID=62068296

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1804708.4A Active GB2572222B (en) 2018-03-23 2018-03-23 A speech recognition method and apparatus

Country Status (1)

Country Link
GB (1) GB2572222B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112786016B (en) * 2019-11-11 2022-07-19 北京声智科技有限公司 Voice recognition method, device, medium and equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020178002A1 (en) * 2001-05-24 2002-11-28 International Business Machines Corporation System and method for searching, analyzing and displaying text transcripts of speech after imperfect speech recognition
WO2013009949A1 (en) * 2011-07-13 2013-01-17 Dts Llc Microphone array processing system
WO2014062521A1 (en) * 2012-10-19 2014-04-24 Sony Computer Entertainment Inc. Emotion recognition using auditory attention cues extracted from users voice


Also Published As

Publication number Publication date
GB2572222B (en) 2021-04-28
GB201804708D0 (en) 2018-05-09
