WO2022122121A1 - End-to-end streaming acoustic trigger apparatus and method - Google Patents

End-to-end streaming acoustic trigger apparatus and method

Info

Publication number
WO2022122121A1
Authority
WO
WIPO (PCT)
Prior art keywords
acoustic
layer
attention
self
convolutional
Prior art date
Application number
PCT/EP2020/085015
Other languages
French (fr)
Inventor
Zixing ZHANG
Thorin FARNSWORTH
Senling Lin
Salah KAROUT
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/EP2020/085015 priority Critical patent/WO2022122121A1/en
Priority to EP20821179.7A priority patent/EP4238088A1/en
Publication of WO2022122121A1 publication Critical patent/WO2022122121A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/28Constructional details of speech recognition systems
    • G10L15/32Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems

Definitions

  • This invention relates to acoustic detection systems, such as voice trigger systems.
  • For voice trigger applications, which may also be referred to as keyword spotting or wakeup word detection, it is desirable to improve model performance not only in terms of accuracy and robustness, but also in terms of hardware efficiency due to the ‘always-on’ characteristics of such systems.
  • Reducing the storage requirements and computational costs of voice trigger models to fit the memory and energy constraints is therefore of significant importance.
  • Previous approaches for voice trigger can generally be grouped into filler-based and end-to-end approaches.
  • The former approaches regard all background noise and non-wakeup speech as fillers, and model both the wakeup words and the fillers, whereas the latter approaches model the offset of wakeup words and the other words.
  • Typical filler-based approaches seek help from automatic speech recognition (ASR) systems, where hidden Markov models (HMMs) are used to represent both the wakeup word (a.k.a. the keyword) and the background audio.
  • However, their performance depends heavily on the accuracy of the phoneme predictions.
  • The complexity of ASR systems also increases the deployment difficulty, due to their high memory and power requirements.
  • To overcome these issues, neural network-only based approaches have previously been proposed. These approaches utilize advanced deep learning models to predict the wakeup words frame-wise by stacking multiple acoustic frames as inputs. Then, a sliding window is applied to average the posteriors. Once the smoothed value surpasses a pre-defined threshold, a wakeup word may be detected.
  • However, such methods can suffer a decrease in voice trigger accuracy.
  • end-to-end approaches have gradually become the mainstream technology for voice trigger. Such approaches can straightforwardly estimate the wakeup point of keywords. Compared with the filler-based approaches, the end-to-end structure is simpler and may be more effective, as it directly optimizes the detection score. However, in current end-to-end models, the context information over sequences is not well explored for voice trigger.
  • The Transformer encoder, as described in Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I, “Attention is all you need”, Advances in neural information processing systems, 2017;30:5998-6008, as well as its variants such as Bert, as described in Devlin J, Chang MW, Lee K, Toutanova K, “Bert: Pre-training of deep bidirectional transformers for language understanding”, arXiv preprint arXiv:1810.04805, 2018, are commonly used in natural language processing (NLP).
  • The major advantage of Transformer is its efficiency in extracting context-dependent representations. Compared with Recurrent Neural Networks (RNNs), such as long short-term memory (LSTM) or Gated Recurrent Unit (GRU) RNNs, Transformer avoids the recurrent process, which is considered to be unfriendly for parallel computation when using a Graphics Processing Unit (GPU).
  • However, the vanilla Transformer encoder was designed without considering deployment of the model in an edge device. This issue largely impedes its applications, because these devices normally have strong storage and energy consumption limitations. Recently, much effort has been made toward compressing such models into smaller ones. In the context of voice trigger, nevertheless, these models are still larger than needed.
  • An apparatus for detecting a predetermined acoustic trigger in a stream of acoustic data, the apparatus comprising one or more processors configured to perform a detection method comprising: receiving sequential acoustic features extracted from a stream of acoustic signals; applying a sliding window to the acoustic features to identify a plurality of acoustic segments; performing convolutional processing on each acoustic segment to form intermediate data; and processing the intermediate data using a plurality of self-attention processes to form a series of signals, each signal corresponding to an acoustic characteristic and indicating an estimated likelihood of detection of that characteristic in the acoustic features.
  • the one or more processors may be configured to perform the self-attention processes using weighted networks with at least some of the weights of each such network being shared across multiple self-attention processes. This may allow the model size to be kept small whilst achieving improved results and keeping the latency low.
  • the one or more processors may be configured to perform the self-attention processes on intermediate data derived from a time-restricted subset of the stream of acoustic data and to thereby form an estimate of the likelihood of detection of the acoustic characteristics in that time-restricted subset.
  • Time-restricted (truncated) self-attention may therefore be utilized, which is capable of streaming inference and is efficient in terms of computational complexity.
  • the self-attention processes may be transformer processes. This may allow the process to explicitly capture context-dependent representations and may be friendly for parallel computation compared with other methods such as LSTM/GRU-RNN.
  • Each self-attention process may comprise, in order: a layer normalisation layer; a multi-head attention layer configured to operate on data extracted from the convolutional processing and the layer normalisation layer; one or more combining layers configured to combine and normalise outputs of the multi-head attention layer and the layer normalisation layer; and multiple convolutional layers configured to form convolution on outputs of the combining layers.
  • the multi-head attention layer may be configured to operate on an output of the layer normalisation layer using a key, query and value weight matrix to extract a semantic representation therefrom.
  • The key, query and value weight matrix may be shared across multiple time-restricted self-attention blocks. This may allow the model size to be kept small whilst achieving improved results and keeping the latency low.
  • the apparatus may be configured to compress the matrix by low-rank decomposition. This may further reduce the model size and improve efficiency.
  • the or each convolutional layer may be configured to perform group-separable convolution.
  • the or each convolutional layer may be configured to perform one-dimensional convolution.
  • the convolutional processing may comprise processing the intermediate data using a dilated causal convolutional layer. This may allow for encoding of the position of acoustic inputs and extraction of initial representations.
  • the acoustic trigger may be a voice trigger. This may allow the apparatus to be used in an electronic device incorporating a speech assistant function.
  • the apparatus may be configured to perform as a voice assistant in response to the signals indicating the presence in the stream of acoustic data of a series of acoustic characteristics that correspond to the acoustic trigger. This may allow the apparatus to be implemented in electronic devices having a voice assistant feature, such as smartphones.
  • the sliding window may be of a constant size. This may allow the process to be computationally efficient.
  • a computer-implemented method for detecting a predetermined acoustic trigger in a stream of acoustic data comprising: receiving a stream of acoustic data comprising a plurality of time-separated acoustic features; performing convolutional processing on each acoustic segment to form intermediate data; and processing the intermediate data using a plurality of self-attention processes to form a series of signals, each signal corresponding to an acoustic characteristic and indicating an estimated likelihood of detection of that characteristic in the acoustic features.
  • Figure 1 illustrates a flowchart describing the operation of the system described herein.
  • Figure 2 schematically illustrates an example of the voice trigger behaviour in the training and inference stages.
  • Figure 3 shows a block diagram of the dilated residual causal convolutional components.
  • Figure 4 shows a block diagram of the time-restricted self-attention components with attention weights sharing strategy.
  • Figure 5 shows a block diagram of the time-restricted self-attention block with three different model compression approaches.
  • Figure 6 schematically illustrates the low-rank decomposition.
  • Figure 7 shows a block diagram illustrating a group separable convolution layer.
  • Figure 8 schematically illustrates an example of a computer-implemented method for detecting a predetermined acoustic trigger in a stream of acoustic data.
  • Figure 9 schematically illustrates an example of an apparatus for detecting a predetermined acoustic trigger in a stream of acoustic data.
  • Described herein is an end-to-end voice trigger system based on a Transformer encoder, referred to as “WakeupNet”.
  • This system performs end-to-end training and inference processes for wakeup word detection.
  • the system may achieve an accurate prediction using only a small footprint.
  • the system can exploit the context-capturing capability of Transformer, as sequential information is important for wakeup-word detection.
  • the conventional Transformer encoder is too large to fit this task.
  • different model compression approaches are introduced to redesign the traditional vanilla Transformer encoder into a smaller but efficient one, referred to herein as mobile-Transformer.
  • the system takes an end-to-end voice trigger structure where only the endpoints of wakeup words (i.e. , the quite short region right after each wakeup word) will be annotated as positive labels. All other regions may be annotated as negative labels, optionally including the wakeup word itself.
  • Such an end-to-end framework may make the model optimisation more straightforward and may avoid any other intermediate prediction steps.
  • the approach described herein does not depend on the phoneme prediction accuracy of an ASR system and can avoid the need to deploy a complex ASR system into devices having computational-resource constraints.
  • the framework described herein uses a Transformer encoder as its backbone.
  • Transformer is capable of capturing the context-dependent representations by using a self-attention mechanism.
  • Transformer has no recurrent operations.
  • GPU Graphic Processing Unit
  • the basic structure of mobile-Transformer comprises three components - dilated residual causal convolutional blocks, time-restricted self-attention blocks, and a linear decoder.
  • the convolutional blocks are used to encode the position of acoustic inputs and extract initial representations.
  • the self-attention blocks are mainly deployed for capturing the context dependence, as discussed above.
  • The linear decoder is used to provide a final prediction. Compared with the conventional Transformer encoder that is mainly used for NLP, the mobile-Transformer described herein is suitable for use in the speech domain.
  • Compared to the conventional Transformer encoder, which is often too large to be used for the task of voice trigger, the mobile-Transformer holds fewer weights but still exhibits efficient performance.
  • a compression approach can be utilized - attention weight sharing. Attention weights can be shared across blocks so as to significantly reduce the memory storage requirement for model saving.
  • the apparatus can advantageously perform self-attention processes using weighted networks with at least some of the weights of each such network being shared across multiple self-attention processes.
  • a low-rank decomposition and group separable convolution can also be introduced to replace the conventional attention weights and feedforward layers, to further reduce the model size. A much smaller model can be achieved with competitive performance when compared to the conventional Transformer encoder.
  • the end-to-end system only annotates the wakeup point of the keywords as the training target in the training process.
  • The end-to-end system uses an additional average smoothing function to reduce the fluctuation of prediction logits in the inference stage.
  • the average smoothing function may be the arithmetic mean, or the weighted linear/nonlinear mean.
  • The endpoint of a wakeup word is much shorter than the remaining region. This can lead to a significant data imbalance issue when only annotating the end region as positive and the rest as negative labels. To overcome this issue, the end-to-end system repeats the positive annotations several frames before or after the endpoint.
  • The end-to-end system uses focal loss to further deal with the data imbalance issue (described below) and to explore the hard samples in the training process.
  • the system can use a variety of online data augmentation approaches in the training process, including adding additive noise and convolutional noise, changing speech speed, applying specAugmentation and specCutout.
  • FIG. 1 illustrates a flowchart of the WakeupNet voice trigger system.
  • a feature extraction unit converts the audio stream into sequential acoustic features at 102.
  • the extracted acoustic features are segmented consecutively, preferably by a fixed window size with a fixed step size, into smaller segments at 103.
  • each segment is then fed into the mobile-Transformer model for prediction.
  • the obtained sequential predictions can then be smoothed by a smoothing unit at 105.
  • the system finally determines whether the smoothed predictions are above a predefined threshold at 106. If yes, it triggers the system; otherwise, it keeps monitoring wakeup words at 107.
  • FIG. 2 provides more detailed illustration of the WakeupNet voice trigger system by using mobile-Transformer as the system backbone.
  • the mobile-Transformer 202 comprises M stacked dilated residual causal convolutional blocks 205, N stacked time-restricted self-attention blocks 206, and a linear decoder 207.
  • The dilated residual causal convolution blocks 205 are mainly responsible for position encoding and initial acoustic representation extraction.
  • the attention blocks 206 are to capture the context-dependence information over the sequence.
  • the linear decoder 207 is used to give a final prediction.
  • One example of the acoustic feature type is logMel filter banks (LMFBs) with a dimension of 40. The LMFBs are extracted frame-wise with a typical window size of 25 ms and a step size of 10 ms. Therefore, in this implementation, 100 frames can be extracted within a one-second utterance.
  • In the training stage, positive samples (i.e., utterances that contain a wakeup word) and negative samples (i.e., utterances that do not include any wakeup words) are required.
  • the endpoint of a wakeup word is first determined.
  • an approach can be used that combines ASR and voice activity detection (VAD) methods.
  • the ASR system is preferably well trained on a large-scale English dataset beforehand, and is then used to predict the phoneme information of the keywords along with the timestamp.
  • the VAD approach can then be used to estimate the end timestamp of the keyword utterances.
  • the VAD approach can be utilized because most positive samples used for training are collected in a quiet scenario.
  • The end timestamp is then converted into the frame order. With the help of the ASR and VAD approaches, this avoids annotating the data manually, which can be time- and cost-consuming.
  • One example of the annotations for one positive sample is shown in Figure 2 (204).
  • The annotations are composed of ‘0’s, (L + 1 + R) ‘1’s, and ‘0’s.
  • L and R denote the number of times the positive annotation ‘1’ is repeated to the left and to the right of the frame index of the keyword endpoint. This annotation repetition strategy helps to relieve the data imbalance problem.
  • focal loss (see Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollar, P (2017), “Focal loss for dense object detection”, Proceedings of the IEEE international conference on computer vision, pp. 2980-2988) is selected as the objective function to calculate the distribution distance between predictions and targets, due to its effectiveness when training with unbalanced data over class.
  • The hyperparameter 0 < α < 1 controls the importance of positive/negative samples in a linear way, whilst it does not make any difference between easy/hard samples.
  • the sequential acoustic features are extracted from the streaming audio, and then split into consecutive segments by a preferably fixed-length window size and preferably a fixed step size.
  • The window size is determined by the perception field of the mobile-Transformer, and the step size determines how frequently keyword detection is applied.
  • Each segment is then fed into the mobile-Transformer network for inference, and a prediction y_i is obtained.
  • smoothing is applied to reduce the vibration of predictions.
  • the smoothing function can be an average, a linear/nonlinear weighted, or an inversed cosine weighted function.
  • Each dilated residual causal convolutional block 300 contains a dilated causal convolutional layer 301, a pooling layer 302 and an add and layer normalization layer 303.
  • The dilated convolution aims to increase the perception field of the network, whilst the causal convolution forces the network to look back only at its historical information and thus reduces the latency.
  • the outputs of the last dilated residual causal convolutional block are then fed into the attention blocks (206 in Figure 2).
  • the attention blocks contain a series of identical time-restricted self-attention blocks 401 but with shared attention weights, as illustrated at 402.
  • the self-attention block does not use recurrent operations, which are considered to be unfriendly for parallel computation when using a GPU. Thus, this largely facilitates the model training and inference process.
  • One self-attention block is illustrated in Figure 5.
  • Each self-attention process comprises, in order: a layer normalisation layer 501 ; a multi-head attention layer 502 configured to operate on data extracted from the convolutional processing and the layer normalisation layer; one or more combining layers 503 configured to combine and normalise outputs of the multi-head attention layer and the layer normalisation layer; and multiple convolutional layers configured to form convolution on outputs of the combining layers 503.
  • the multi-head attention layer 502 is configured to operate on an output of the layer normalisation layer 501 using a key, query and value weight matrix to extract a semantic representation therefrom.
  • the key, query and value weight matrix is shared across multiple time-restricted self-attention blocks.
  • the attention block firstly applies layer normalization at 501 , then projects the input to queries (Q), keys (K), and values (V).
  • Attention is a combination of Q and K, where Q acts as the giver and K acts as the receiver, and is defined as the inner product of Q and K divided by the square root of their dimension.
  • Each input unit creates its attention toward all units by providing Q to match K.
  • V can be seen as an information extractor that extracts a unique value based on the attention over its input representations.
  • the output units are obtained by summarizing the unique values over the sequences, indicated at 503. In doing this, each obtained representation implicitly contains the semantic information over the whole sequence.
  • a multihead attention strategy is applied as well by splitting queries, keys, and values into several parts.
  • A feedforward layer 504 with ReLU activation functions can follow to increase the non-linear learning capability of the blocks.
  • residual adding is applied to deal with the gradient vanishing problem when increasing the depth of networks, while layer normalising is for reducing the internal covariate shift, as shown at 505.
  • the system also uses group separable convolution layers 506 and 507 to replace the last two feedforward layers in self-attention blocks.
  • time-restricted (truncated) self-attention is preferably utilised because of i) its capability of streaming inference; and ii) its efficiency of computational complexity.
  • The model structure of the present mobile-Transformer is redesigned to compress the model into as small a size as possible, whilst retaining its performance efficiency. It is found that the major weights of the conventional Transformer encoder come from its stacked structure, the attention layers, as well as the feedforward layers. In the following, three different approaches are introduced to compress these three parts, respectively.
  • A cross-layer parameter sharing strategy is employed for mobile-Transformer, as shown in Figure 4.
  • the attention parameters are shared across blocks (402).
  • Neither the all-shared strategy nor the feedforward-shared strategy is used.
  • the motivation of the cross-layer parameter sharing is that the semantic relationship among the sequence is supposed to be similar although in different layers. By doing this, the number of attention weights can be significantly reduced to 1/N of the original size.
  • The V, K, and Q weight matrices each contain d × d values. This also contributes greatly to the entire model size.
  • Low-rank decomposition (LRD) maps high-dimensional vectors into a low-rank sub-space, and then reconstructs them into another high-dimensional vector with minimum information loss.
  • the use of LRD for matrix compression may save storage and reduce computational complexity.
  • A bottleneck layer is inserted between the input and output layers to simulate LRD, such that the number of attention weights d × d becomes 2 × d × r, where r is the dimension of the bottleneck layer (see the LRD sketch after this list).
  • group convolution and separable convolution may be used, as illustrated in Figure 7.
  • Group convolution and separable convolution, which were first proposed in AlexNet and MobileNet respectively (see Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., and Adam, H. (2017), “Mobilenets: Efficient convolutional neural networks for mobile vision applications”, arXiv preprint arXiv:1704.04861), are considered as two alternative convolutional methods.
  • In group convolution, the filters are separated into different groups, as illustrated at 701.
  • Each group is responsible for conventional convolutions with a certain depth.
  • the group convolution is efficient in training as i) it shrinks the original convolution; and ii) it allows the model training over multiple GPUs in a parallel fashion.
  • Each filter group is able to learn a unique representation of the data.
  • The separable convolution normally contains two steps: i) a depthwise convolution 702, where a single filter is applied per input channel, and ii) a pointwise convolution 703, where a simple 1×1 convolution is then used to create a linear combination of the output of the depthwise layer.
  • Finally, a shuffle operation 704 is conducted to avoid channel separation (see the group separable convolution sketch after this list).
  • Figure 8 shows an example of a computer-implemented method 800 for detecting a predetermined acoustic trigger in a stream of acoustic data.
  • the method comprises receiving a stream of acoustic data comprising a plurality of time-separated acoustic features.
  • the method comprises applying a sliding window to the acoustic features to identify a plurality of acoustic segments.
  • the method comprises performing convolutional processing on each acoustic segment to form intermediate data.
  • The method comprises processing the intermediate data using a plurality of self-attention processes to form a series of signals, each signal corresponding to an acoustic characteristic and indicating an estimated likelihood of detection of that characteristic in the acoustic features.
  • Figure 9 is a schematic representation of an apparatus 900 configured to perform the methods described herein.
  • the apparatus 900 may be implemented on a device, such as a laptop, tablet, smart phone or TV.
  • the apparatus 900 comprises a processor 901 configured to process the sequential acoustic features in the manner described herein.
  • the processor 901 may be implemented as a computer program running on a programmable device such as a Central Processing Unit (CPU).
  • the apparatus 900 also comprises a memory 902 which is arranged to communicate with the processor 901.
  • Memory 902 may be a non-volatile memory.
  • the processor 901 may also comprise a cache (not shown in Figure 9), which may be used to temporarily store data from memory 902.
  • the system may comprise more than one processor and more than one memory.
  • the memory may store data that is executable by the processor.
  • the processor may be configured to operate in accordance with a computer program stored in non-transitory form on a machine readable storage medium.
  • the computer program may store instructions for causing the processor to perform its methods in the manner described herein.
  • the system uses a convolutional network as the position encoder and the initial feature extractor.
  • the system preferably utilises a causal convolutional network to allow it to better deal with streaming audio signals.
  • The system preferably employs a time-restricted Transformer to reduce computational cost and reduce the latency.
  • the Transformer encoder is thus redesigned to make it hardware efficient.
  • The system can adopt the attention-weight sharing strategy across blocks to significantly reduce the model size.
  • The system can use low-rank decomposition to replace the attention weight matrices.
  • The Transformer encoder can be pre-trained on a large-scale unlabelled dataset by self-supervised learning.
  • The self-supervised learning may include, but is not limited to, i) generative approaches, such as input reconstruction after masking/replacing a certain percentage of input features or distorting the input features by noise, and ii) contrastive approaches, such as contrastive predictive coding from raw audio.
  • the acoustic trigger is preferably a voice trigger, as exemplified in the examples described above.
  • the apparatus may advantageously be configured to perform as a voice assistant in response to the signals indicating the presence in the stream of acoustic data of a series of acoustic characteristics that correspond to the acoustic trigger.
  • other acoustic triggers may be determined.
  • the apparatus may be configured to cause a device implementing the apparatus, or connected to the apparatus, to perform a function in response to the signals indicating the presence in the stream of acoustic data of a series of acoustic characteristics that correspond to the acoustic trigger. For example, the apparatus may cause the device to ‘wake up’ (for example, enter a mode in which the device is fully operational from a low-power mode) or perform some other function.
  • the device may implement a further detection method that is more computationally intensive. For example, when the device has been ‘woken up’ and is in a higher power mode, the device may use a different acoustic detection method to that described above as part of a voice assistant function.
  • the system employs a Transformer encoder as its backbone in the application of voice trigger.
  • the system uses a convolutional network as the position encoder and the initial feature extractor.
  • the system utilises a causal convolutional network to allow it to better deal with streaming audio signals.
  • The system can employ a time-restricted Transformer to further reduce computational cost and latency.
  • the Transformer encoder is thus redesigned to make it hardware efficient. Sharing attention weights may significantly reduce the model size.
  • The system can also use low-rank decomposition for the attention weights, and group separable convolution layers to replace the feedforward layers in the self-attention blocks.
  • the mobile-transformer architecture described herein may significantly reduce the model size and support streaming voice trigger, with low latency and good performance.
  • When compared with other state-of-the-art models having similar model sizes for voice trigger on the HiMia dataset, the mobile-Transformer has, in some implementations, significantly outperformed other models in both clean and noisy conditions and is more robust than other models in noisy scenarios.
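LRD sketch referenced above: a PyTorch illustration of replacing one d × d projection with a rank-r bottleneck, so that d × d weights become 2 × d × r. The function name and dimensions are assumptions for illustration, not the patent's implementation.

```python
import torch.nn as nn


def low_rank_linear(d, r):
    """Approximate one d x d projection with a rank-r bottleneck (cf. Figure 6)."""
    return nn.Sequential(
        nn.Linear(d, r, bias=False),   # project into the low-rank sub-space
        nn.Linear(r, d, bias=False),   # reconstruct the high-dimensional vector
    )
```

Group separable convolution sketch referenced above: a grouped convolution followed by a pointwise 1×1 convolution and a channel shuffle, in the spirit of Figure 7. Channel count, number of groups and kernel size are illustrative assumptions.

```python
import torch.nn as nn


class GroupSeparableConv1d(nn.Module):
    def __init__(self, channels=128, kernel_size=3, groups=4):
        super().__init__()
        self.groups = groups
        self.grouped = nn.Conv1d(channels, channels, kernel_size,
                                 padding=kernel_size // 2, groups=groups)  # grouped/depthwise step
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)      # pointwise 1x1 step

    def forward(self, x):                      # x: (batch, channels, time)
        y = self.pointwise(self.grouped(x))
        b, c, t = y.shape                      # channel shuffle to avoid
        y = y.view(b, self.groups, c // self.groups, t)   # channel-group separation
        return y.transpose(1, 2).reshape(b, c, t)
```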

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Complex Calculations (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

Described is an apparatus (900) and method (800) for detecting a predetermined acoustic trigger in a stream of acoustic data. The apparatus comprises one or more processors (901) configured to perform a detection method (800) comprising: receiving (801) sequential acoustic features (201) extracted from a stream of acoustic signals; applying (802) a sliding window to the acoustic features to identify a plurality of acoustic segments; performing (803) convolutional processing (205) on each acoustic segment to form intermediate data; and processing (804) the intermediate data using a plurality of self-attention processes (206) to form a series of signals (203), each signal corresponding to an acoustic characteristic and indicating an estimated likelihood of detection of that characteristic in the acoustic features. This may significantly reduce the model size required and support streaming voice trigger, with low latency and good performance.

Description

END-TO-END STREAMING ACOUSTIC TRIGGER APPARATUS AND METHOD
FIELD OF THE INVENTION
This invention relates to acoustic detection systems, such as voice trigger systems.
BACKGROUND
For voice trigger applications, which may also be referred to as keyword spotting or wakeup word detection, it is desirable to improve model performance not only in terms of accuracy and robustness, but also in terms of hardware efficiency due to the ‘always-on’ characteristics of such systems. Thus, reducing the storage requirements and computational costs of voice trigger models to fit the memory and energy constraints is of significant importance.
Previous approaches for voice trigger can generally be grouped into filler-based and end-to-end approaches. The former approaches regard all background noise and non-wakeup speech as fillers, and model both the wakeup words and the fillers, whereas the latter approaches model the offset of wakeup words and the other words.
Typical filler-based approaches seek help from automatic speech recognition (ASR) systems, where hidden Markov models (HMMs) are used to represent both the wakeup word (a.k.a. the keyword) and the background audio. However, their performance depends heavily on the accuracy of the phoneme predictions. The complexity of ASR systems also increases the deployment difficulty, due to their high memory and power requirements.
To overcome these issues, neural network-only based approaches have previously been proposed. These approaches utilize advanced deep learning models to predict the wakeup words frame-wise by stacking multiple acoustic frames as inputs. Then, a sliding window is applied to average the posteriors. Once the smoothed value surpasses a pre-defined threshold, a wakeup word may be detected. However, such methods can suffer a decrease in voice trigger accuracy.
More recently, end-to-end approaches have gradually become the mainstream technology for voice trigger. Such approaches can straightforwardly estimate the wakeup point of keywords. Compared with the filler-based approaches, the end-to-end structure is simpler and may be more effective, as it directly optimizes the detection score. However, in current end-to-end models, the context information over sequences is not well explored for voice trigger. The Transformer encoder, as described in Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I, “Attention is all you need”, Advances in neural information processing systems, 2017;30:5998-6008, as well as its variants such as Bert, as described in Devlin J, Chang MW, Lee K, Toutanova K, “Bert: Pre-training of deep bidirectional transformers for language understanding”, arXiv preprint arXiv:1810.04805, 2018, are commonly used in natural language processing (NLP). The major advantage of Transformer is its efficiency in extracting context-dependent representations. It can explicitly explore the context dependence over a long sequence by a self-attention mechanism. Compared with Recurrent Neural Networks (RNNs), such as long short-term memory (LSTM) or Gated Recurrent Unit (GRU) RNNs, Transformer avoids the recurrent process, which is considered to be unfriendly for parallel computation when using a Graphics Processing Unit. Thus, it largely facilitates the model training and inference processes. Encouraged by its great success in NLP, the Transformer encoder has recently attracted increasing attention in other domains, such as computer vision and speech processing.
However, the vanilla Transformer encoder was designed without considering deployment of the model in an edge device. This issue largely impedes its applications, because these devices normally have strong storage and energy consumption limitations. Recently, much effort has been made toward compressing such models into smaller ones. In the context of voice trigger, nevertheless, these models are still larger than needed.
It is desirable to develop an apparatus and method for voice trigger applications that overcomes these problems.
SUMMARY OF THE INVENTION
According to one aspect there is provided an apparatus for detecting a predetermined acoustic trigger in a stream of acoustic data, the apparatus comprising one or more processors configured to perform a detection method comprising: receiving sequential acoustic features extracted from a stream of acoustic signals; applying a sliding window to the acoustic features to identify a plurality of acoustic segments; performing convolutional processing on each acoustic segment to form intermediate data; and processing the intermediate data using a plurality of self-attention processes to form a series of signals, each signal corresponding to an acoustic characteristic and indicating an estimated likelihood of detection of that characteristic in the acoustic features.
This may significantly reduce the model size required and support streaming voice trigger, with low latency and good performance. The one or more processors may be configured to perform the self-attention processes using weighted networks with at least some of the weights of each such network being shared across multiple self-attention processes. This may allow the model size to be kept small whilst achieving improved results and keeping the latency low.
The one or more processors may be configured to perform the self-attention processes on intermediate data derived from a time-restricted subset of the stream of acoustic data and to thereby form an estimate of the likelihood of detection of the acoustic characteristics in that time-restricted subset. Time-restricted (truncated) self-attention may therefore be utilized, which is capable of streaming inference and is efficient in terms of computational complexity.
The self-attention processes may be transformer processes. This may allow the process to explicitly capture context-dependent representations and may be friendly for parallel computation compared with other methods such as LSTM/GRU-RNN.
Each self-attention process may comprise, in order: a layer normalisation layer; a multi-head attention layer configured to operate on data extracted from the convolutional processing and the layer normalisation layer; one or more combining layers configured to combine and normalise outputs of the multi-head attention layer and the layer normalisation layer; and multiple convolutional layers configured to form convolution on outputs of the combining layers.
The multi-head attention layer may be configured to operate on an output of the layer normalisation layer using a key, query and value weight matrix to extract a semantic representation therefrom.
The key, query and value weight matrix may be shared across multiple time-restricted selfattention blocks. This may allow the model size to be kept small whilst achieving improved results and keeping the latency low.
The apparatus may be configured to compress the matrix by low-rank decomposition. This may further reduce the model size and improve efficiency.
The or each convolutional layer may be configured to perform group-separable convolution.
This may further reduce the model size and improve efficiency. The or each convolutional layer may be configured to perform one-dimensional convolution. The convolutional processing may comprise processing the intermediate data using a dilated causal convolutional layer. This may allow for encoding of the position of acoustic inputs and extraction of initial representations.
The acoustic trigger may be a voice trigger. This may allow the apparatus to be used in an electronic device incorporating a speech assistant function.
The apparatus may be configured to perform as a voice assistant in response to the signals indicating the presence in the stream of acoustic data of a series of acoustic characteristics that correspond to the acoustic trigger. This may allow the apparatus to be implemented in electronic devices having a voice assistant feature, such as smartphones.
The sliding window may be of a constant size. This may allow the process to be computationally efficient.
According to a second aspect there is provided a computer-implemented method for detecting a predetermined acoustic trigger in a stream of acoustic data, the method comprising: receiving a stream of acoustic data comprising a plurality of time-separated acoustic features; performing convolutional processing on each acoustic segment to form intermediate data; and processing the intermediate data using a plurality of self-attention processes to form a series of signals, each signal corresponding to an acoustic characteristic and indicating an estimated likelihood of detection of that characteristic in the acoustic features.
This may significantly reduce the model size required and support streaming voice trigger, with low latency and good performance.
BRIEF DESCRIPTION OF THE FIGURES
The present invention will now be described by way of example with reference to the accompanying drawings.
In the drawings:
Figure 1 illustrates a flowchart describing the operation of the system described herein.
Figure 2 schematically illustrates an example of the voice trigger behaviour in the training and inference stages.
Figure 3 shows a block diagram of the dilated residual causal convolutional components.
Figure 4 shows a block diagram of the time-restricted self-attention components with attention weights sharing strategy.
Figure 5 shows a block diagram of the time-restricted self-attention block with three different model compression approaches.
Figure 6 schematically illustrates the low-rank decomposition.
Figure 7 shows a block diagram illustrating a group separable convolution layer.
Figure 8 schematically illustrates an example of a computer-implemented method for detecting a predetermined acoustic trigger in a stream of acoustic data.
Figure 9 schematically illustrates an example of an apparatus for detecting a predetermined acoustic trigger in a stream of acoustic data.
DETAILED DESCRIPTION
Described herein is an end-to-end voice trigger system based on a Transformer encoder, referred to as “WakeupNet”. This system performs end-to-end training and inference processes for wakeup word detection. The system may achieve an accurate prediction using only a small footprint.
The system can exploit the context-capturing capability of Transformer, as sequential information is important for wakeup-word detection. However, the conventional Transformer encoder is too large to fit this task. To address this issue, different model compression approaches are introduced to redesign the traditional vanilla Transformer encoder into a smaller but efficient one, referred to herein as mobile-Transformer.
The system takes an end-to-end voice trigger structure where only the endpoints of wakeup words (i.e. , the quite short region right after each wakeup word) will be annotated as positive labels. All other regions may be annotated as negative labels, optionally including the wakeup word itself. Such an end-to-end framework may make the model optimisation more straightforward and may avoid any other intermediate prediction steps. Thus, compared with the prior filler-based voice trigger approaches, the approach described herein does not depend on the phoneme prediction accuracy of an ASR system and can avoid the need to deploy a complex ASR system into devices having computational-resource constraints.
As mentioned previously, the framework described herein uses a Transformer encoder as its backbone. Transformer is capable of capturing the context-dependent representations by using a self-attention mechanism. Transformer has no recurrent operations. Thus, it supports parallel computation well when using a Graphic Processing Unit (GPU), such that it facilitates speed in both the training and inference stages.
In the preferred implementation, the basic structure of mobile-Transformer comprises three components - dilated residual causal convolutional blocks, time-restricted self-attention blocks, and a linear decoder.
The convolutional blocks are used to encode the position of acoustic inputs and extract initial representations. The self-attention blocks are mainly deployed for capturing the context dependence, as discussed above. The linear decoder is used to provide a final prediction. Compared with the conventional Transformer encoder that is mainly used for NLP, the mobile-Transformer described herein is suitable for use in the speech domain.
Compared to the conventional Transformer encoder, which is often too large to be used in the task of voice trigger, the mobile-Transformer holds fewer weights but still exhibits efficient performance. To assist this, a compression approach can be utilized - attention weight sharing. Attention weights can be shared across blocks so as to significantly reduce the memory storage requirement for model saving. As will be described in more detail below, the apparatus can advantageously perform self-attention processes using weighted networks with at least some of the weights of each such network being shared across multiple self-attention processes. A low-rank decomposition and group separable convolution can also be introduced to replace the conventional attention weights and feedforward layers, to further reduce the model size. A much smaller model can be achieved with competitive performance when compared to the conventional Transformer encoder.
In the preferred implementation, the end-to-end system only annotates the wakeup point of the keywords as the training target in the training process. Optionally, the end-to-end system uses an additional average smoothing function to reduce the fluctuation of prediction logits in the inference stage. The average smoothing function may be the arithmetic mean, or the weighted linear/nonlinear mean. In the training dataset, the endpoint of a wakeup word is much shorter than the remaining region. This can lead to a significant data imbalance issue when only annotating the end region as positive and the rest as negative labels. To overcome this issue, the end-to-end system repeats the positive annotations several frames before or after the endpoint. The end-to-end system uses focal loss to further deal with the data imbalance issue (described below) and to explore the hard samples in the training process. The system can use a variety of online data augmentation approaches in the training process, including adding additive noise and convolutional noise, changing speech speed, and applying specAugmentation and specCutout.
Figure 1 illustrates a flowchart of the WakeupNet voice trigger system. Once the smartphone, or any other electronic device having a microphone, receives an audio stream at 101 , a feature extraction unit converts the audio stream into sequential acoustic features at 102. The extracted acoustic features are segmented consecutively, preferably by a fixed window size with a fixed step size, into smaller segments at 103. At 104, each segment is then fed into the mobile-Transformer model for prediction. The obtained sequential predictions can then be smoothed by a smoothing unit at 105. The system finally determines whether the smoothed predictions are above a predefined threshold at 106. If yes, it triggers the system; otherwise, it keeps monitoring wakeup words at 107.
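To make the flow of Figure 1 concrete, the following is a minimal Python sketch of the detection loop, assuming a generic `model` callable that returns a wakeup posterior for each segment; the function name, window/step defaults and threshold are illustrative assumptions rather than the patent's reference implementation.

```python
from collections import deque

import numpy as np


def detect_stream(feature_frames, model, window=100, step=10,
                  smooth_len=5, threshold=0.8):
    """Slide a fixed window over sequential acoustic features (103), run the
    model on each segment (104), smooth the posteriors (105) and compare the
    smoothed score against a predefined threshold (106)."""
    recent = deque(maxlen=smooth_len)                    # buffer for smoothing
    for start in range(0, len(feature_frames) - window + 1, step):
        segment = feature_frames[start:start + window]   # fixed window, fixed step
        score = float(model(segment))                    # posterior for the wakeup class
        recent.append(score)
        smoothed = float(np.mean(recent))                # arithmetic-mean smoothing
        if smoothed > threshold:
            return True, start + window                  # trigger, with frame index
    return False, None                                   # keep monitoring (107)
```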
Figure 2 provides a more detailed illustration of the WakeupNet voice trigger system by using mobile-Transformer as the system backbone. Given an input of sequential acoustic features {x_t, t = 0, ..., T}, indicated at 201, and corresponding targets (labels/annotations) {y_i ∈ [0, 1], i = 0, ..., I}, indicated at 204, where T and I are the corresponding acoustic frame and label numbers respectively, the system aims to find a nonlinear mapping function f, represented by the mobile-Transformer 202, which is able to accurately and promptly detect the wakeup words from the sequential acoustic features.
In this example, the mobile-Transformer 202 comprises M stacked dilated residual causal convolutional blocks 205, N stacked time-restricted self-attention blocks 206, and a linear decoder 207. The dilated residual causal convolution blocks 205 are mainly responsible for position encoding and initial acoustic representation extraction. The attention blocks 206 are to capture the context-dependence information over the sequence. The linear decoder 207 is used to give a final prediction.
One example of the acoustic feature type is logMel filter banks (LMFBs) with a dimension of 40. The LMFBs are extracted frame-wise with a typical window size of 25 ms and a step size of 10 ms. Therefore, in this implementation, 100 frames can be extracted within a one-second utterance. In the training stage, positive samples (i.e., the utterances that contain a wakeup word) and negative samples (i.e., the utterances that do not include any wakeup words) are required. For negative samples, the negative class ‘0’ is spread throughout the whole utterance. For positive samples, it is of importance to annotate the timestamp of the endpoint of keywords to configure the labels. Only the end duration of keywords is labelled as the positive class ‘1’, and the remainder of the duration is labelled as the negative class ‘0’. Such an annotation approach is helpful for at least two reasons: i) it directly optimises the detection task and avoids any intermediate components compared with the filler-based systems; ii) it ultimately avoids premature or delayed triggering.
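As an illustration of this feature front end (40-dimensional LMFBs, 25 ms window, 10 ms step), the sketch below uses librosa; the choice of library, the sampling rate and the function name are assumptions, since the text does not specify an implementation.

```python
import librosa


def extract_lmfb(wav_path, sr=16000, n_mels=40):
    """Return logMel filter bank features of shape (frames, n_mels)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=int(0.025 * sr),         # 25 ms analysis window
        hop_length=int(0.010 * sr),    # 10 ms step -> roughly 100 frames per second
        n_mels=n_mels)
    return librosa.power_to_db(mel).T  # log compression, then (frames, 40)
```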
In order to annotate data as positive “1” (i.e., the endpoint of wakeup words) and negative “0”, the endpoint of a wakeup word is first determined. To obtain the endpoint of the keyword in positive samples, an approach can be used that combines ASR and voice activity detection (VAD) methods. The ASR system is preferably well trained on a large-scale English dataset beforehand, and is then used to predict the phoneme information of the keywords along with the timestamp. When the prediction score from the ASR system is low, the VAD approach can then be used to estimate the end timestamp of the keyword utterances. The VAD approach can be utilized because most positive samples used for training are collected in a quiet scenario. The end timestamp is then converted into the frame order. With the help of the ASR and VAD approaches, this avoids annotating the data manually, which can be time- and cost-consuming.
One example of the annotations for one positive sample is shown in Figure 2 (204). Here, the annotations are composed of ‘0’s, (L + 1 + R) ‘1’s, and ‘0’s. L and R denote the number of times the positive annotation ‘1’ is repeated to the left and to the right of the frame index of the keyword endpoint. This annotation repetition strategy is helpful to relieve the data imbalance problem.
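A small sketch of this annotation-repetition strategy is shown below; `make_frame_labels`, its arguments and the default L/R values are hypothetical names chosen for illustration.

```python
import numpy as np


def make_frame_labels(num_frames, endpoint_frame=None, L=10, R=10):
    """Frame-level targets: '1' at the keyword endpoint and for L frames before
    and R frames after it, '0' everywhere else (and for negative samples)."""
    labels = np.zeros(num_frames, dtype=np.float32)
    if endpoint_frame is not None:                       # positive sample
        lo = max(0, endpoint_frame - L)
        hi = min(num_frames, endpoint_frame + R + 1)
        labels[lo:hi] = 1.0                              # (L + 1 + R) positive '1's
    return labels
```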
Preferably, to optimise the network, focal loss (see Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017), “Focal loss for dense object detection”, Proceedings of the IEEE international conference on computer vision, pp. 2980-2988) is selected as the objective function to calculate the distribution distance between predictions and targets, due to its effectiveness when training with data that is unbalanced over classes. In the focal loss, the hyperparameter 0 < α < 1 controls the importance of positive/negative samples in a linear way, whilst it does not make any difference between easy/hard samples. In contrast, another hyperparameter γ > 0 in focal loss controls the importance of easy/hard samples in an exponential way, whilst it does not make any difference between positive/negative samples. A higher γ forces the model to learn more from difficult (hard) samples. When α and γ are set to 0, the focal loss is then equal to conventional cross entropy.
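For reference, a PyTorch sketch of binary focal loss in the spirit of Lin et al. (2017) is given below; the exact formulation and hyperparameter values used here are not stated in the text, so this is an illustration rather than the patent's implementation.

```python
import torch


def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits, targets: tensors of shape (batch, frames); targets in {0, 1}."""
    p = torch.sigmoid(logits)
    pt = torch.where(targets > 0.5, p, 1.0 - p)            # probability of the true class
    alpha_t = torch.where(targets > 0.5,
                          torch.full_like(p, alpha),        # weight for positive frames
                          torch.full_like(p, 1.0 - alpha))  # weight for negative frames
    # (1 - pt)^gamma down-weights easy samples; alpha_t balances the two classes.
    loss = -alpha_t * (1.0 - pt) ** gamma * torch.log(pt.clamp(min=1e-8))
    return loss.mean()
```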
In the inference stage, the sequential acoustic features are extracted from the streaming audio, and then split into consecutive segments, preferably by a fixed-length window size and a fixed step size. The window size is determined by the perception field of the mobile-Transformer, and the step size determines how frequently keyword detection is applied. Each segment is then fed into the mobile-Transformer network for inference, and a prediction y_i is obtained. For a series of segments, a series of predictions {y_i, i = 0, ..., I} is then obtained. After that, smoothing is applied to reduce the vibration of the predictions. The smoothing function can be an average, a linear/nonlinear weighted, or an inverse-cosine weighted function. Once a smoothed prediction y_i is higher than a predefined threshold score, it triggers the system.
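The smoothing step can be realised in several ways; the sketch below shows an arithmetic mean, a linearly weighted mean and one plausible reading of the inverse-cosine weighting, all of which are illustrative choices rather than the patent's definition.

```python
import numpy as np


def smooth(scores, mode="mean"):
    """Combine a window of recent predictions into a single smoothed score."""
    scores = np.asarray(scores, dtype=np.float32)
    n = len(scores)
    if mode == "mean":
        w = np.ones(n)
    elif mode == "linear":
        w = np.arange(1, n + 1, dtype=np.float32)                 # favour recent frames
    elif mode == "inv_cos":
        w = 1.0 - np.cos(np.linspace(0.0, np.pi / 2, n + 1)[1:])  # smoothly rising weights
    else:
        raise ValueError(mode)
    return float(np.sum(w * scores) / np.sum(w))
```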
For sequence learning, the positional information of elements is of importance. However, the attention mechanism in Transformer is neither recurrent nor convolutional. In NLP, one simple way is to add a position encoding. In contrast to this explicit way, in the speech processing domain an implicit way has been shown to be efficient by using convolutional operations, which may automatically capture the contextual information when a deep structure is taken.
An example of the detailed configuration of the convolutional blocks is shown in Figure 3. Each dilated residual causal convolutional block 300 contains a dilated causal convolutional layer 301, a pooling layer 302 and an add and layer normalization layer 303. The dilated convolution aims to increase the perception field of the network, whilst the causal convolution forces the network to look back only at its historical information and thus reduces the latency. There is a residual connection between the output of the pooling layer and the input of the block, which makes the blocks more efficient when increasing the depth of the convolutional block, because such a residual structure can efficiently deal with the gradient vanishing issue that often occurs when training neural networks with gradient-based learning methods and backpropagation.
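A simplified PyTorch sketch of one such block is given below; the channel count, kernel width, dilation factor and pooling configuration are not specified above and are therefore assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DilatedResidualCausalBlock(nn.Module):
    """Dilated causal convolution (301) + pooling (302) + residual add and
    layer normalisation (303), cf. Figure 3."""

    def __init__(self, channels=64, kernel_size=3, dilation=2):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation            # left-only padding = causal
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.pool = nn.AvgPool1d(kernel_size=1)            # placeholder pooling layer
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                                  # x: (batch, channels, time)
        y = F.pad(x, (self.pad, 0))                        # look back only, no future frames
        y = self.pool(self.conv(y))
        y = x + y                                          # residual connection
        return self.norm(y.transpose(1, 2)).transpose(1, 2)
```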
The outputs of the last dilated residual causal convolutional block are then fed into the attention blocks (206 in Figure 2). As illustrated in the example of Figure 4, the attention blocks contain a series of identical time-restricted self-attention blocks 401 but with shared attention weights, as illustrated at 402. Compared with other models such as LSTM-RNN or GRU-RNN, the self-attention block does not use recurrent operations, which are considered to be unfriendly for parallel computation when using a GPU. Thus, this largely facilitates the model training and inference process. One self-attention block is illustrated in Figure 5. Each self-attention process comprises, in order: a layer normalisation layer 501 ; a multi-head attention layer 502 configured to operate on data extracted from the convolutional processing and the layer normalisation layer; one or more combining layers 503 configured to combine and normalise outputs of the multi-head attention layer and the layer normalisation layer; and multiple convolutional layers configured to form convolution on outputs of the combining layers 503.
In this example, the multi-head attention layer 502 is configured to operate on an output of the layer normalisation layer 501 using a key, query and value weight matrix to extract a semantic representation therefrom. The key, query and value weight matrix is shared across multiple time-restricted self-attention blocks.
The attention block first applies layer normalisation at 501, then projects the input to queries (Q), keys (K), and values (V). The attention is a combination of Q and K, where Q acts as the giver and K as the receiver, and is defined as the inner product of Q and K divided by the square root of their dimension. Each input unit creates its attention towards all units by providing Q to match K. V can be seen as an information extractor that produces a unique value based on the attention over its input representations. The output units are obtained by summing these values over the sequence, as indicated at 503. In doing so, each obtained representation implicitly contains semantic information over the whole sequence. To jointly attend to information from different representation subspaces at different positions, a multi-head attention strategy is also applied by splitting the queries, keys, and values into several parts. After that, a feedforward layer 504 with ReLU activation functions can follow to increase the non-linear learning capability of the blocks. For the self-attention layers and feedforward layers, residual adding is applied to deal with the gradient vanishing problem as the depth of the network increases, while layer normalisation reduces the internal covariate shift, as shown at 505.
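A compact sketch of such a block is given below, using PyTorch's built-in nn.MultiheadAttention, which bundles the Q, K and V projections and the multi-head split. It illustrates the ordering of Figure 5 (layer normalisation, multi-head attention, residual add, feedforward with ReLU) but not the compressed attention or the convolutional replacements described later; the head count and feedforward width are placeholders.

import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    def __init__(self, d_model, n_heads=4, d_ff=256):
        super().__init__()
        self.norm_in = nn.LayerNorm(d_model)       # layer normalisation (501)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # (502)
        self.norm_mid = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))   # feedforward with ReLU (504)

    def forward(self, x, attn_mask=None):          # x: (batch, time, d_model)
        h = self.norm_in(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask)   # Q = K = V = h
        x = x + a                                  # residual add and combine (503, 505)
        x = x + self.ff(self.norm_mid(x))          # feedforward with residual connection
        return x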
In this example, the system also uses group separable convolution layers 506 and 507 to replace the last two feedforward layers in self-attention blocks.
As mentioned previously, low latency and low computation are important for voice trigger, due to its 'always-on' nature. Therefore, for the mobile-Transformer encoder blocks, time-restricted (truncated) self-attention is preferably utilised because of i) its capability for streaming inference; and ii) its computational efficiency. Compared with the original self-attention, which depends on the entire input sequence {xt, t = 0, ..., T}, the truncated self-attention at time t only accesses the sequence {xk, k = t − b, ..., t, ..., t + h}, with h frames of look-ahead and b frames of look-back. Therefore, the self-attention processes can be performed on data derived from a time-restricted subset of the stream of acoustic data, and an estimate can be formed of the likelihood of detection of the acoustic characteristics in that time-restricted subset.
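One way to impose this time restriction is through an attention mask that forbids attention outside the [t − b, t + h] window, as sketched below; the look-back and look-ahead lengths are placeholders. A boolean mask of this form can be passed as the attn_mask argument of the self-attention sketch above, where True entries mark disallowed positions.

import torch

def time_restricted_mask(seq_len, look_back=20, look_ahead=2):
    # mask[i, j] is True where position i must NOT attend to position j,
    # i.e. where j lies outside the window [i - look_back, i + look_ahead].
    idx = torch.arange(seq_len)
    offset = idx[None, :] - idx[:, None]           # offset = j - i
    return (offset < -look_back) | (offset > look_ahead)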
In contrast to the conventional Transformer encoder, which is normally too large to be deployed for voice trigger, the model structure of the present mobile-Transformer is redesigned to compress the model to as small a size as possible whilst retaining its performance. The majority of the weights of a conventional Transformer encoder come from its stacked structure, its attention layers and its feedforward layers. In the following, three approaches are introduced to compress these three parts, respectively.
To deal with the stacked structure, which contributes heavily to the overall model size, a cross-layer parameter sharing strategy is employed for the mobile-Transformer, as shown in Figure 4. Preferably, only the attention parameters are shared across blocks (402); preferably, neither an all-shared strategy nor a feedforward-shared strategy is used. The motivation for cross-layer parameter sharing is that the semantic relationships within the sequence are expected to be similar across layers. By sharing the attention parameters across N blocks, the number of attention weights is reduced to 1/N of the original size.
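One way of realising this sharing is to instantiate a single attention module and reuse it in every block, while keeping the per-block layers separate, as in the following sketch. The block count, head count and feedforward width are placeholders, and the built-in nn.MultiheadAttention again stands in for the compressed attention described elsewhere.

import torch.nn as nn

class SharedAttentionStack(nn.Module):
    def __init__(self, d_model, n_blocks=4, n_heads=4, d_ff=256):
        super().__init__()
        # One attention module: its Q/K/V weights are reused by every block,
        # so the attention weight count is 1/n_blocks of an unshared stack.
        self.shared_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_blocks))
        # Feedforward layers are not shared: one per block.
        self.ffs = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model)) for _ in range(n_blocks))

    def forward(self, x, attn_mask=None):          # x: (batch, time, d_model)
        for norm, ff in zip(self.norms, self.ffs):
            h = norm(x)
            a, _ = self.shared_attn(h, h, h, attn_mask=attn_mask)
            x = x + a
            x = x + ff(x)
        return x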
As to the attention layers, the Q, K, and V weight matrices each contain d×d values, which also contributes greatly to the overall model size. To compress the attention matrices, low-rank decomposition (LRD) may be used, as illustrated in Figure 6. LRD maps high-dimensional vectors into a low-rank sub-space and then reconstructs them into another high-dimensional vector with minimal information loss. Using LRD for matrix compression can save storage and reduce computational complexity. In embodiments of the present invention, a bottleneck layer is inserted between the input and output layers to implement the LRD, such that the number of attention weights per matrix is reduced from d×d to 2×d×r, where r is the dimension of the bottleneck layer. The value of r therefore determines the compression rate 2r/d; for example, when r = d/4, the matrix size is reduced by half.
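The bottleneck factorisation can be sketched as two small projections in place of one d×d matrix; the example dimensions below are placeholders chosen to match the r = d/4 case mentioned above.

import torch.nn as nn

def low_rank_linear(d, r):
    # Replaces a d x d weight matrix (d*d parameters) with a rank-r factorisation
    # of 2*d*r parameters, i.e. a compression rate of 2r/d.
    return nn.Sequential(
        nn.Linear(d, r, bias=False),   # project to the low-rank bottleneck
        nn.Linear(r, d, bias=False),   # reconstruct to the original dimension
    )

# Example: d = 256, r = 64 (= d/4) halves the number of weights in the matrix.
q_proj = low_rank_linear(256, 64)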
As to the two feedforward layers on top of each self-attention block, they are not shared across blocks, as mentioned previously, and thus become another major contributor to the weight count. To shrink this component, group convolution and separable convolution may be used, as illustrated in Figure 7. Group convolution and separable convolution, which were first proposed in AlexNet and MobileNet respectively (see Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., and Adam, H. (2017), "MobileNets: Efficient convolutional neural networks for mobile vision applications", arXiv preprint arXiv:1704.04861), are two alternative convolutional methods. In group convolution, the filters are separated into different groups, as illustrated at 701. Each group is responsible for conventional convolutions over a certain depth of channels. Group convolution is efficient in training because i) it shrinks the original convolution; and ii) it allows the model to be trained over multiple GPUs in parallel. Each filter group is able to learn a unique representation of the data. Separable convolution normally comprises two steps: i) a depthwise convolution 702, in which a single filter is applied per input channel, and ii) a pointwise convolution 703, in which a simple 1×1 convolution creates a linear combination of the outputs of the depthwise layer. After applying separable convolution per group, a shuffle operation 704 is conducted to avoid channel separation.
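A grouped depthwise-separable 1-D convolution with a channel shuffle, of the kind that might replace the feedforward layers, is sketched below. The kernel size and group count are assumptions (the number of channels must be divisible by the group count), and the sketch is illustrative rather than the exact layers 701-704 of Figure 7.

import torch
import torch.nn as nn

class GroupSeparableConv1d(nn.Module):
    def __init__(self, channels, kernel_size=3, groups=4):
        super().__init__()
        self.groups = groups
        # Depthwise convolution: one filter per input channel (groups == channels).
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        # Pointwise 1x1 convolution applied separately within each channel group.
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1, groups=groups)

    def forward(self, x):                          # x: (batch, channels, time)
        y = self.pointwise(self.depthwise(x))
        # Channel shuffle: interleave the groups so that information can mix
        # across groups in subsequent layers.
        b, c, t = y.shape
        y = y.view(b, self.groups, c // self.groups, t)
        return y.transpose(1, 2).reshape(b, c, t)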
Generally, Figure 8 shows an example of a computer-implemented method 800 for detecting a predetermined acoustic trigger in a stream of acoustic data. At step 801, the method comprises receiving a stream of acoustic data comprising a plurality of time-separated acoustic features. At step 802, the method comprises applying a sliding window to the acoustic features to identify a plurality of acoustic segments. At step 803, the method comprises performing convolutional processing on each acoustic segment to form intermediate data. At step 804, the method comprises processing the intermediate data using a plurality of self-attention processes to form a series of signals, each signal corresponding to an acoustic characteristic and indicating an estimated likelihood of detection of that characteristic in the acoustic features.
Figure 9 is a schematic representation of an apparatus 900 configured to perform the methods described herein. The apparatus 900 may be implemented on a device, such as a laptop, tablet, smart phone or TV.
The apparatus 900 comprises a processor 901 configured to process the sequential acoustic features in the manner described herein. For example, the processor 901 may be implemented as a computer program running on a programmable device such as a Central Processing Unit (CPU). The apparatus 900 also comprises a memory 902 which is arranged to communicate with the processor 901. Memory 902 may be a non-volatile memory. The processor 901 may also comprise a cache (not shown in Figure 9), which may be used to temporarily store data from memory 902. The system may comprise more than one processor and more than one memory. The memory may store data that is executable by the processor. The processor may be configured to operate in accordance with a computer program stored in non-transitory form on a machine readable storage medium. The computer program may store instructions for causing the processor to perform its methods in the manner described herein.
As described above, the system uses a convolutional network as the position encoder and the initial feature extractor. The system preferably utilises a causal convolutional network to better handle streaming audio signals, and preferably employs a time-restricted Transformer to reduce computational cost and latency.
The Transformer encoder is thus redesigned to make it hardware efficient. The system can adopt the attention-weight sharing strategy across blocks to significantly reduce the model size. Alternatively, the system can use low-rank decomposition to replace the full attention weight matrices.
Optionally, the Transformer encoder can be pre-trained on a large-scale unlabelled dataset by self-supervised learning. The self-supervised learning may include, but is not limited to, i) generative approaches, such as reconstructing the input after masking or replacing a certain percentage of the input features, or after distorting the input features with noise; and ii) contrastive approaches, such as contrastive predictive coding from raw audio.
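As an illustration of the generative approach, the masking step might look like the following sketch; the masking ratio and mask value are assumptions, and the reconstruction objective itself is omitted.

import torch

def mask_frames(features, mask_ratio=0.15, mask_value=0.0):
    # features: (batch, time, feat_dim). Randomly mask a fraction of the frames;
    # the encoder is then trained to reconstruct the original frames at the
    # masked positions.
    mask = torch.rand(features.shape[:2]) < mask_ratio     # (batch, time)
    masked = features.clone()
    masked[mask] = mask_value
    return masked, mask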
The acoustic trigger is preferably a voice trigger, as exemplified in the examples described above. The apparatus may advantageously be configured to perform as a voice assistant in response to the signals indicating the presence in the stream of acoustic data of a series of acoustic characteristics that correspond to the acoustic trigger. However, other acoustic triggers may be determined.
The apparatus may be configured to cause a device implementing the apparatus, or connected to the apparatus, to perform a function in response to the signals indicating the presence in the stream of acoustic data of a series of acoustic characteristics that correspond to the acoustic trigger. For example, the apparatus may cause the device to ‘wake up’ (for example, enter a mode in which the device is fully operational from a low-power mode) or perform some other function.
The method described above can be integrated with other detection methods. For example, after the detection method described above has fired, the device may apply a further, more computationally intensive detection method. For example, when the device has been 'woken up' and is in a higher-power mode, it may use a different acoustic detection method from that described above as part of a voice assistant function. As described above, the system employs a Transformer encoder as its backbone for the voice trigger application. The system uses a convolutional network as the position encoder and the initial feature extractor, and utilises a causal convolutional network to better handle streaming audio signals. The system can employ a time-restricted Transformer to further reduce computational cost and latency.
The Transformer encoder is thus redesigned to make it hardware efficient. Sharing the attention weights across blocks may significantly reduce the model size. The system can also use low-rank decomposition to replace the full attention weight matrices, and group separable convolution layers to replace the feedforward layers in the self-attention blocks.
Thus, the mobile-Transformer architecture described herein may significantly reduce the model size and support streaming voice trigger with low latency and good performance.
When compared with other state-of-the-art models having similar model sizes for voice trigger on the HiMia dataset, the mobile-Transformer has, in some implementations, significantly outperformed other models in both clean and noisy conditions and is more robust than other models in noisy scenarios.
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.

Claims

1. An apparatus (900) for detecting a predetermined acoustic trigger in a stream of acoustic data, the apparatus comprising one or more processors (901) configured to perform a detection method (800) comprising:
receiving (801) sequential acoustic features (201) extracted from a stream of acoustic signals;
applying (802) a sliding window to the acoustic features to identify a plurality of acoustic segments;
performing (803) convolutional processing (205) on each acoustic segment to form intermediate data; and
processing (804) the intermediate data using a plurality of self-attention processes (206) to form a series of signals (203), each signal corresponding to an acoustic characteristic and indicating an estimated likelihood of detection of that characteristic in the acoustic features.
2. An apparatus as claimed in claim 1, wherein the one or more processors (901) are configured to perform the self-attention processes (206, 401) using weighted networks with at least some of the weights of each such network being shared across multiple self-attention processes.
3. An apparatus as claimed in claim 1 or 2, wherein the one or more processors (901) are configured to perform the self-attention processes (401) on intermediate data derived from a time-restricted subset of the stream of acoustic data and to thereby form an estimate of the likelihood of detection of the acoustic characteristics in that time-restricted subset.
4. An apparatus as claimed in any preceding claim, wherein the self-attention processes are transformer processes.
5. An apparatus as claimed in any preceding claim, wherein each self-attention process comprises, in order:
a layer normalisation layer (501);
a multi-head attention layer (502) configured to operate on data extracted from the convolutional processing and the layer normalisation layer;
one or more combining layers (503) configured to combine and normalise outputs of the multi-head attention layer and the layer normalisation layer; and
multiple convolutional layers configured to perform convolution on outputs of the combining layers.
6. An apparatus as claimed in claim 5, wherein the multi-head attention layer (502) is configured to operate on an output of the layer normalisation layer (501) using a key, query and value weight matrix to extract a semantic representation therefrom.
7. An apparatus as claimed in claim 6, wherein the key, query and value weight matrix is shared across multiple time-restricted self-attention blocks.
8. An apparatus as claimed in claim 6 or 7, the apparatus being configured to compress the matrix by low-rank decomposition.
9. An apparatus as claimed in any of claims 6 to 8, wherein the or each convolutional layer (506, 507) is/are configured to perform group-separable convolution.
10. An apparatus as claimed in any of claims 6 to 9, wherein the or each convolutional layer is/are configured to perform one-dimensional convolution.
11. An apparatus as claimed in any preceding claim, wherein the convolutional processing (205) comprises processing the intermediate data using a dilated causal convolutional layer (301).
12. An apparatus as claimed in any preceding claim, wherein the acoustic trigger is a voice trigger.
13. An apparatus as claimed in any preceding claim, the apparatus (900) being configured to perform as a voice assistant in response to the signals indicating the presence in the stream of acoustic data of a series of acoustic characteristics that correspond to the acoustic trigger.
14. An apparatus as claimed in any preceding claim, wherein the sliding window is of a constant size.
15. A computer-implemented method (800) for detecting a predetermined acoustic trigger in a stream of acoustic data, the method comprising:
receiving (801) a stream of acoustic data (201) comprising a plurality of time-separated acoustic features;
applying (802) a sliding window to the acoustic features to identify a plurality of acoustic segments;
performing (803) convolutional processing (205) on each acoustic segment to form intermediate data; and
processing (804) the intermediate data using a plurality of self-attention processes (206) to form a series of signals (203), each signal corresponding to an acoustic characteristic and indicating an estimated likelihood of detection of that characteristic in the acoustic features.
PCT/EP2020/085015 2020-12-08 2020-12-08 End-to-end streaming acoustic trigger apparatus and method WO2022122121A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/EP2020/085015 WO2022122121A1 (en) 2020-12-08 2020-12-08 End-to-end streaming acoustic trigger apparatus and method
EP20821179.7A EP4238088A1 (en) 2020-12-08 2020-12-08 End-to-end streaming acoustic trigger apparatus and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2020/085015 WO2022122121A1 (en) 2020-12-08 2020-12-08 End-to-end streaming acoustic trigger apparatus and method

Publications (1)

Publication Number Publication Date
WO2022122121A1 true WO2022122121A1 (en) 2022-06-16

Family

ID=73790095

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2020/085015 WO2022122121A1 (en) 2020-12-08 2020-12-08 End-to-end streaming acoustic trigger apparatus and method

Country Status (2)

Country Link
EP (1) EP4238088A1 (en)
WO (1) WO2022122121A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200105256A1 (en) * 2018-09-28 2020-04-02 Sonos, Inc. Systems and methods for selective wake word detection using neural network models
US10388272B1 (en) * 2018-12-04 2019-08-20 Sorenson Ip Holdings, Llc Training speech recognition systems using word sequences
WO2020122985A1 (en) * 2018-12-10 2020-06-18 Interactive-Al, Llc Neural modulation codes for multilingual and style dependent speech and language processing
US20190378509A1 (en) * 2019-07-22 2019-12-12 Lg Electronics Inc. Speech processing method using artificial intelligence device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DEVLIN J; CHANG M W; LEE K; TOUTANOVA K: "Bert: Pre-training of deep bidirectional transformers for language understanding", arXiv preprint arXiv:1810.04805, 11 October 2018 (2018-10-11)
LIN, T. Y.; GOYAL, P.; GIRSHICK, R.; HE, K.; DOLLÁR, P.: "Focal loss for dense object detection", Proceedings of the IEEE International Conference on Computer Vision, 2017, pages 2980-2988
VASWANI A; SHAZEER N; PARMAR N; USZKOREIT J; JONES L; GOMEZ A N; KAISER L; POLOSUKHIN I: "Attention is all you need", Advances in Neural Information Processing Systems, vol. 30, 2017, pages 5998-6008
VASWANI ASHISH ET AL: "Attention Is All You Need", 31st Conference on Neural Information Processing Systems (NIPS 2017), 9 December 2017 (2017-12-09), Long Beach, CA, USA, XP055832424, Retrieved from the Internet <URL:https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf> [retrieved on 20210817] *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220310070A1 (en) * 2021-03-26 2022-09-29 Mitsubishi Electric Research Laboratories, Inc. Artificial Intelligence System for Capturing Context by Dilated Self-Attention
US11557283B2 (en) * 2021-03-26 2023-01-17 Mitsubishi Electric Research Laboratories, Inc. Artificial intelligence system for capturing context by dilated self-attention
CN116072125A (en) * 2023-04-07 2023-05-05 成都信息工程大学 Method and system for constructing self-supervision speaker recognition model in noise environment
CN116072125B (en) * 2023-04-07 2023-10-17 成都信息工程大学 Method and system for constructing self-supervision speaker recognition model in noise environment
CN117524228A (en) * 2024-01-08 2024-02-06 腾讯科技(深圳)有限公司 Voice data processing method, device, equipment and medium

Also Published As

Publication number Publication date
EP4238088A1 (en) 2023-09-06

Similar Documents

Publication Publication Date Title
CN108010515B (en) Voice endpoint detection and awakening method and device
US10403266B2 (en) Detecting keywords in audio using a spiking neural network
WO2022122121A1 (en) End-to-end streaming acoustic trigger apparatus and method
CN106683661B (en) Role separation method and device based on voice
Ravuri et al. Recurrent neural network and LSTM models for lexical utterance classification.
CN110364143B (en) Voice awakening method and device and intelligent electronic equipment
KR102483774B1 (en) End-to-end streaming keyword detection
US10032463B1 (en) Speech processing with learned representation of user interaction history
US9378733B1 (en) Keyword detection without decoding
TW201935464A (en) Method and device for voiceprint recognition based on memorability bottleneck features
US11450310B2 (en) Spoken language understanding
CN117059103A (en) Acceleration method of voice recognition fine tuning task based on low-rank matrix approximation
CN112071308A (en) Awakening word training method based on speech synthesis data enhancement
Bluche et al. Predicting detection filters for small footprint open-vocabulary keyword spotting
CN114467096A (en) Enhancing attention-based neural networks to selectively focus on past inputs
Lim et al. Weakly labeled semi-supervised sound event detection using CRNN with inception module.
Tripathi et al. Focal loss based residual convolutional neural network for speech emotion recognition
CN115472160A (en) System and method for robust wake word detection
CN111210815B (en) Deep neural network construction method for voice command word recognition, and recognition method and device
Mhiri et al. A low latency ASR-free end to end spoken language understanding system
Pan et al. Speech recognition via Hidden Markov Model and neural network trained by genetic algorithm
Yang et al. Multimodal short video rumor detection system based on contrastive learning
CN114927128A (en) Voice keyword detection method and device, electronic equipment and readable storage medium
Wazir et al. Deep learning-based detection of inappropriate speech content for film censorship
JP7345667B2 (en) Small footprint multichannel keyword spotting

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20821179

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020821179

Country of ref document: EP

Effective date: 20230601

NENP Non-entry into the national phase

Ref country code: DE