WO2022122121A1 - End-to-end streaming acoustic trigger apparatus and method - Google Patents

End-to-end streaming acoustic trigger apparatus and method

Info

Publication number
WO2022122121A1
Authority
WO
WIPO (PCT)
Prior art keywords
acoustic
layer
attention
self
convolutional
Prior art date
Application number
PCT/EP2020/085015
Other languages
French (fr)
Inventor
Zixing ZHANG
Thorin FARNSWORTH
Senling Lin
Salah KAROUT
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/EP2020/085015 priority Critical patent/WO2022122121A1/en
Priority to EP20821179.7A priority patent/EP4238088A1/en
Publication of WO2022122121A1 publication Critical patent/WO2022122121A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/28Constructional details of speech recognition systems
    • G10L15/32Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems

Definitions

  • This invention relates to acoustic detection systems, such as voice trigger systems.
  • For voice trigger applications, which may also be referred to as keyword spotting or wakeup word detection, it is desirable to improve model performance not only in terms of accuracy and robustness, but also in terms of hardware efficiency due to the ‘always-on’ characteristics of such systems.
  • Reducing the storage requirements and computational costs of voice trigger models to fit the memory and energy constraints is therefore of significant importance.
  • Previous approaches for voice trigger can generally be grouped into filler-based and end-to-end approaches.
  • The former approaches regard all background noise and non-wakeup speech as fillers, and model both the wakeup words and the fillers, whereas the latter approaches model the offset of wakeup words and the other words.
  • Typical filler-based approaches seek help from automatic speech recognition (ASR) systems, where hidden Markov models (HMMs) are used to represent both the wakeup word (a.k.a. the keyword) and the background audio.
  • However, their performance depends heavily on the accuracy of the phoneme predictions.
  • The complexity of ASR systems also increases the deployment difficulty, due to their high memory and power requirements.
  • To overcome these issues, neural network-only based approaches have previously been proposed. These approaches utilize advanced deep learning models to predict the wakeup words frame-wise by stacking multiple acoustic frames as inputs. Then, a sliding window is applied to average the posteriors. Once the smoothed value surpasses a pre-defined threshold, a wakeup word may be detected.
  • However, such methods can suffer a decrease in voice trigger accuracy.
  • end-to-end approaches have gradually become the mainstream technology for voice trigger. Such approaches can straightforwardly estimate the wakeup point of keywords. Compared with the filler-based approaches, the end-to-end structure is simpler and may be more effective, as it directly optimizes the detection score. However, in current end-to-end models, the context information over sequences is not well explored for voice trigger.
  • The Transformer encoder, as described in Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I, “Attention is all you need”, Advances in neural information processing systems, 2017;30:5998-6008, as well as its variants such as Bert, as described in Devlin J, Chang MW, Lee K, Toutanova K, “Bert: Pre-training of deep bidirectional transformers for language understanding”, arXiv preprint arXiv:1810.04805, 2018, are commonly used in natural language processing (NLP).
  • The major advantage of Transformer is its efficiency in extracting context-dependent representations. Compared with Recurrent Neural Networks (RNNs), such as long short-term memory (LSTM) or Gated Recurrent Unit (GRU) RNNs, Transformer avoids the recurrent process, which is considered to be unfriendly for parallel computation when using a Graphics Processing Unit (GPU).
  • However, the vanilla Transformer encoder was designed without considering deployment of the model in an edge device. This issue largely impedes its applications, because these devices normally have strong storage and energy consumption limitations. Recently, much effort has been made toward compressing such models into smaller ones. In the context of voice trigger, nevertheless, these models are still larger than needed.
  • An apparatus for detecting a predetermined acoustic trigger in a stream of acoustic data, the apparatus comprising one or more processors configured to perform a detection method comprising: receiving sequential acoustic features extracted from a stream of acoustic signals; applying a sliding window to the acoustic features to identify a plurality of acoustic segments; performing convolutional processing on each acoustic segment to form intermediate data; and processing the intermediate data using a plurality of self-attention processes to form a series of signals, each signal corresponding to an acoustic characteristic and indicating an estimated likelihood of detection of that characteristic in the acoustic features.
  • the one or more processors may be configured to perform the self-attention processes using weighted networks with at least some of the weights of each such network being shared across multiple self-attention processes. This may allow the model size to be kept small whilst achieving improved results and keeping the latency low.
  • the one or more processors may be configured to perform the self-attention processes on intermediate data derived from a time-restricted subset of the stream of acoustic data and to thereby form an estimate of the likelihood of detection of the acoustic characteristics in that time-restricted subset.
  • Time-restricted (truncated) self-attention may therefore be utilized, which is capable of streaming inference and is efficient in terms of computational complexity.
  • the self-attention processes may be transformer processes. This may allow the process to explicitly capture context-dependent representations and may be friendly for parallel computation compared with other methods such as LSTM/GRU-RNN.
  • Each self-attention process may comprise, in order: a layer normalisation layer; a multi-head attention layer configured to operate on data extracted from the convolutional processing and the layer normalisation layer; one or more combining layers configured to combine and normalise outputs of the multi-head attention layer and the layer normalisation layer; and multiple convolutional layers configured to form convolution on outputs of the combining layers.
  • the multi-head attention layer may be configured to operate on an output of the layer normalisation layer using a key, query and value weight matrix to extract a semantic representation therefrom.
  • The key, query and value weight matrix may be shared across multiple time-restricted self-attention blocks. This may allow the model size to be kept small whilst achieving improved results and keeping the latency low.
  • the apparatus may be configured to compress the matrix by low-rank decomposition. This may further reduce the model size and improve efficiency.
  • the or each convolutional layer may be configured to perform group-separable convolution.
  • the or each convolutional layer may be configured to perform one-dimensional convolution.
  • the convolutional processing may comprise processing the intermediate data using a dilated causal convolutional layer. This may allow for encoding of the position of acoustic inputs and extraction of initial representations.
  • the acoustic trigger may be a voice trigger. This may allow the apparatus to be used in an electronic device incorporating a speech assistant function.
  • the apparatus may be configured to perform as a voice assistant in response to the signals indicating the presence in the stream of acoustic data of a series of acoustic characteristics that correspond to the acoustic trigger. This may allow the apparatus to be implemented in electronic devices having a voice assistant feature, such as smartphones.
  • the sliding window may be of a constant size. This may allow the process to be computationally efficient.
  • a computer-implemented method for detecting a predetermined acoustic trigger in a stream of acoustic data comprising: receiving a stream of acoustic data comprising a plurality of time-separated acoustic features; performing convolutional processing on each acoustic segment to form intermediate data; and processing the intermediate data using a plurality of self-attention processes to form a series of signals, each signal corresponding to an acoustic characteristic and indicating an estimated likelihood of detection of that characteristic in the acoustic features.
  • Figure 1 illustrates a flowchart describing the operation of the system described herein.
  • Figure 2 schematically illustrates an example of the voice trigger behaviour in the training and inference stages.
  • Figure 3 shows a block diagram of the dilated residual causal convolutional components.
  • Figure 4 shows a block diagram of the time-restricted self-attention components with attention weights sharing strategy.
  • Figure 5 shows a block diagram of the time-restricted self-attention block with three different model compression approaches.
  • Figure 6 schematically illustrates the low-rank decomposition.
  • Figure 7 shows a block diagram illustrating a group separable convolution layer.
  • Figure 8 schematically illustrates an example of a computer-implemented method for detecting a predetermined acoustic trigger in a stream of acoustic data.
  • Figure 9 schematically illustrates an example of an apparatus for detecting a predetermined acoustic trigger in a stream of acoustic data.
  • Described herein is an end-to-end voice trigger system based on a Transformer encoder, referred to as “WakeupNet”.
  • This system performs end-to-end training and inference processes for wakeup word detection.
  • the system may achieve an accurate prediction using only a small footprint.
  • the system can exploit the context-capturing capability of Transformer, as sequential information is important for wakeup-word detection.
  • the conventional Transformer encoder is too large to fit this task.
  • different model compression approaches are introduced to redesign the traditional vanilla Transformer encoder into a smaller but efficient one, referred to herein as mobile-Transformer.
  • the system takes an end-to-end voice trigger structure where only the endpoints of wakeup words (i.e. , the quite short region right after each wakeup word) will be annotated as positive labels. All other regions may be annotated as negative labels, optionally including the wakeup word itself.
  • Such an end-to-end framework may make the model optimisation more straightforward and may avoid any other intermediate prediction steps.
  • the approach described herein does not depend on the phoneme prediction accuracy of an ASR system and can avoid the need to deploy a complex ASR system into devices having computational-resource constraints.
  • the framework described herein uses a Transformer encoder as its backbone.
  • Transformer is capable of capturing the context-dependent representations by using a self-attention mechanism.
  • Transformer has no recurrent operations.
  • GPU Graphic Processing Unit
  • the basic structure of mobile-Transformer comprises three components - dilated residual causal convolutional blocks, time-restricted self-attention blocks, and a linear decoder.
  • the convolutional blocks are used to encode the position of acoustic inputs and extract initial representations.
  • the self-attention blocks are mainly deployed for capturing the context dependence, as discussed above.
  • The linear decoder is used to provide a final prediction. Compared with the conventional Transformer encoder that is mainly used for NLP, the mobile-Transformer described herein is suitable for use in the speech domain.
  • Compared to the conventional Transformer encoder, which is often too large to be used for the task of voice trigger, the mobile-Transformer holds fewer weights but still exhibits efficient performance.
  • a compression approach can be utilized - attention weight sharing. Attention weights can be shared across blocks so as to significantly reduce the memory storage requirement for model saving.
  • the apparatus can advantageously perform self-attention processes using weighted networks with at least some of the weights of each such network being shared across multiple self-attention processes.
  • a low-rank decomposition and group separable convolution can also be introduced to replace the conventional attention weights and feedforward layers, to further reduce the model size. A much smaller model can be achieved with competitive performance when compared to the conventional Transformer encoder.
  • the end-to-end system only annotates the wakeup point of the keywords as the training target in the training process.
  • The end-to-end system uses an additional average smoothing function to reduce the fluctuation of prediction logits in the inference stage.
  • the average smoothing function may be the arithmetic mean, or the weighted linear/nonlinear mean.
  • The endpoint of a wakeup word is much shorter than the remaining region. This can lead to a significant data imbalance issue when only annotating the end region as positive and the rest as negative labels. To overcome this issue, the end-to-end system repeats the positive annotations several frames before or after the endpoint.
  • The end-to-end system uses focal loss to further deal with the data imbalance issue (described below) and to explore the hard samples in the training process.
  • the system can use a variety of online data augmentation approaches in the training process, including adding additive noise and convolutional noise, changing speech speed, applying specAugmentation and specCutout.
  • FIG. 1 illustrates a flowchart of the WakeupNet voice trigger system.
  • a feature extraction unit converts the audio stream into sequential acoustic features at 102.
  • the extracted acoustic features are segmented consecutively, preferably by a fixed window size with a fixed step size, into smaller segments at 103.
  • each segment is then fed into the mobile-Transformer model for prediction.
  • the obtained sequential predictions can then be smoothed by a smoothing unit at 105.
  • the system finally determines whether the smoothed predictions are above a predefined threshold at 106. If yes, it triggers the system; otherwise, it keeps monitoring wakeup words at 107.
  • FIG. 2 provides more detailed illustration of the WakeupNet voice trigger system by using mobile-Transformer as the system backbone.
  • the mobile-Transformer 202 comprises M stacked dilated residual causal convolutional blocks 205, N stacked time-restricted self-attention blocks 206, and a linear decoder 207.
  • The dilated residual causal convolution blocks 205 are mainly responsible for position encoding and initial acoustic representation extraction.
  • the attention blocks 206 are to capture the context-dependence information over the sequence.
  • the linear decoder 207 is used to give a final prediction.
  • One example of the acoustic feature type is logMel filter banks (LMFBs) with a dimension of 40. The LMFBs are extracted frame-wise with a typical window size of 25 ms and a step size of 10 ms. Therefore, in this implementation, 100 frames can be extracted within a one-second utterance.
  • In the training stage, positive samples (i.e., utterances that contain a wakeup word) and negative samples (i.e., utterances that do not include any wakeup words) are required.
  • the endpoint of a wakeup word is first determined.
  • an approach can be used that combines ASR and voice activity detection (VAD) methods.
  • the ASR system is preferably well trained on a large-scale English dataset beforehand, and is then used to predict the phoneme information of the keywords along with the timestamp.
  • the VAD approach can then be used to estimate the end timestamp of the keyword utterances.
  • the VAD approach can be utilized because most positive samples used for training are collected in a quiet scenario.
  • The end timestamp is then converted into the frame order. With the help of the ASR and VAD approaches, this avoids annotating the data manually, which can be time- and cost-consuming.
  • One example of the annotations for one positive sample is shown in Figure 2 (204).
  • The annotations are composed of ‘0’s, (L + 1 + R) ‘1’s, and ‘0’s.
  • L and R denote the number of times the positive annotation ‘1’ is repeated to the left and to the right of the frame index of the keyword endpoint. This annotation repetition strategy helps to relieve the data imbalance problem.
  • focal loss (see Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollar, P (2017), “Focal loss for dense object detection”, Proceedings of the IEEE international conference on computer vision, pp. 2980-2988) is selected as the objective function to calculate the distribution distance between predictions and targets, due to its effectiveness when training with unbalanced data over class.
  • The hyperparameter 0 < α < 1 controls the importance of positive/negative samples in a linear way, whilst it does not make any difference between easy/hard samples.
  • the sequential acoustic features are extracted from the streaming audio, and then split into consecutive segments by a preferably fixed-length window size and preferably a fixed step size.
  • The window size is determined by the perception field of the mobile-Transformer, and the step size determines how frequently keyword detection is applied.
  • Each segment is then fed into the mobile-Transformer network for inference, and a prediction y_i is obtained.
  • smoothing is applied to reduce the vibration of predictions.
  • the smoothing function can be an average, a linear/nonlinear weighted, or an inversed cosine weighted function.
  • Each dilated residual causal convolutional block 300 contains a dilated causal convolutional layer 301, a pooling layer 302 and an add and layer normalization layer 303.
  • The dilated convolution aims to increase the perception field of the network, whilst the causal convolution forces the network to look back only at its historical information and thus reduces the latency.
  • the outputs of the last dilated residual causal convolutional block are then fed into the attention blocks (206 in Figure 2).
  • the attention blocks contain a series of identical time-restricted self-attention blocks 401 but with shared attention weights, as illustrated at 402.
  • the self-attention block does not use recurrent operations, which are considered to be unfriendly for parallel computation when using a GPU. Thus, this largely facilitates the model training and inference process.
  • One self-attention block is illustrated in Figure 5.
  • Each self-attention process comprises, in order: a layer normalisation layer 501 ; a multi-head attention layer 502 configured to operate on data extracted from the convolutional processing and the layer normalisation layer; one or more combining layers 503 configured to combine and normalise outputs of the multi-head attention layer and the layer normalisation layer; and multiple convolutional layers configured to form convolution on outputs of the combining layers 503.
  • the multi-head attention layer 502 is configured to operate on an output of the layer normalisation layer 501 using a key, query and value weight matrix to extract a semantic representation therefrom.
  • the key, query and value weight matrix is shared across multiple time-restricted self-attention blocks.
  • the attention block firstly applies layer normalization at 501 , then projects the input to queries (Q), keys (K), and values (V).
  • Attention is a combination of Q and K, where Q acts as the giver and K acts as the receiver, and is defined as the inner product of Q and K divided by the square root of their dimension.
  • Each input unit creates its attention toward all units by providing Q to match K.
  • V can be seen as an information extractor that extracts a unique value based on the attention over its input representations.
  • the output units are obtained by summarizing the unique values over the sequences, indicated at 503. In doing this, each obtained representation implicitly contains the semantic information over the whole sequence.
  • a multihead attention strategy is applied as well by splitting queries, keys, and values into several parts.
  • A feedforward layer 504 with ReLU activation functions can follow to increase the non-linear learning capability of the blocks.
  • residual adding is applied to deal with the gradient vanishing problem when increasing the depth of networks, while layer normalising is for reducing the internal covariate shift, as shown at 505.
  • the system also uses group separable convolution layers 506 and 507 to replace the last two feedforward layers in self-attention blocks.
  • time-restricted (truncated) self-attention is preferably utilised because of i) its capability of streaming inference; and ii) its efficiency of computational complexity.
  • The model structure of the present mobile-Transformer is redesigned to compress the model into as small a size as possible, whilst retaining its performance efficiency. It is found that the major weights of the conventional Transformer encoder come from its stacked structure, the attention layers, as well as the feedforward layers. In the following, three different approaches are introduced to compress these three parts, respectively.
  • A cross-layer parameter sharing strategy is employed for mobile-Transformer, as shown in Figure 4.
  • the attention parameters are shared across blocks (402).
  • Neither the all-shared strategy nor the feedforward-shared strategy is used.
  • the motivation of the cross-layer parameter sharing is that the semantic relationship among the sequence is supposed to be similar although in different layers. By doing this, the number of attention weights can be significantly reduced to 1/N of the original size.
  • The V, K, and Q weight matrices each contain d × d values. This also contributes greatly to the entire model size.
  • Low-rank decomposition (LRD) maps high-dimensional vectors into a low-rank sub-space, and then reconstructs them into another high-dimensional vector with minimum information loss.
  • the use of LRD for matrix compression may save storage and reduce computational complexity.
  • A bottleneck layer is inserted between the input and output layers to simulate LRD, such that the number of attention weights d × d becomes 2 × d × r, where r is the dimension of the bottleneck layer (see the LRD sketch after this list).
  • group convolution and separable convolution may be used, as illustrated in Figure 7.
  • Group convolution and separable convolution, which were first proposed in AlexNet and MobileNet respectively (see Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., and Adam, H. (2017), “Mobilenets: Efficient convolutional neural networks for mobile vision applications”, arXiv preprint arXiv:1704.04861), are considered as two alternative convolutional methods.
  • In group convolution, the filters are separated into different groups, as illustrated at 701.
  • Each group is responsible for conventional convolutions with a certain depth.
  • the group convolution is efficient in training as i) it shrinks the original convolution; and ii) it allows the model training over multiple GPUs in a parallel fashion.
  • Each filter group is able to learn a unique representation of the data.
  • The separable convolution normally contains two steps: i) a depthwise convolution 702, where a single filter is applied per input channel, and ii) a pointwise convolution 703, where a simple 1×1 convolution is then used to create a linear combination of the output of the depthwise layer.
  • Finally, a shuffle operation 704 is conducted to avoid channel separation (see the group separable convolution sketch after this list).
  • Figure 8 shows an example of a computer-implemented method 800 for detecting a predetermined acoustic trigger in a stream of acoustic data.
  • the method comprises receiving a stream of acoustic data comprising a plurality of time-separated acoustic features.
  • the method comprises applying a sliding window to the acoustic features to identify a plurality of acoustic segments.
  • the method comprises performing convolutional processing on each acoustic segment to form intermediate data.
  • The method comprises processing the intermediate data using a plurality of self-attention processes to form a series of signals, each signal corresponding to an acoustic characteristic and indicating an estimated likelihood of detection of that characteristic in the acoustic features.
  • Figure 9 is a schematic representation of an apparatus 900 configured to perform the methods described herein.
  • the apparatus 900 may be implemented on a device, such as a laptop, tablet, smart phone or TV.
  • the apparatus 900 comprises a processor 901 configured to process the sequential acoustic features in the manner described herein.
  • the processor 901 may be implemented as a computer program running on a programmable device such as a Central Processing Unit (CPU).
  • the apparatus 900 also comprises a memory 902 which is arranged to communicate with the processor 901.
  • Memory 902 may be a non-volatile memory.
  • the processor 901 may also comprise a cache (not shown in Figure 9), which may be used to temporarily store data from memory 902.
  • the system may comprise more than one processor and more than one memory.
  • the memory may store data that is executable by the processor.
  • the processor may be configured to operate in accordance with a computer program stored in non-transitory form on a machine readable storage medium.
  • the computer program may store instructions for causing the processor to perform its methods in the manner described herein.
  • the system uses a convolutional network as the position encoder and the initial feature extractor.
  • the system preferably utilises a causal convolutional network to allow it to better deal with streaming audio signals.
  • The system preferably employs a time-restricted Transformer to reduce computational cost and reduce the latency.
  • the Transformer encoder is thus redesigned to make it hardware efficient.
  • The system can adopt the attention-weight sharing strategy across blocks to significantly reduce the model size.
  • The system can use low-rank decomposition to replace the attention weight matrices.
  • The Transformer encoder can be pre-trained on a large-scale unlabelled dataset by self-supervised learning.
  • The self-supervised learning may include, but is not limited to, i) generative approaches, such as input reconstruction after masking/replacing a certain percentage of input features or distorting the input features by noise, and ii) contrastive approaches, such as contrastive predictive coding from raw audio.
  • the acoustic trigger is preferably a voice trigger, as exemplified in the examples described above.
  • the apparatus may advantageously be configured to perform as a voice assistant in response to the signals indicating the presence in the stream of acoustic data of a series of acoustic characteristics that correspond to the acoustic trigger.
  • other acoustic triggers may be determined.
  • the apparatus may be configured to cause a device implementing the apparatus, or connected to the apparatus, to perform a function in response to the signals indicating the presence in the stream of acoustic data of a series of acoustic characteristics that correspond to the acoustic trigger. For example, the apparatus may cause the device to ‘wake up’ (for example, enter a mode in which the device is fully operational from a low-power mode) or perform some other function.
  • the device may implement a further detection method that is more computationally intensive. For example, when the device has been ‘woken up’ and is in a higher power mode, the device may use a different acoustic detection method to that described above as part of a voice assistant function.
  • the system employs a Transformer encoder as its backbone in the application of voice trigger.
  • the system uses a convolutional network as the position encoder and the initial feature extractor.
  • the system utilises a causal convolutional network to allow it to better deal with streaming audio signals.
  • The system can employ a time-restricted Transformer to further reduce computational cost and latency.
  • the Transformer encoder is thus redesigned to make it hardware efficient. Sharing attention weights may significantly reduce the model size.
  • The system can also use low-rank decomposition for the attention weights, and group separable convolution layers to replace the feedforward layers in the self-attention blocks.
  • the mobile-transformer architecture described herein may significantly reduce the model size and support streaming voice trigger, with low latency and good performance.
  • When compared with other state-of-the-art models having similar model sizes for voice trigger on the HiMia dataset, the mobile-Transformer has, in some implementations, significantly outperformed other models in both clean and noisy conditions and is more robust than other models in noisy scenarios.
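LRD sketch referenced above: a PyTorch illustration of replacing one d × d projection with a rank-r bottleneck, so that d × d weights become 2 × d × r. The function name and dimensions are assumptions for illustration, not the patent's implementation.

```python
import torch.nn as nn


def low_rank_linear(d, r):
    """Approximate one d x d projection with a rank-r bottleneck (cf. Figure 6)."""
    return nn.Sequential(
        nn.Linear(d, r, bias=False),   # project into the low-rank sub-space
        nn.Linear(r, d, bias=False),   # reconstruct the high-dimensional vector
    )
```

Group separable convolution sketch referenced above: a grouped convolution followed by a pointwise 1×1 convolution and a channel shuffle, in the spirit of Figure 7. Channel count, number of groups and kernel size are illustrative assumptions.

```python
import torch.nn as nn


class GroupSeparableConv1d(nn.Module):
    def __init__(self, channels=128, kernel_size=3, groups=4):
        super().__init__()
        self.groups = groups
        self.grouped = nn.Conv1d(channels, channels, kernel_size,
                                 padding=kernel_size // 2, groups=groups)  # grouped/depthwise step
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)      # pointwise 1x1 step

    def forward(self, x):                      # x: (batch, channels, time)
        y = self.pointwise(self.grouped(x))
        b, c, t = y.shape                      # channel shuffle to avoid
        y = y.view(b, self.groups, c // self.groups, t)   # channel-group separation
        return y.transpose(1, 2).reshape(b, c, t)
```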

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Complex Calculations (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

Described is an apparatus (900) and method (800) for detecting a predetermined acoustic trigger in a stream of acoustic data. The apparatus comprises one or more processors (901) configured to perform a detection method (800) comprising: receiving (801) sequential acoustic features (201) extracted from a stream of acoustic signals; applying (802) a sliding window to the acoustic features to identify a plurality of acoustic segments; performing (803) convolutional processing (205) on each acoustic segment to form intermediate data; and processing (804) the intermediate data using a plurality of self-attention processes (206) to form a series of signals (203), each signal corresponding to an acoustic characteristic and indicating an estimated likelihood of detection of that characteristic in the acoustic features. This may significantly reduce the model size required and support streaming voice trigger, with low latency and good performance.

Description

END-TO-END STREAMING ACOUSTIC TRIGGER APPARATUS AND METHOD
FIELD OF THE INVENTION
This invention relates to acoustic detection systems, such as voice trigger systems.
BACKGROUND
For voice trigger applications, which may also be referred to as keyword spotting or wakeup word detection, it is desirable to improve model performance not only in terms of accuracy and robustness, but also in terms of hardware efficiency due to the ‘always-on’ characteristics of such systems. Thus, reducing the storage requirements and computational costs of voice trigger models to fit the memory and energy constraints is of significant importance.
Previous approaches for voice trigger can generally be grouped into filler-based and end-to-end approaches. The former approaches regard all background noise and non-wakeup speech as fillers, and model both the wakeup words and the fillers, whereas the latter approaches model the offset of wakeup words and the other words.
Typical filler-based approaches seek help from automatic speech recognition (ASR) systems, where hidden Markov models (HMMs) are used to represent both the wakeup word (a.k.a. the keyword) and the background audio. However, their performance depends heavily on the accuracy of the phoneme predictions. The complexity of ASR systems also increases the deployment difficulty, due to their high memory and power requirements.
To overcome these issues, neural network-only based approaches have previously been proposed. These approaches utilize advanced deep learning models to predict the wakeup words frame-wise by stacking multiple acoustic frames as inputs. Then, a sliding window is applied to average the posteriors. Once the smoothed value surpasses a pre-defined threshold, a wakeup word may be detected. However, such methods can suffer a decrease in voice trigger accuracy.
More recently, end-to-end approaches have gradually become the mainstream technology for voice trigger. Such approaches can straightforwardly estimate the wakeup point of keywords. Compared with the filler-based approaches, the end-to-end structure is simpler and may be more effective, as it directly optimizes the detection score. However, in current end-to-end models, the context information over sequences is not well explored for voice trigger. The Transformer encoder, as described in Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I, “Attention is all you need”, Advances in neural information processing systems, 2017;30:5998-6008, as well as its variants such as Bert, as described in Devlin J, Chang MW, Lee K, Toutanova K, “Bert: Pre-training of deep bidirectional transformers for language understanding”, arXiv preprint arXiv:1810.04805, 2018, are commonly used in natural language processing (NLP). The major advantage of Transformer is its efficiency in extracting context-dependent representations. It can explicitly explore the context dependence over a long sequence by a self-attention mechanism. Compared with Recurrent Neural Networks (RNNs), such as long short-term memory (LSTM) or Gated Recurrent Unit (GRU) RNNs, Transformer avoids the recurrent process, which is considered to be unfriendly for parallel computation when using a Graphics Processing Unit. Thus, it largely facilitates the model training and inference processes. Encouraged by its great success in NLP, the Transformer encoder has recently attracted increasing attention in other domains, such as computer vision and speech processing.
However, the vanilla Transformer encoder was designed without considering deployment of the model in an edge device. This issue largely impedes its applications, because these devices normally have strong storage and energy consumption limitations. Recently, much effort has been made toward compressing such models into smaller ones. In the context of voice trigger, nevertheless, these models are still larger than needed.
It is desirable to develop an apparatus and method for voice trigger applications that overcomes these problems.
SUMMARY OF THE INVENTION
According to one aspect there is provided an apparatus for detecting a predetermined acoustic trigger in a stream of acoustic data, the apparatus comprising one or more processors configured to perform a detection method comprising: receiving sequential acoustic features extracted from a stream of acoustic signals; applying a sliding window to the acoustic features to identify a plurality of acoustic segments; performing convolutional processing on each acoustic segment to form intermediate data; and processing the intermediate data using a plurality of self-attention processes to form a series of signals, each signal corresponding to an acoustic characteristic and indicating an estimated likelihood of detection of that characteristic in the acoustic features.
This may significantly reduce the model size required and support streaming voice trigger, with low latency and good performance. The one or more processors may be configured to perform the self-attention processes using weighted networks with at least some of the weights of each such network being shared across multiple self-attention processes. This may allow the model size to be kept small whilst achieving improved results and keeping the latency low.
The one or more processors may be configured to perform the self-attention processes on intermediate data derived from a time-restricted subset of the stream of acoustic data and to thereby form an estimate of the likelihood of detection of the acoustic characteristics in that time-restricted subset. Time-restricted (truncated) self-attention may therefore be utilized, which is capable of streaming inference and is efficient in terms of computational complexity.
The self-attention processes may be transformer processes. This may allow the process to explicitly capture context-dependent representations and may be friendly for parallel computation compared with other methods such as LSTM/GRU-RNN.
Each self-attention process may comprise, in order: a layer normalisation layer; a multi-head attention layer configured to operate on data extracted from the convolutional processing and the layer normalisation layer; one or more combining layers configured to combine and normalise outputs of the multi-head attention layer and the layer normalisation layer; and multiple convolutional layers configured to form convolution on outputs of the combining layers.
The multi-head attention layer may be configured to operate on an output of the layer normalisation layer using a key, query and value weight matrix to extract a semantic representation therefrom.
The key, query and value weight matrix may be shared across multiple time-restricted selfattention blocks. This may allow the model size to be kept small whilst achieving improved results and keeping the latency low.
The apparatus may be configured to compress the matrix by low-rank decomposition. This may further reduce the model size and improve efficiency.
The or each convolutional layer may be configured to perform group-separable convolution.
This may further reduce the model size and improve efficiency. The or each convolutional layer may be configured to perform one-dimensional convolution. The convolutional processing may comprise processing the intermediate data using a dilated causal convolutional layer. This may allow for encoding of the position of acoustic inputs and extraction of initial representations.
The acoustic trigger may be a voice trigger. This may allow the apparatus to be used in an electronic device incorporating a speech assistant function.
The apparatus may be configured to perform as a voice assistant in response to the signals indicating the presence in the stream of acoustic data of a series of acoustic characteristics that correspond to the acoustic trigger. This may allow the apparatus to be implemented in electronic devices having a voice assistant feature, such as smartphones.
The sliding window may be of a constant size. This may allow the process to be computationally efficient.
According to a second aspect there is provided a computer-implemented method for detecting a predetermined acoustic trigger in a stream of acoustic data, the method comprising: receiving a stream of acoustic data comprising a plurality of time-separated acoustic features; performing convolutional processing on each acoustic segment to form intermediate data; and processing the intermediate data using a plurality of self-attention processes to form a series of signals, each signal corresponding to an acoustic characteristic and indicating an estimated likelihood of detection of that characteristic in the acoustic features.
This may significantly reduce the model size required and support streaming voice trigger, with low latency and good performance.
BRIEF DESCRIPTION OF THE FIGURES
The present invention will now be described by way of example with reference to the accompanying drawings.
In the drawings:
Figure 1 illustrates a flowchart describing the operation of the system described herein.
Figure 2 schematically illustrates an example of the voice trigger behaviour in the training and inference stages.
Figure 3 shows a block diagram of the dilated residual causal convolutional components.
Figure 4 shows a block diagram of the time-restricted self-attention components with attention weights sharing strategy.
Figure 5 shows a block diagram of the time-restricted self-attention block with three different model compression approaches.
Figure 6 schematically illustrates the low-rank decomposition.
Figure 7 shows a block diagram illustrating a group separable convolution layer.
Figure 8 schematically illustrates an example of a computer-implemented method for detecting a predetermined acoustic trigger in a stream of acoustic data.
Figure 9 schematically illustrates an example of an apparatus for detecting a predetermined acoustic trigger in a stream of acoustic data.
DETAILED DESCRIPTION
Described herein is an end-to-end voice trigger system based on a Transformer encoder, referred to as “WakeupNet”. This system performs end-to-end training and inference processes for wakeup word detection. The system may achieve an accurate prediction using only a small footprint.
The system can exploit the context-capturing capability of Transformer, as sequential information is important for wakeup-word detection. However, the conventional Transformer encoder is too large to fit this task. To address this issue, different model compression approaches are introduced to redesign the traditional vanilla Transformer encoder into a smaller but efficient one, referred to herein as mobile-Transformer.
The system takes an end-to-end voice trigger structure where only the endpoints of wakeup words (i.e. , the quite short region right after each wakeup word) will be annotated as positive labels. All other regions may be annotated as negative labels, optionally including the wakeup word itself. Such an end-to-end framework may make the model optimisation more straightforward and may avoid any other intermediate prediction steps. Thus, compared with the prior filler-based voice trigger approaches, the approach described herein does not depend on the phoneme prediction accuracy of an ASR system and can avoid the need to deploy a complex ASR system into devices having computational-resource constraints.
As mentioned previously, the framework described herein uses a Transformer encoder as its backbone. Transformer is capable of capturing the context-dependent representations by using a self-attention mechanism. Transformer has no recurrent operations. Thus, it supports parallel computation well when using a Graphic Processing Unit (GPU), such that it facilitates speed in both the training and inference stages.
In the preferred implementation, the basic structure of mobile-Transformer comprises three components - dilated residual causal convolutional blocks, time-restricted self-attention blocks, and a linear decoder.
The convolutional blocks are used to encode the position of acoustic inputs and extract initial representations. The self-attention blocks are mainly deployed for capturing the context dependence, as discussed above. The linear decoder is used to provide a final prediction. Compared with the conventional Transformer encoder that is mainly used for NLP, the mobile-Transformer described herein is suitable for use in the speech domain.
Compared to the conventional Transformer encoder, which is often too large to be used in the task of voice trigger, the mobile-Transformer holds fewer weights but still exhibits efficient performance. To assist this, a compression approach can be utilized - attention weight sharing. Attention weights can be shared across blocks so as to significantly reduce the memory storage requirement for model saving. As will be described in more detail below, the apparatus can advantageously perform self-attention processes using weighted networks with at least some of the weights of each such network being shared across multiple self-attention processes. A low-rank decomposition and group separable convolution can also be introduced to replace the conventional attention weights and feedforward layers, to further reduce the model size. A much smaller model can be achieved with competitive performance when compared to the conventional Transformer encoder.
In the preferred implementation, the end-to-end system only annotates the wakeup point of the keywords as the training target in the training process. Optionally, the end-to-end system uses an additional average smoothing function to reduce the fluctuation of prediction logits in the inference stage. The average smoothing function may be the arithmetic mean, or the weighted linear/nonlinear mean. In the training dataset, the endpoint of a wakeup word is much shorter than the remaining region. This can lead to a significant data imbalance issue when only annotating the end region as positive and the rest as negative labels. To overcome this issue, the end-to-end system repeats the positive annotations several frames before or after the endpoint. The end-to-end system uses focal loss to further deal with the data imbalance issue (described below) and to explore the hard samples in the training process. The system can use a variety of online data augmentation approaches in the training process, including adding additive noise and convolutional noise, changing speech speed, and applying specAugmentation and specCutout.
Figure 1 illustrates a flowchart of the WakeupNet voice trigger system. Once the smartphone, or any other electronic device having a microphone, receives an audio stream at 101 , a feature extraction unit converts the audio stream into sequential acoustic features at 102. The extracted acoustic features are segmented consecutively, preferably by a fixed window size with a fixed step size, into smaller segments at 103. At 104, each segment is then fed into the mobile-Transformer model for prediction. The obtained sequential predictions can then be smoothed by a smoothing unit at 105. The system finally determines whether the smoothed predictions are above a predefined threshold at 106. If yes, it triggers the system; otherwise, it keeps monitoring wakeup words at 107.
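To make the flow of Figure 1 concrete, the following is a minimal Python sketch of the detection loop, assuming a generic `model` callable that returns a wakeup posterior for each segment; the function name, window/step defaults and threshold are illustrative assumptions rather than the patent's reference implementation.

```python
from collections import deque

import numpy as np


def detect_stream(feature_frames, model, window=100, step=10,
                  smooth_len=5, threshold=0.8):
    """Slide a fixed window over sequential acoustic features (103), run the
    model on each segment (104), smooth the posteriors (105) and compare the
    smoothed score against a predefined threshold (106)."""
    recent = deque(maxlen=smooth_len)                    # buffer for smoothing
    for start in range(0, len(feature_frames) - window + 1, step):
        segment = feature_frames[start:start + window]   # fixed window, fixed step
        score = float(model(segment))                    # posterior for the wakeup class
        recent.append(score)
        smoothed = float(np.mean(recent))                # arithmetic-mean smoothing
        if smoothed > threshold:
            return True, start + window                  # trigger, with frame index
    return False, None                                   # keep monitoring (107)
```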
Figure 2 provides a more detailed illustration of the WakeupNet voice trigger system by using mobile-Transformer as the system backbone. Given an input of sequential acoustic features {x_t, t = 0, ..., T}, indicated at 201, and corresponding targets (labels/annotations) {y_i ∈ [0, 1], i = 0, ..., I}, indicated at 204, where T and I are the corresponding acoustic frame and label numbers respectively, the system aims to find a nonlinear mapping function f, represented by the mobile-Transformer 202, which is able to accurately and promptly detect the wakeup words from the sequential acoustic features.
In this example, the mobile-Transformer 202 comprises M stacked dilated residual causal convolutional blocks 205, N stacked time-restricted self-attention blocks 206, and a linear decoder 207. The dilated residual causal convolution blocks 205 are mainly responsible for position encoding and initial acoustic representation extraction. The attention blocks 206 are to capture the context-dependence information over the sequence. The linear decoder 207 is used to give a final prediction.
One example of the acoustic feature type is logMel filter banks (LMFBs) with a dimension of 40. The LMFBs are extracted frame-wise with a typical window size of 25 ms and a step size of 10 ms. Therefore, in this implementation, 100 frames can be extracted within a one-second utterance. In the training stage, positive samples (i.e., the utterances that contain a wakeup word) and negative samples (i.e., the utterances that do not include any wakeup words) are required. For negative samples, the negative class ‘0’ is spread throughout the whole utterance. For positive samples, it is of importance to annotate the timestamp of the endpoint of keywords to configure the labels. Only the end duration of keywords is labelled as the positive class ‘1’, and the remainder of the duration is labelled as the negative class ‘0’. Such an annotation approach is helpful for at least two reasons: i) it directly optimises the detection task and avoids any intermediate components compared with the filler-based systems; ii) it ultimately avoids premature or delayed triggering.
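As an illustration of this feature front end (40-dimensional LMFBs, 25 ms window, 10 ms step), the sketch below uses librosa; the choice of library, the sampling rate and the function name are assumptions, since the text does not specify an implementation.

```python
import librosa


def extract_lmfb(wav_path, sr=16000, n_mels=40):
    """Return logMel filter bank features of shape (frames, n_mels)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=int(0.025 * sr),         # 25 ms analysis window
        hop_length=int(0.010 * sr),    # 10 ms step -> roughly 100 frames per second
        n_mels=n_mels)
    return librosa.power_to_db(mel).T  # log compression, then (frames, 40)
```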
In order to annotate data as positive “1” (i.e., the endpoint of wakeup words) and negative “0”, the endpoint of a wakeup word is first determined. To obtain the endpoint of the keyword in positive samples, an approach can be used that combines ASR and voice activity detection (VAD) methods. The ASR system is preferably well trained on a large-scale English dataset beforehand, and is then used to predict the phoneme information of the keywords along with the timestamp. When the prediction score from the ASR system is low, the VAD approach can then be used to estimate the end timestamp of the keyword utterances. The VAD approach can be utilized because most positive samples used for training are collected in a quiet scenario. The end timestamp is then converted into the frame order. With the help of the ASR and VAD approaches, this avoids annotating the data manually, which can be time- and cost-consuming.
One example of the annotations for one positive sample is shown in Figure 2 (204). Here, the annotations are composed of ‘0’s, (L + 1 + R) ‘1’s, and ‘0’s. L and R denote the number of times the positive annotation ‘1’ is repeated to the left and to the right of the frame index of the keyword endpoint. This annotation repetition strategy is helpful to relieve the data imbalance problem.
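A small sketch of this annotation-repetition strategy is shown below; `make_frame_labels`, its arguments and the default L/R values are hypothetical names chosen for illustration.

```python
import numpy as np


def make_frame_labels(num_frames, endpoint_frame=None, L=10, R=10):
    """Frame-level targets: '1' at the keyword endpoint and for L frames before
    and R frames after it, '0' everywhere else (and for negative samples)."""
    labels = np.zeros(num_frames, dtype=np.float32)
    if endpoint_frame is not None:                       # positive sample
        lo = max(0, endpoint_frame - L)
        hi = min(num_frames, endpoint_frame + R + 1)
        labels[lo:hi] = 1.0                              # (L + 1 + R) positive '1's
    return labels
```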
Preferably, to optimise the network, focal loss (see Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017), “Focal loss for dense object detection”, Proceedings of the IEEE international conference on computer vision, pp. 2980-2988) is selected as the objective function to calculate the distribution distance between predictions and targets, due to its effectiveness when training with data that is unbalanced over classes. In the focal loss, the hyperparameter 0 < α < 1 controls the importance of positive/negative samples in a linear way, whilst it does not make any difference between easy/hard samples. In contrast, another hyperparameter γ > 0 in focal loss controls the importance of easy/hard samples in an exponential way, whilst it does not make any difference between positive/negative samples. A higher γ forces the model to learn more from difficult (hard) samples. When α and γ are set to 0, the focal loss is then equal to conventional cross entropy.
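For reference, a PyTorch sketch of binary focal loss in the spirit of Lin et al. (2017) is given below; the exact formulation and hyperparameter values used here are not stated in the text, so this is an illustration rather than the patent's implementation.

```python
import torch


def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits, targets: tensors of shape (batch, frames); targets in {0, 1}."""
    p = torch.sigmoid(logits)
    pt = torch.where(targets > 0.5, p, 1.0 - p)            # probability of the true class
    alpha_t = torch.where(targets > 0.5,
                          torch.full_like(p, alpha),        # weight for positive frames
                          torch.full_like(p, 1.0 - alpha))  # weight for negative frames
    # (1 - pt)^gamma down-weights easy samples; alpha_t balances the two classes.
    loss = -alpha_t * (1.0 - pt) ** gamma * torch.log(pt.clamp(min=1e-8))
    return loss.mean()
```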
In the inference stage, the sequential acoustic features are extracted from the streaming audio, and then split into consecutive segments, preferably by a fixed-length window size and a fixed step size. The window size is determined by the perception field of the mobile-Transformer, and the step size determines how frequently keyword detection is applied. Each segment is then fed into the mobile-Transformer network for inference, and a prediction y_i is obtained. For a series of segments, a series of predictions {y_i, i = 0, ..., I} is then obtained. After that, smoothing is applied to reduce the vibration of the predictions. The smoothing function can be an average, a linear/nonlinear weighted, or an inverse-cosine weighted function. Once a smoothed prediction y_i is higher than a predefined threshold score, it triggers the system.
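The smoothing step can be realised in several ways; the sketch below shows an arithmetic mean, a linearly weighted mean and one plausible reading of the inverse-cosine weighting, all of which are illustrative choices rather than the patent's definition.

```python
import numpy as np


def smooth(scores, mode="mean"):
    """Combine a window of recent predictions into a single smoothed score."""
    scores = np.asarray(scores, dtype=np.float32)
    n = len(scores)
    if mode == "mean":
        w = np.ones(n)
    elif mode == "linear":
        w = np.arange(1, n + 1, dtype=np.float32)                 # favour recent frames
    elif mode == "inv_cos":
        w = 1.0 - np.cos(np.linspace(0.0, np.pi / 2, n + 1)[1:])  # smoothly rising weights
    else:
        raise ValueError(mode)
    return float(np.sum(w * scores) / np.sum(w))
```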
For sequence learning, the positional information of elements is of importance. However, the attention mechanism in Transformer is neither recurrent nor convolutional. In NLP, one simple way is to add a position encoding. In contrast to this explicit way, in the speech processing domain an implicit way has been shown to be efficient by using convolutional operations, which may automatically capture the contextual information when a deep structure is taken.
An example of the detailed configuration of the convolutional blocks is shown in Figure 3. Each dilated residual causal convolutional block 300 contains a dilated causal convolutional layer 301, a pooling layer 302 and an add and layer normalization layer 303. The dilated convolution aims to increase the perception field of the network, whilst the causal convolution forces the network to look back only at its historical information and thus reduces the latency. There is a residual connection between the output of the pooling layer and the input of the block, which makes the blocks more efficient when increasing the depth of the convolutional block, because such a residual structure can efficiently deal with the gradient vanishing issue that often occurs when training neural networks with gradient-based learning methods and backpropagation.
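A simplified PyTorch sketch of one such block is given below; the channel count, kernel width, dilation factor and pooling configuration are not specified above and are therefore assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DilatedResidualCausalBlock(nn.Module):
    """Dilated causal convolution (301) + pooling (302) + residual add and
    layer normalisation (303), cf. Figure 3."""

    def __init__(self, channels=64, kernel_size=3, dilation=2):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation            # left-only padding = causal
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.pool = nn.AvgPool1d(kernel_size=1)            # placeholder pooling layer
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                                  # x: (batch, channels, time)
        y = F.pad(x, (self.pad, 0))                        # look back only, no future frames
        y = self.pool(self.conv(y))
        y = x + y                                          # residual connection
        return self.norm(y.transpose(1, 2)).transpose(1, 2)
```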
The outputs of the last dilated residual causal convolutional block are then fed into the attention blocks (206 in Figure 2). As illustrated in the example of Figure 4, the attention blocks contain a series of identical time-restricted self-attention blocks 401 but with shared attention weights, as illustrated at 402. Compared with other models such as LSTM-RNN or GRU-RNN, the self-attention block does not use recurrent operations, which are considered to be unfriendly for parallel computation when using a GPU. Thus, this largely facilitates the model training and inference process. One self-attention block is illustrated in Figure 5. Each self-attention process comprises, in order: a layer normalisation layer 501 ; a multi-head attention layer 502 configured to operate on data extracted from the convolutional processing and the layer normalisation layer; one or more combining layers 503 configured to combine and normalise outputs of the multi-head attention layer and the layer normalisation layer; and multiple convolutional layers configured to form convolution on outputs of the combining layers 503.
In this example, the multi-head attention layer 502 is configured to operate on an output of the layer normalisation layer 501 using a key, query and value weight matrix to extract a semantic representation therefrom. The key, query and value weight matrix is shared across multiple time-restricted self-attention blocks.
The attention block first applies layer normalisation at 501, then projects the input to queries (Q), keys (K), and values (V). The attention is a combination of Q and K, where Q acts as the giver and K as the receiver, and is defined as the inner product of Q and K divided by the square root of their dimension. Each input unit creates its attention towards all units by providing Q to match K. V can be seen as an information extractor that produces a unique value based on the attention over its input representations. The output units are obtained by summing these values over the sequence, as indicated at 503. In doing so, each obtained representation implicitly contains semantic information over the whole sequence. To jointly attend to information from different representation subspaces at different positions, a multi-head attention strategy is also applied by splitting the queries, keys, and values into several parts. After that, a feedforward layer 504 with ReLU activation functions can follow to increase the non-linear learning capability of the blocks. For the self-attention layers and feedforward layers, residual adding is applied to deal with the gradient vanishing problem as the depth of the network increases, while layer normalisation reduces the internal covariate shift, as shown at 505.
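A compact sketch of such a block is given below, using PyTorch's built-in nn.MultiheadAttention, which bundles the Q, K and V projections and the multi-head split. It illustrates the ordering of Figure 5 (layer normalisation, multi-head attention, residual add, feedforward with ReLU) but not the compressed attention or the convolutional replacements described later; the head count and feedforward width are placeholders.

import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    def __init__(self, d_model, n_heads=4, d_ff=256):
        super().__init__()
        self.norm_in = nn.LayerNorm(d_model)       # layer normalisation (501)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # (502)
        self.norm_mid = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))   # feedforward with ReLU (504)

    def forward(self, x, attn_mask=None):          # x: (batch, time, d_model)
        h = self.norm_in(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask)   # Q = K = V = h
        x = x + a                                  # residual add and combine (503, 505)
        x = x + self.ff(self.norm_mid(x))          # feedforward with residual connection
        return x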
In this example, the system also uses group separable convolution layers 506 and 507 to replace the last two feedforward layers in self-attention blocks.
As mentioned previously, low latency and low computation are important for voice trigger, due to its 'always-on' nature. Therefore, for the mobile-Transformer encoder blocks, time-restricted (truncated) self-attention is preferably utilised because of i) its capability for streaming inference; and ii) its computational efficiency. Compared with the original self-attention, which depends on the entire input sequence {xt, t = 0, ..., T}, the truncated self-attention at time t only accesses the sequence {xk, k = t − b, ..., t, ..., t + h}, with h frames of look-ahead and b frames of look-back. Therefore, the self-attention processes can be performed on data derived from a time-restricted subset of the stream of acoustic data, and an estimate can be formed of the likelihood of detection of the acoustic characteristics in that time-restricted subset.
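One way to impose this time restriction is through an attention mask that forbids attention outside the [t − b, t + h] window, as sketched below; the look-back and look-ahead lengths are placeholders. A boolean mask of this form can be passed as the attn_mask argument of the self-attention sketch above, where True entries mark disallowed positions.

import torch

def time_restricted_mask(seq_len, look_back=20, look_ahead=2):
    # mask[i, j] is True where position i must NOT attend to position j,
    # i.e. where j lies outside the window [i - look_back, i + look_ahead].
    idx = torch.arange(seq_len)
    offset = idx[None, :] - idx[:, None]           # offset = j - i
    return (offset < -look_back) | (offset > look_ahead)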
In contrast to the conventional Transformer encoder, which is normally too large to be deployed for voice trigger, the model structure of the present mobile-Transformer is redesigned to compress the model to as small a size as possible whilst retaining its performance. The majority of the weights of a conventional Transformer encoder come from its stacked structure, its attention layers and its feedforward layers. In the following, three approaches are introduced to compress these three parts, respectively.
To deal with the stacked structure, which contributes heavily to the overall model size, a cross-layer parameter sharing strategy is employed for the mobile-Transformer, as shown in Figure 4. Preferably, only the attention parameters are shared across blocks (402); preferably, neither an all-shared strategy nor a feedforward-shared strategy is used. The motivation for cross-layer parameter sharing is that the semantic relationships within the sequence are expected to be similar across layers. By sharing the attention parameters across N blocks, the number of attention weights is reduced to 1/N of the original size.
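One way of realising this sharing is to instantiate a single attention module and reuse it in every block, while keeping the per-block layers separate, as in the following sketch. The block count, head count and feedforward width are placeholders, and the built-in nn.MultiheadAttention again stands in for the compressed attention described elsewhere.

import torch.nn as nn

class SharedAttentionStack(nn.Module):
    def __init__(self, d_model, n_blocks=4, n_heads=4, d_ff=256):
        super().__init__()
        # One attention module: its Q/K/V weights are reused by every block,
        # so the attention weight count is 1/n_blocks of an unshared stack.
        self.shared_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_blocks))
        # Feedforward layers are not shared: one per block.
        self.ffs = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model)) for _ in range(n_blocks))

    def forward(self, x, attn_mask=None):          # x: (batch, time, d_model)
        for norm, ff in zip(self.norms, self.ffs):
            h = norm(x)
            a, _ = self.shared_attn(h, h, h, attn_mask=attn_mask)
            x = x + a
            x = x + ff(x)
        return x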
As to the attention layers, the Q, K, and V weight matrices each contain d×d values, which also contributes greatly to the overall model size. To compress the attention matrices, low-rank decomposition (LRD) may be used, as illustrated in Figure 6. LRD maps high-dimensional vectors into a low-rank sub-space and then reconstructs them into another high-dimensional vector with minimal information loss. Using LRD for matrix compression can save storage and reduce computational complexity. In embodiments of the present invention, a bottleneck layer is inserted between the input and output layers to implement the LRD, such that the number of attention weights per matrix is reduced from d×d to 2×d×r, where r is the dimension of the bottleneck layer. The value of r therefore determines the compression rate 2r/d; for example, when r = d/4, the matrix size is reduced by half.
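The bottleneck factorisation can be sketched as two small projections in place of one d×d matrix; the example dimensions below are placeholders chosen to match the r = d/4 case mentioned above.

import torch.nn as nn

def low_rank_linear(d, r):
    # Replaces a d x d weight matrix (d*d parameters) with a rank-r factorisation
    # of 2*d*r parameters, i.e. a compression rate of 2r/d.
    return nn.Sequential(
        nn.Linear(d, r, bias=False),   # project to the low-rank bottleneck
        nn.Linear(r, d, bias=False),   # reconstruct to the original dimension
    )

# Example: d = 256, r = 64 (= d/4) halves the number of weights in the matrix.
q_proj = low_rank_linear(256, 64)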
As to the two feedforward layers on top of each self-attention block, they are not shared across blocks, as mentioned previously, and thus become another major contributor to the weight count. To shrink this component, group convolution and separable convolution may be used, as illustrated in Figure 7. Group convolution and separable convolution, which were first proposed in AlexNet and MobileNet respectively (see Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., and Adam, H. (2017), "MobileNets: Efficient convolutional neural networks for mobile vision applications", arXiv preprint arXiv:1704.04861), are two alternative convolutional methods. In group convolution, the filters are separated into different groups, as illustrated at 701. Each group is responsible for conventional convolutions over a certain depth of channels. Group convolution is efficient in training because i) it shrinks the original convolution; and ii) it allows the model to be trained over multiple GPUs in parallel. Each filter group is able to learn a unique representation of the data. Separable convolution normally comprises two steps: i) a depthwise convolution 702, in which a single filter is applied per input channel, and ii) a pointwise convolution 703, in which a simple 1×1 convolution creates a linear combination of the outputs of the depthwise layer. After applying separable convolution per group, a shuffle operation 704 is conducted to avoid channel separation.
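A grouped depthwise-separable 1-D convolution with a channel shuffle, of the kind that might replace the feedforward layers, is sketched below. The kernel size and group count are assumptions (the number of channels must be divisible by the group count), and the sketch is illustrative rather than the exact layers 701-704 of Figure 7.

import torch
import torch.nn as nn

class GroupSeparableConv1d(nn.Module):
    def __init__(self, channels, kernel_size=3, groups=4):
        super().__init__()
        self.groups = groups
        # Depthwise convolution: one filter per input channel (groups == channels).
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        # Pointwise 1x1 convolution applied separately within each channel group.
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1, groups=groups)

    def forward(self, x):                          # x: (batch, channels, time)
        y = self.pointwise(self.depthwise(x))
        # Channel shuffle: interleave the groups so that information can mix
        # across groups in subsequent layers.
        b, c, t = y.shape
        y = y.view(b, self.groups, c // self.groups, t)
        return y.transpose(1, 2).reshape(b, c, t)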
Generally, Figure 8 shows an example of a computer-implemented method 800 for detecting a predetermined acoustic trigger in a stream of acoustic data. At step 801, the method comprises receiving a stream of acoustic data comprising a plurality of time-separated acoustic features. At step 802, the method comprises applying a sliding window to the acoustic features to identify a plurality of acoustic segments. At step 803, the method comprises performing convolutional processing on each acoustic segment to form intermediate data. At step 804, the method comprises processing the intermediate data using a plurality of self-attention processes to form a series of signals, each signal corresponding to an acoustic characteristic and indicating an estimated likelihood of detection of that characteristic in the acoustic features.
Figure 9 is a schematic representation of an apparatus 900 configured to perform the methods described herein. The apparatus 900 may be implemented on a device, such as a laptop, tablet, smart phone or TV.
The apparatus 900 comprises a processor 901 configured to process the sequential acoustic features in the manner described herein. For example, the processor 901 may be implemented as a computer program running on a programmable device such as a Central Processing Unit (CPU). The apparatus 900 also comprises a memory 902 which is arranged to communicate with the processor 901. Memory 902 may be a non-volatile memory. The processor 901 may also comprise a cache (not shown in Figure 9), which may be used to temporarily store data from memory 902. The system may comprise more than one processor and more than one memory. The memory may store data that is executable by the processor. The processor may be configured to operate in accordance with a computer program stored in non-transitory form on a machine readable storage medium. The computer program may store instructions for causing the processor to perform its methods in the manner described herein.
As described above, the system uses a convolutional network as the position encoder and the initial feature extractor. The system preferably utilises a causal convolutional network to better handle streaming audio signals, and preferably employs a time-restricted Transformer to reduce computational cost and latency.
The Transformer encoder is thus redesigned to make it hardware efficient. The system can adopt the attention-weight sharing strategy across blocks to significantly reduce the model size. Alternatively, the system can use low-rank decomposition to replace the full attention weight matrices.
Optionally, the Transformer encoder can be pre-trained on a large-scale unlabelled dataset by self-supervised learning. The self-supervised learning may include, but is not limited to, i) generative approaches, such as reconstructing the input after masking or replacing a certain percentage of the input features, or after distorting the input features with noise; and ii) contrastive approaches, such as contrastive predictive coding from raw audio.
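As an illustration of the generative approach, the masking step might look like the following sketch; the masking ratio and mask value are assumptions, and the reconstruction objective itself is omitted.

import torch

def mask_frames(features, mask_ratio=0.15, mask_value=0.0):
    # features: (batch, time, feat_dim). Randomly mask a fraction of the frames;
    # the encoder is then trained to reconstruct the original frames at the
    # masked positions.
    mask = torch.rand(features.shape[:2]) < mask_ratio     # (batch, time)
    masked = features.clone()
    masked[mask] = mask_value
    return masked, mask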
The acoustic trigger is preferably a voice trigger, as exemplified in the examples described above. The apparatus may advantageously be configured to perform as a voice assistant in response to the signals indicating the presence in the stream of acoustic data of a series of acoustic characteristics that correspond to the acoustic trigger. However, other acoustic triggers may be determined.
The apparatus may be configured to cause a device implementing the apparatus, or connected to the apparatus, to perform a function in response to the signals indicating the presence in the stream of acoustic data of a series of acoustic characteristics that correspond to the acoustic trigger. For example, the apparatus may cause the device to ‘wake up’ (for example, enter a mode in which the device is fully operational from a low-power mode) or perform some other function.
The method described above can be integrated with other detection methods. For example, after the detection method described above has fired, the device may apply a further, more computationally intensive detection method. For example, when the device has been 'woken up' and is in a higher-power mode, it may use a different acoustic detection method from that described above as part of a voice assistant function. As described above, the system employs a Transformer encoder as its backbone for the voice trigger application. The system uses a convolutional network as the position encoder and the initial feature extractor, and utilises a causal convolutional network to better handle streaming audio signals. The system can employ a time-restricted Transformer to further reduce computational cost and latency.
The Transformer encoder is thus redesigned to make it hardware efficient. Sharing the attention weights across blocks may significantly reduce the model size. The system can also use low-rank decomposition to replace the full attention weight matrices, and group separable convolution layers to replace the feedforward layers in the self-attention blocks.
Thus, the mobile-Transformer architecture described herein may significantly reduce the model size and support streaming voice trigger with low latency and good performance.
When compared with other state-of-the-art models having similar model sizes for voice trigger on the HiMia dataset, the mobile-Transformer has, in some implementations, significantly outperformed other models in both clean and noisy conditions and is more robust than other models in noisy scenarios.
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.

Claims

1. An apparatus (900) for detecting a predetermined acoustic trigger in a stream of acoustic data, the apparatus comprising one or more processors (901) configured to perform a detection method (800) comprising:
receiving (801) sequential acoustic features (201) extracted from a stream of acoustic signals;
applying (802) a sliding window to the acoustic features to identify a plurality of acoustic segments;
performing (803) convolutional processing (205) on each acoustic segment to form intermediate data; and
processing (804) the intermediate data using a plurality of self-attention processes (206) to form a series of signals (203), each signal corresponding to an acoustic characteristic and indicating an estimated likelihood of detection of that characteristic in the acoustic features.
2. An apparatus as claimed in claim 1, wherein the one or more processors (901) are configured to perform the self-attention processes (206, 401) using weighted networks with at least some of the weights of each such network being shared across multiple self-attention processes.
3. An apparatus as claimed in claim 1 or 2, wherein the one or more processors (901) are configured to perform the self-attention processes (401) on intermediate data derived from a time-restricted subset of the stream of acoustic data and to thereby form an estimate of the likelihood of detection of the acoustic characteristics in that time-restricted subset.
4. An apparatus as claimed in any preceding claim, wherein the self-attention processes are transformer processes.
5. An apparatus as claimed in any preceding claim, wherein each self-attention process comprises, in order:
a layer normalisation layer (501);
a multi-head attention layer (502) configured to operate on data extracted from the convolutional processing and the layer normalisation layer;
one or more combining layers (503) configured to combine and normalise outputs of the multi-head attention layer and the layer normalisation layer; and
multiple convolutional layers configured to perform convolution on outputs of the combining layers.
6. An apparatus as claimed in claim 5, wherein the multi-head attention layer (502) is configured to operate on an output of the layer normalisation layer (501) using a key, query and value weight matrix to extract a semantic representation therefrom.
7. An apparatus as claimed in claim 6, wherein the key, query and value weight matrix is shared across multiple time-restricted self-attention blocks.
8. An apparatus as claimed in claim 6 or 7, the apparatus being configured to compress the matrix by low-rank decomposition.
9. An apparatus as claimed in any of claims 6 to 8, wherein the or each convolutional layer (506, 507) is/are configured to perform group-separable convolution.
10. An apparatus as claimed in any of claims 6 to 9, wherein the or each convolutional layer is/are configured to perform one-dimensional convolution.
11. An apparatus as claimed in any preceding claim, wherein the convolutional processing (205) comprises processing the intermediate data using a dilated causal convolutional layer (301).
12. An apparatus as claimed in any preceding claim, wherein the acoustic trigger is a voice trigger.
13. An apparatus as claimed in any preceding claim, the apparatus (900) being configured to perform as a voice assistant in response to the signals indicating the presence in the stream of acoustic data of a series of acoustic characteristics that correspond to the acoustic trigger.
14. An apparatus as claimed in any preceding claim, wherein the sliding window is of a constant size.
15. A computer-implemented method (800) for detecting a predetermined acoustic trigger in a stream of acoustic data, the method comprising:
receiving (801) a stream of acoustic data (201) comprising a plurality of time-separated acoustic features;
applying (802) a sliding window to the acoustic features to identify a plurality of acoustic segments;
performing (803) convolutional processing (205) on each acoustic segment to form intermediate data; and
processing (804) the intermediate data using a plurality of self-attention processes (206) to form a series of signals (203), each signal corresponding to an acoustic characteristic and indicating an estimated likelihood of detection of that characteristic in the acoustic features.
PCT/EP2020/085015 2020-12-08 2020-12-08 End-to-end streaming acoustic trigger apparatus and method WO2022122121A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/EP2020/085015 WO2022122121A1 (en) 2020-12-08 2020-12-08 End-to-end streaming acoustic trigger apparatus and method
EP20821179.7A EP4238088A1 (en) 2020-12-08 2020-12-08 End-to-end streaming acoustic trigger apparatus and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2020/085015 WO2022122121A1 (en) 2020-12-08 2020-12-08 End-to-end streaming acoustic trigger apparatus and method

Publications (1)

Publication Number Publication Date
WO2022122121A1 true WO2022122121A1 (en) 2022-06-16

Family

ID=73790095

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2020/085015 WO2022122121A1 (en) 2020-12-08 2020-12-08 End-to-end streaming acoustic trigger apparatus and method

Country Status (2)

Country Link
EP (1) EP4238088A1 (en)
WO (1) WO2022122121A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200105256A1 (en) * 2018-09-28 2020-04-02 Sonos, Inc. Systems and methods for selective wake word detection using neural network models
US10388272B1 (en) * 2018-12-04 2019-08-20 Sorenson Ip Holdings, Llc Training speech recognition systems using word sequences
WO2020122985A1 (en) * 2018-12-10 2020-06-18 Interactive-Al, Llc Neural modulation codes for multilingual and style dependent speech and language processing
US20190378509A1 (en) * 2019-07-22 2019-12-12 Lg Electronics Inc. Speech processing method using artificial intelligence device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DEVLIN J; CHANG M W; LEE K; TOUTANOVA K: "Bert: Pre-training of deep bidirectional transformers for language understanding", arXiv preprint arXiv:1810.04805, 11 October 2018 (2018-10-11)
LIN, T. Y.; GOYAL, P.; GIRSHICK, R.; HE, K.; DOLLÁR, P.: "Focal loss for dense object detection", Proceedings of the IEEE International Conference on Computer Vision, 2017, pages 2980-2988
VASWANI A; SHAZEER N; PARMAR N; USZKOREIT J; JONES L; GOMEZ A N; KAISER L; POLOSUKHIN I: "Attention is all you need", Advances in Neural Information Processing Systems, vol. 30, 2017, pages 5998-6008
VASWANI ASHISH ET AL: "Attention Is All You Need", 31st Conference on Neural Information Processing Systems (NIPS 2017), 9 December 2017 (2017-12-09), Long Beach, CA, USA, XP055832424, Retrieved from the Internet <URL:https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf> [retrieved on 20210817] *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220310070A1 (en) * 2021-03-26 2022-09-29 Mitsubishi Electric Research Laboratories, Inc. Artificial Intelligence System for Capturing Context by Dilated Self-Attention
US11557283B2 (en) * 2021-03-26 2023-01-17 Mitsubishi Electric Research Laboratories, Inc. Artificial intelligence system for capturing context by dilated self-attention
CN116072125A (en) * 2023-04-07 2023-05-05 成都信息工程大学 Method and system for constructing self-supervision speaker recognition model in noise environment
CN116072125B (en) * 2023-04-07 2023-10-17 成都信息工程大学 Method and system for constructing self-supervision speaker recognition model in noise environment
CN117524228A (en) * 2024-01-08 2024-02-06 腾讯科技(深圳)有限公司 Voice data processing method, device, equipment and medium

Also Published As

Publication number Publication date
EP4238088A1 (en) 2023-09-06

Similar Documents

Publication Publication Date Title
CN108010515B (en) Voice endpoint detection and awakening method and device
US10403266B2 (en) Detecting keywords in audio using a spiking neural network
WO2022122121A1 (en) End-to-end streaming acoustic trigger apparatus and method
CN106683661B (en) Role separation method and device based on voice
Ravuri et al. Recurrent neural network and LSTM models for lexical utterance classification.
CN110364143B (en) Voice awakening method and device and intelligent electronic equipment
KR102483774B1 (en) End-to-end streaming keyword detection
US10032463B1 (en) Speech processing with learned representation of user interaction history
US9378733B1 (en) Keyword detection without decoding
TW201935464A (en) Method and device for voiceprint recognition based on memorability bottleneck features
US11450310B2 (en) Spoken language understanding
CN117059103A (en) Acceleration method of voice recognition fine tuning task based on low-rank matrix approximation
CN112071308A (en) Awakening word training method based on speech synthesis data enhancement
Bluche et al. Predicting detection filters for small footprint open-vocabulary keyword spotting
CN114467096A (en) Enhancing attention-based neural networks to selectively focus on past inputs
Lim et al. Weakly labeled semi-supervised sound event detection using CRNN with inception module.
Tripathi et al. Focal loss based residual convolutional neural network for speech emotion recognition
CN115472160A (en) System and method for robust wake word detection
CN111210815B (en) Deep neural network construction method for voice command word recognition, and recognition method and device
Mhiri et al. A low latency ASR-free end to end spoken language understanding system
Pan et al. Speech recognition via Hidden Markov Model and neural network trained by genetic algorithm
Yang et al. Multimodal short video rumor detection system based on contrastive learning
CN114927128A (en) Voice keyword detection method and device, electronic equipment and readable storage medium
Wazir et al. Deep learning-based detection of inappropriate speech content for film censorship
JP7345667B2 (en) Small footprint multichannel keyword spotting

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20821179

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020821179

Country of ref document: EP

Effective date: 20230601

NENP Non-entry into the national phase

Ref country code: DE