CN110619871B - Voice wakeup detection method, device, equipment and storage medium - Google Patents

Voice wakeup detection method, device, equipment and storage medium

Info

Publication number
CN110619871B
CN110619871B (application CN201810637168.1A)
Authority
CN
China
Prior art keywords
frame
audio data
wake
frames
acoustic model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810637168.1A
Other languages
Chinese (zh)
Other versions
CN110619871A (en)
Inventor
陈梦喆
雷鸣
高杰
张仕良
刘勇
姚海涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201810637168.1A priority Critical patent/CN110619871B/en
Publication of CN110619871A publication Critical patent/CN110619871A/en
Application granted granted Critical
Publication of CN110619871B publication Critical patent/CN110619871B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 Reducing energy consumption in communication networks
    • Y02D 30/70 Reducing energy consumption in communication networks in wireless communication networks

Abstract

The present disclosure provides a voice wake-up detection method, apparatus, device, and storage medium. An audio data frame within a predetermined range around a target frame in multi-frame audio data is input to an acoustic model component together with the target frame, where the acoustic model component is a feedforward sequential memory network model component and its output is a state recognition result of at least one frame of audio data among the target frame and the audio data frames within the predetermined range. A single frame of audio data that is located after the target frame and has not yet been processed is then taken as the next target frame, and subsequent target frames are processed iteratively with the acoustic model component. Finally, the state recognition results of multiple frames of the audio data are compared with a preset wake-up word to identify whether the multi-frame audio data constitutes a wake-up instruction. In this way, good wake-up performance and the real-time responsiveness required for wake-up can be ensured while the occupation of device-side resources is reduced.

Description

Voice wakeup detection method, device, equipment and storage medium
Technical Field
The disclosure relates to the field of voice technology, and in particular relates to a voice wake-up detection method, device, equipment and storage medium.
Background
Voice wake-up refers to switching a device from a sleep state to an active state when the user speaks a specific voice command (i.e., a wake-up word). Wake-up technology lets the user operate a device entirely by voice, freeing both hands; at the same time, because the device is only activated on demand, it does not need to remain in a working state at all times, which greatly reduces energy consumption. At present, voice wake-up technology is widely applied in various voice-controlled products such as robots, mobile phones, wearable devices, smart home appliances, and in-vehicle products.
Generally, such products need to operate both with and without a network connection, and wake-up, as the first step of the interaction, must work normally without a network, so it has to be implemented with the storage and computing resources on the device side. These resources are usually very limited: the number of CPU cores, the memory size, and the clock frequency are far smaller than those of an ordinary personal computer, let alone a cloud server. In the offline case, this limited budget is shared by wake-up together with tasks such as signal processing and semantic understanding, and since wake-up runs at high frequency, its resource occupation needs to be reduced as much as possible.
Moreover, under the constraint of small resource occupation, wake-up performance is equally important. Because a wake-up word carries little context information, the decision of whether to wake up depends entirely on the acoustic model. To pursue better performance, that is, a higher recall rate and a lower false wake-up rate, acoustic modeling tends to adopt larger model structures with stronger capacity to represent the data. At the same time, wake-up has strict requirements on real-time rate and latency, which determine how quickly the product responds after the user utters the wake-up word, and the computational cost of the acoustic model structure directly affects both indicators. There is therefore a certain tension between the two goals. Accordingly, how to ensure good wake-up performance and meet the real-time requirement without significantly increasing resource occupation is the main problem faced by existing voice wake-up technology.
Disclosure of Invention
An object of the present disclosure is to propose a voice wake-up detection scheme capable of ensuring good wake-up performance without significantly increasing the resource occupation.
According to a first aspect of the present disclosure, there is provided a voice wake-up detection method, including: inputting an audio data frame in a preset range near a target frame in multi-frame audio data and the target frame into an acoustic model component, wherein the acoustic model component is a feedforward sequence memory neural network model component, and the output of the acoustic model component is a state identification result of at least one frame of audio data in the target frame and the audio data frame in the preset range; taking single-frame audio data which is positioned behind the target frame and is not processed in the multi-frame audio data as a next target frame, and iteratively processing a plurality of target frames by using an acoustic model component; and comparing the state identification result of the audio data of a plurality of frames in the multi-frame audio data with a preset wake-up word to identify whether the multi-frame audio data is a wake-up instruction or not.
Optionally, the frames of audio data within the predetermined range include: frames of audio data within a first predetermined range of the multi-frame audio data that is located before the target frame; and/or frames of audio data within a second predetermined range of the multi-frame audio data after the target frame.
Optionally, the voice wake-up detection method further includes: detecting voice input of a user in real time; and framing the detected speech input to obtain the multi-frame audio data.
Optionally, the step of comparing the state recognition results of the audio data of the plurality of frames with a preset wake-up word includes: searching, among a plurality of preset path models, for a path model that matches the state recognition results, so as to identify whether the multi-frame audio data is a wake-up instruction, where different path models correspond to different recognition results.
Optionally, the path models include: a wake-up instruction model; a filler model; and a silence model.
Optionally, the acoustic model comprises: an input layer; a hidden layer structure; and the plurality of output layers are used for respectively predicting the analysis results of the audio data of different frames in the input.
Optionally, the hidden layer structure includes a plurality of hidden layers, wherein a memory module is disposed between at least two adjacent hidden layers, and the memory module is used for storing history information and future information that are useful for determining the current target frame.
Optionally, the output of the memory module is used as an input of a next hidden layer, and the output of the memory module includes the output of the current hidden layer, the outputs of the hidden layer up to a predetermined look-back order, and the outputs of the hidden layer up to a predetermined look-ahead order.
Optionally,

$$h_t^{l+1} = f\left(U^l \tilde{h}_t^l + b^{l+1}\right)$$

$$\tilde{h}_t^l = h_t^l + \sum_{i=1}^{N_1} a_i^l \odot h_{t-s_1 \cdot i}^l + \sum_{j=1}^{N_2} c_j^l \odot h_{t+s_2 \cdot j}^l$$

$$h_t^l = f\left(W^l h_t^{l-1} + b^l\right)$$

where $h_t^{l+1}$ denotes the input of the $(l+1)$-th hidden layer, obtained through the nonlinear transformation of the activation function $f$; $U^l$ denotes a weight matrix; $\tilde{h}_t^l$ denotes the output of the memory module; $b^{l+1}$ denotes a bias; $h_t^l$ denotes the output of the $l$-th hidden layer; $h_t^{l-1}$ denotes the input of the $l$-th hidden layer; $W^l$ denotes a weight matrix; $b^l$ denotes a bias; $t$ denotes the current time; $s_1$ and $s_2$ denote the encoding stride factors for historical and future times, respectively; $N_1$ and $N_2$ denote the look-back order and the look-ahead order, respectively; and $a_i^l$ and $c_j^l$ are the encoding coefficients of the memory module. The term $\sum_{i=1}^{N_1} a_i^l \odot h_{t-s_1 \cdot i}^l$ can be regarded as the hidden-layer outputs up to the predetermined look-back order, i.e., the result of element-wise multiplication of the hidden-layer outputs at the different look-back orders before the current time $t$ (with encoding stride $s_1$) by the corresponding encoding coefficients; the term $\sum_{j=1}^{N_2} c_j^l \odot h_{t+s_2 \cdot j}^l$ can be regarded as the hidden-layer outputs up to the predetermined look-ahead order, i.e., the result of element-wise multiplication of the hidden-layer outputs at the different look-ahead orders after the current time $t$ (with encoding stride $s_2$) by the corresponding encoding coefficients.
According to a second aspect of the present disclosure, there is also provided a voice wake-up detection apparatus, including: the state recognition module is used for inputting an audio data frame in a preset range near a target frame in multi-frame audio data and the target frame to the acoustic model component, wherein the acoustic model component is a feedforward sequence memory neural network model component, the output of the acoustic model component is the state recognition result of at least one frame of audio data in the target frame and the audio data frame in the preset range, the state recognition module takes single-frame audio data which is positioned behind the target frame and is not predicted in the multi-frame audio data as a next target frame to be analyzed, and the acoustic model component is used for processing a plurality of target frames; and the wake-up recognition module is used for comparing the state recognition results of the audio data of a plurality of frames in the multi-frame audio data with a preset wake-up word so as to recognize whether the multi-frame audio data is a wake-up instruction or not.
Optionally, the frames of audio data within the predetermined range include: frames of audio data within a first predetermined range of the multi-frame audio data that is located before the target frame; and/or frames of audio data within a second predetermined range of the multi-frame audio data after the target frame.
Optionally, the voice wake-up detection device further includes: the detection module is used for detecting the voice input of the user in real time; and the framing module is used for framing the detected voice input to obtain multi-frame audio data.
Optionally, the wake-up recognition module searches a path model matched with the state recognition results of the audio data of the multiple frames from a plurality of preset path models to recognize whether the audio data of the multiple frames is a wake-up instruction, wherein different path models correspond to different recognition results.
Optionally, the path models include: a wake-up instruction model; a filler model; and a silence model.
Optionally, the acoustic model comprises: an input layer; a hidden layer structure; and the plurality of output layers are used for respectively predicting the analysis results of the audio data of different frames in the input.
Optionally, the hidden layer structure includes a plurality of hidden layers, wherein a memory module is disposed between at least two adjacent hidden layers, and the memory module is used for storing history information and future information that are useful for determining the current target frame.
Optionally, the output of the memory module is used as an input of a next hidden layer, and the output of the memory module includes the output of the current hidden layer, the outputs of the hidden layer up to a predetermined look-back order, and the outputs of the hidden layer up to a predetermined look-ahead order.
Optionally,

$$h_t^{l+1} = f\left(U^l \tilde{h}_t^l + b^{l+1}\right)$$

$$\tilde{h}_t^l = h_t^l + \sum_{i=1}^{N_1} a_i^l \odot h_{t-s_1 \cdot i}^l + \sum_{j=1}^{N_2} c_j^l \odot h_{t+s_2 \cdot j}^l$$

$$h_t^l = f\left(W^l h_t^{l-1} + b^l\right)$$

where $h_t^{l+1}$ denotes the input of the $(l+1)$-th hidden layer, obtained through the nonlinear transformation of the activation function $f$; $U^l$ denotes a weight matrix; $\tilde{h}_t^l$ denotes the output of the memory module; $b^{l+1}$ denotes a bias; $h_t^l$ denotes the output of the $l$-th hidden layer; $h_t^{l-1}$ denotes the input of the $l$-th hidden layer; $W^l$ denotes a weight matrix; $b^l$ denotes a bias; $t$ denotes the current time; $s_1$ and $s_2$ denote the encoding stride factors for historical and future times, respectively; $N_1$ and $N_2$ denote the look-back order and the look-ahead order, respectively; and $a_i^l$ and $c_j^l$ are the encoding coefficients of the memory module. The term $\sum_{i=1}^{N_1} a_i^l \odot h_{t-s_1 \cdot i}^l$ can be regarded as the hidden-layer outputs up to the predetermined look-back order, i.e., the result of element-wise multiplication of the hidden-layer outputs at the different look-back orders before the current time $t$ (with encoding stride $s_1$) by the corresponding encoding coefficients; the term $\sum_{j=1}^{N_2} c_j^l \odot h_{t+s_2 \cdot j}^l$ can be regarded as the hidden-layer outputs up to the predetermined look-ahead order, i.e., the result of element-wise multiplication of the hidden-layer outputs at the different look-ahead orders after the current time $t$ (with encoding stride $s_2$) by the corresponding encoding coefficients.
According to a third aspect of the present disclosure, there is also provided a computing device comprising: a processor; and a memory having executable code stored thereon, which when executed by the processor causes the processor to perform the method as set forth in the first aspect of the present disclosure.
According to a fourth aspect of the present disclosure there is also provided a non-transitory machine-readable storage medium having stored thereon executable code which, when executed by a processor of an electronic device, causes the processor to perform a method as set out in the first aspect of the present disclosure.
According to the present disclosure, wake-up detection is performed by combining the multi-frame prediction mode with the FSMN, so that the number of frames for which the acoustic model must be computed is reduced severalfold. The occupation of device-side resources is thereby greatly reduced, and good wake-up performance and the real-time responsiveness required for wake-up can still be ensured under this small resource footprint.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be apparent from the following more particular descriptions of exemplary embodiments of the disclosure as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout exemplary embodiments of the disclosure.
Fig. 1 is an exemplary diagram showing an analysis manner for multi-frame audio data.
Fig. 2 is a schematic flow chart diagram illustrating a voice wake detection method according to an embodiment of the present disclosure.
Fig. 3A, 3B are example diagrams illustrating an analysis manner for multi-frame audio data according to an embodiment of the present disclosure.
Fig. 4 is a schematic diagram showing a structure of an acoustic model according to an embodiment of the present disclosure.
Fig. 5 is a schematic diagram illustrating the structure of an FSMN.
Fig. 6 is a schematic diagram illustrating a structure of an acoustic model according to an embodiment of the present disclosure.
Fig. 7 is a structural framework diagram illustrating a voice wake system in accordance with an embodiment of the present disclosure.
Fig. 8 is a schematic block diagram showing a structure of a voice wake-up detection apparatus according to an embodiment of the present disclosure.
FIG. 9 illustrates a schematic diagram of a computing device according to an embodiment of the invention.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
[ scheme overview ]
When the acoustic model is used for voice wake-up detection, the current frame in multi-frame audio data is generally used as the input of the acoustic model to obtain the output of the current frame. In order to improve the accuracy of the output result, for the input frame to be processed currently, audio data with a certain length before and after the input frame can be spliced to serve as input of the acoustic model, so that the input contains associated information of the input frame context. Thus, when the current frame is processed (i.e., predicted) using the acoustic model, audio data of a certain range before and after the current frame including the current frame is input, and only the prediction result of the current frame is output.
When this multi-frame-input, single-frame-output mode is used for voice wake-up detection, two adjacent inputs contain a stretch of repeated audio; in other words, the features of two adjacent inputs overlap to a considerable extent. Since the acoustic model only predicts the current frame, this feature overlap wastes resources during prediction, and the more the features overlap, the more obvious the waste becomes.
As shown in fig. 1, scale marks 0 to 9 represent consecutive multi-frame audio data after framing. In the present disclosure, the audio data between marks 0 and 1 may be regarded as the 1st frame of audio data, the audio data between marks 1 and 2 as the 2nd frame, and so on. Assume that, for the input frame currently to be predicted, the 3 frames of audio data after that frame are spliced to it as the input of the acoustic model. When predicting the 1st frame of audio data, the 1st to 4th frames can be used as input; when predicting the 2nd frame, the 2nd to 5th frames can be used as input; and when predicting the 3rd frame, the 3rd to 6th frames can be used as input.
It can be seen that there is repeated audio data in the 1 st and 2 nd inputs (frame 2-frame 4), there is repeated audio data in the 2 nd and 3 rd inputs (frame 3-frame 5), and there is also repeated audio data in the 1 st and 3 rd inputs (frame 3, frame 4).
The acoustic model processes the 1 st input to obtain a prediction result of the 1 st frame of audio data, then processes the 2 nd input to predict the 2 nd frame of audio data, and the 2 nd to 4 th frames of audio data in the current input are processed data when the 1 st input is processed. And, the acoustic model continues to process the 3 rd input to predict the 3 rd frame of audio data, with the 3 rd to 5 th frames of the current input being data processed by the model when the 2 nd input is processed, and with the 3 rd and 4 th frames of the current input being data processed by the model when the 1 st input is processed. It can be seen that repeated features (or similar features) among such adjacent inputs result in some degree of waste of computing resources.
In view of this, the present disclosure proposes that the output of the acoustic model may be modified by a Multi-Frame Prediction (MFP) method, changing the "one-to-one prediction mode" into a "one-to-many prediction mode". Specifically, because the input already contains the frame to be predicted together with its context, the acoustic model may be adapted to predict not only that frame but also one or more of the other frames included in the input. The number of frames for which a separate forward computation is needed can thus be reduced severalfold, and the occupation of device-side resources can accordingly be greatly reduced.
Further, as described in the Background section, wake-up performance is equally important under the constraint of small resource occupation. Because a wake-up word carries little context information, the wake-up decision depends entirely on the acoustic model, which pushes acoustic modeling toward larger structures with stronger capacity to represent the data; at the same time, wake-up has strict requirements on real-time rate and latency, which the computational cost of the acoustic model structure directly affects. There is therefore a certain tension between the two goals, and how to ensure good wake-up performance and real-time behavior without significantly increasing resource occupation remains the main problem faced by the prior art.
To obtain better analysis performance, the structure most widely adopted in current acoustic modeling is the deep neural network (Deep Neural Network, DNN). Compared with other neural network structures, a DNN has an obvious advantage in computational cost, but it cannot exploit long-term information, which limits the performance improvement it can offer.
To make up for this deficiency, a recurrent neural network based on long short-term memory units (Long Short-Term Memory Recurrent Neural Network, LSTM-RNN) can be adopted; the recurrent connections of the network and the ability of the LSTM units to store historical information improve model performance. However, both the structure of the LSTM unit and the recurrence mechanism require a large amount of computing resources, which is unfavorable for resource-constrained device-side products (e.g., mobile products).
The inventors of the present disclosure note that the feedforward sequential memory network (Feedforward Sequential Memory Networks, FSMN) introduces a memory module on top of a DNN, and a large performance improvement can be obtained at the cost of only a small increase in computation. Taking a model with four hidden layers of 512 nodes each as an example, with the same numbers of inputs and outputs, the per-frame computation of the FSMN is only about 1% higher than that of the DNN, while the computation of an LSTM is about 5 times that of the FSMN; and when FSMN and LSTM models with the same amount of computation are compared, the FSMN performs far better. Therefore, in the present disclosure, the acoustic model may adopt an FSMN model, which reduces resource occupation while improving wake-up performance.
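As a rough sanity check of the 1% figure (the memory orders below are assumed for illustration and are not values taken from this disclosure): for four hidden layers of 512 nodes each, the hidden-to-hidden computation per frame is on the order of $3 \times 512 \times 512 \approx 7.9 \times 10^5$ multiply-accumulates, whereas a memory module with, say, $N_1 + N_2 = 20$ taps adds only about $20 \times 512 \approx 1.0 \times 10^4$ element-wise multiply-adds per memory layer, i.e., on the order of 1% of the DNN cost.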
Aspects of the disclosure are further described below.
[ Multi-frame prediction ]
The implementation mechanism of the voice wake-up detection method of the present disclosure is described below with reference to fig. 2. Fig. 2 is a schematic flow chart diagram illustrating a voice wake detection method according to an embodiment of the present disclosure.
Referring to fig. 2, in step S210, frames of audio data within a predetermined range around a target frame in multi-frame audio data are input to the acoustic model component together with the target frame.
The target frame may be regarded as a currently pending frame in the multi-frame audio data, and the audio data frames within the predetermined range around the target frame may be audio data frames within a certain time period before and/or after the target frame. For example, it may be an audio data frame in a first predetermined range before the target frame in the multi-frame audio data, or an audio data frame in a second predetermined range after the target frame in the multi-frame audio data. Preferably, the audio data frames within the first predetermined range and the audio data frames within the second predetermined range may be included simultaneously, such that the input may contain the association information of the target frame context at the same time.
In general, if the first predetermined range and the second predetermined range are set too small, the input contains only limited context information about the target frame, which lowers the accuracy of the state recognition result produced by the acoustic model component for the target frame; if they are set too large, computing resources are wasted. Accordingly, specific values of the first predetermined range and the second predetermined range may be determined experimentally. In the present disclosure, the first predetermined range and the second predetermined range may each cover at least a single frame length, and preferably an integer multiple of the frame length. In other words, the audio data frames within the predetermined range around the target frame may or may not amount to an integer number of frames, and the present disclosure is not limited in this respect. As a preferred embodiment, the audio data frames may comprise one or several frames of audio data preceding and/or following the target frame.
The input thus consists of the target frame currently to be analyzed together with the audio data frames within the predetermined range around it, for example audio data of a certain number of frame lengths before and after the target frame. Accordingly, the acoustic model may be adapted so that its output is the state recognition result (i.e., the prediction result) of at least one frame of audio data among the target frame and the audio data within the predetermined range. In the present disclosure, an acoustic model component may be considered an aggregate of software and/or hardware resources capable of implementing the processing functions of the acoustic model, so the output of the acoustic model is the output of the acoustic model component. The structure of the acoustic model component and the state recognition result are described in detail below.
It should be noted that, in order to improve accuracy of the output result of the acoustic model component, the "at least one frame of audio data" referred to in the present disclosure may refer to any one or more frames of all complete frames of audio data included in the audio data within a predetermined range. For example, in the case where the predetermined range is audio data of two frames after the target frame, the input may be regarded as three frames of audio data including the target frame. For the target frame, the latter two frames of audio data may be regarded as the context information of the target frame, for the intermediate frame of audio data, the target frame and the last frame of audio data may be regarded as the context information of the frame, and for the last frame of audio data of the target frame, the target frame and the intermediate frame of audio data may be regarded as the context information of the frame. Accordingly, the acoustic model component may be adapted to predict the target frame, the intermediate frame, and the last frame audio data, respectively, to obtain analysis results of the target frame, the intermediate frame, and the last frame, respectively. Of course, the acoustic model component may also be adapted to predict the target frame and the intermediate frame, respectively, to obtain analysis results of the target frame and the intermediate frame, respectively.
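As a minimal sketch of how such a spliced input window might be constructed (the feature representation, padding strategy, and context sizes below are assumptions for illustration, not requirements of this disclosure):

```python
import numpy as np

def build_input_window(features, target_idx, left_ctx=0, right_ctx=3):
    """Splice the target frame with `left_ctx` frames before it and
    `right_ctx` frames after it into one flat input vector.
    `features` is assumed to be a (num_frames, feature_dim) array,
    one row of acoustic features per frame."""
    num_frames, _ = features.shape
    window = []
    for i in range(target_idx - left_ctx, target_idx + right_ctx + 1):
        # Repeat the edge frame when the window runs past either end.
        j = min(max(i, 0), num_frames - 1)
        window.append(features[j])
    return np.concatenate(window)
```

For such an input, the acoustic model component would then emit state scores not only for the frame at `target_idx` but also for one or more of the spliced context frames, as described above.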
In step S220, a single frame of audio data that is located after the target frame and has not yet been processed is taken as the next target frame, and the subsequent target frames are processed iteratively using the acoustic model component.
Originally, the multi-frame audio data would have to be fed into the acoustic model component frame by frame to obtain a prediction result for each frame. With the audio analysis scheme of the present disclosure, when the states of the multi-frame audio data are recognized with the acoustic model component, the input can advance by a preset interval (one frame or several frames) each time, so that the amount of computation is reduced to 1/N of the original, which greatly reduces the computing resources occupied on device-side products. Here N may be an integer greater than or equal to 2, and its specific value may be set according to the actual situation, which is not limited by the present disclosure.
As shown in fig. 3A and 3B, the scale marks 0 to 10 represent continuous multi-frame audio data. In the present disclosure, the audio data of the segments 0 to 1 may be regarded as 1 st frame audio data, the audio data of the segments 1 to 2 may be regarded as 2 nd frame audio data, and so on. Assume that for the input frame to be predicted currently, an audio data frame of 3 frames length after the input frame is spliced as input to the acoustic model component. In predicting the 1 st frame of audio data, the 1 st to 4 th frames of audio data may be taken as input. Unlike FIG. 1, for the 1 st input, the acoustic model component may predict the state of the 1 st frame and one or more frames following the 1 st frame. Since the 1 st input includes audio data of 1 st to 4 th frames, theoretically, the acoustic model component may be adapted to predict states of 1 st, 2 nd, 3 rd and 4 th frames, respectively, to obtain state recognition results of 1 st, 2 nd, 3 rd and 4 th frames, respectively. However, in view of the accuracy of the prediction, the acoustic model component may preferably predict the state of the frame data with context in the input, e.g. the acoustic model component may predict the state of the 1 st, 2 nd and 3 rd frame audio data, respectively.
As shown in fig. 3A, for example, for the 1st input, the acoustic model component may predict the states of the 1st frame and the frame following it (i.e., the 2nd frame) to obtain the state recognition results of the 1st and 2nd frames, respectively. After the 1st input has been processed, the acoustic model component takes the not-yet-analyzed 3rd frame of audio data as the current target frame, splices the 3 frames of audio data after the 3rd frame to it as the 2nd input, and predicts the 3rd frame and the frame following it (i.e., the 4th frame) to obtain their prediction results. Prediction is thus carried out at intervals of one frame, so that the amount of computation is reduced to 1/2 of the original.
As shown in fig. 3B, for example, for the 1st input, the acoustic model component may predict the 1st frame and the two frames following it (i.e., the 2nd and 3rd frames) to obtain the prediction results (i.e., state recognition results) of the 1st, 2nd, and 3rd frames, respectively. After the 1st input has been processed, the acoustic model takes the unprocessed 4th frame of audio data as the current target frame, splices the 3 frames of audio data after the 4th frame to it as the 2nd input, and predicts the states of the 4th frame and the two frames following it (i.e., the 5th and 6th frames) to obtain the prediction results of the 4th, 5th, and 6th frames, respectively. Prediction is thus carried out at intervals of two frames, so that the amount of computation is reduced to 1/3 of the original.
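A hedged sketch of this iteration follows, reusing `build_input_window` and `np` from the earlier sketch; the `predict_states` method and the group size `frames_per_call` are illustrative assumptions rather than an API defined by this disclosure:

```python
def run_acoustic_model(features, model, right_ctx=3, frames_per_call=2):
    """Walk over target frames at a stride of `frames_per_call`,
    collecting one state-recognition result per frame."""
    num_frames = features.shape[0]
    results = [None] * num_frames
    target = 0
    while target < num_frames:
        window = build_input_window(features, target, right_ctx=right_ctx)
        # One forward pass is assumed to return per-state scores for the
        # target frame and the (frames_per_call - 1) frames after it.
        scores = model.predict_states(window)
        for k, frame_scores in enumerate(scores[:frames_per_call]):
            if target + k < num_frames:
                results[target + k] = int(np.argmax(frame_scores))
        target += frames_per_call  # frames already predicted are skipped
    return results
```

With `frames_per_call=2` this corresponds to the Fig. 3A case (computation roughly halved); with `frames_per_call=3` it corresponds to Fig. 3B.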
In step S230, the state recognition result of the audio data of the multiple frames in the multiple frames of audio data is compared with the preset wake-up word to recognize whether the multiple frames of audio data are wake-up instructions.
The multi-frame audio data referred to in this disclosure may be obtained by framing a detected speech input. For example, a user's voice input may be detected in real time, and then the detected voice input may be subjected to framing processing to obtain multi-frame audio data.
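A minimal framing sketch for this step, reusing `np` from above and assuming 16 kHz audio with conventional 25 ms windows and a 10 ms shift (these values are illustrative assumptions, not parameters stated in this disclosure):

```python
def frame_signal(samples, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a 1-D waveform (numpy array) into overlapping frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    frames = [samples[start:start + frame_len]
              for start in range(0, len(samples) - frame_len + 1, shift)]
    return np.stack(frames) if frames else np.empty((0, frame_len))
```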
For each input, the acoustic model component may be configured to predict a state of at least one frame of audio data in the target frame and the audio frames within the predetermined range, for example, the acoustic model component may be configured to calculate a score (i.e., probability) of the at least one frame of audio data in the target frame and the audio frames within the predetermined range in each state, and the state with the highest score may be used as a state recognition result of the corresponding frame.
Thus, the state of each frame of audio data can be determined based on the state recognition result of the frame of audio data, phonemes can be recognized based on the states of several consecutive frames of audio data, and a plurality of phonemes can be combined into words. Therefore, according to the state identification result of a plurality of frames in the multi-frame audio data, whether the multi-frame audio data contains a wake-up instruction or not can be identified. For example, the state recognition results of the plurality of frames may be compared with a preset wake-up word, and if the state recognition results of the audio data of the plurality of frames are identical to the wake-up word, it may be determined that the multi-frame audio data includes a wake-up instruction. In the case that the multi-frame audio data is determined to contain a wake-up instruction, a subsequent wake-up operation may be performed, which will not be described again.
As an example, multiple path models may be preset, with different path models corresponding to different wake-up recognition results. Based on the state recognition results of the audio data of a plurality of frames in the multi-frame audio data, a path model matching those results can be searched for among the preset path models, so as to identify whether the multi-frame audio data is a wake-up instruction. The path models may include wake-up instruction models (which may also be referred to as "keyword models"), a filler model (Filler Model), and a silence model (Silence Model). There may be multiple wake-up instruction models, each corresponding to a different wake-up instruction (i.e., wake-up word); for example, there may be wake-up instruction models corresponding to wake-up instructions such as "open", "play", and "I want to see". The filler model serves as a filler that characterizes the audio of the non-wake-up-instruction portions. The silence model refers to an audio model for segments with no speech input.
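A deliberately simplified, purely illustrative stand-in for this matching step is sketched below; a real system would typically run a Viterbi-style search over the keyword, filler, and silence paths, so the subsequence check here only conveys the idea (all names and the expected-state sequences are assumptions):

```python
def detect_wake_word(frame_states, keyword_paths):
    """frame_states: per-frame state IDs from the acoustic model component.
    keyword_paths: dict mapping each wake-up word to its expected state sequence.
    Returns the first wake-up word whose state sequence appears, in order,
    in the frame-level results; otherwise None (treated as filler / silence)."""
    for word, path in keyword_paths.items():
        pos = 0
        for state in frame_states:
            if pos < len(path) and state == path[pos]:
                pos += 1
        if pos == len(path):
            return word
    return None
```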
[ Acoustic model ]
In the present disclosure, to improve the analytical performance of the acoustic model component, the acoustic model component may be an FSMN model component. Moreover, the present disclosure also modifies the output of the acoustic model component to enable the acoustic model component to predict multiple frames in the input separately.
Fig. 4 is a schematic diagram illustrating a network structure of an acoustic model component according to an embodiment of the present disclosure.
As shown in fig. 4, the network structure of the acoustic model component may include an Input Layer (Input Layer), a Hidden Layer structure (Hidden Layer), and a plurality of Output layers (Output Layer). The output layers are used for respectively predicting analysis results of the audio data of the different frames in the input.
The hidden layer structure may include a plurality of hidden layers, and the plurality of output layers may all be connected to the last hidden layer. Whereas training originally prepares one target value for each frame, the acoustic model of the present disclosure needs to provide target values for the current frame and the next N frames. In actual use, each input then produces outputs for multiple frames, so the model only needs to be run once every several frames and the amount of computation is reduced to 1/N of the original; the computing resources saved in this way are valuable for device-side products with scarce resources.
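As a hedged sketch of the "shared hidden layers feeding several output layers" idea (a toy numpy forward pass reusing `np` from above; the dimensions, the number of output layers, and the random initialization are assumptions for illustration only):

```python
class MultiFrameOutputHeads:
    """Several softmax output layers attached to the same last hidden layer,
    one head per predicted frame (the target frame plus the frames after it)."""

    def __init__(self, hidden_dim, num_states, num_heads, seed=0):
        rng = np.random.default_rng(seed)
        self.heads = [(rng.standard_normal((num_states, hidden_dim)) * 0.01,
                       np.zeros(num_states)) for _ in range(num_heads)]

    def __call__(self, last_hidden):
        outputs = []
        for weight, bias in self.heads:
            logits = weight @ last_hidden + bias
            probs = np.exp(logits - logits.max())
            outputs.append(probs / probs.sum())  # state posteriors for one frame
        return outputs
```

During training, each head would be given the target state label of its own frame, matching the description above that target values are provided for the current frame and the following frames.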
In the present disclosure, the audio data frames within the predetermined range around the target frame may be spliced with the target frame and fed directly to the input layer, which performs feature extraction on the input data and passes the result to the hidden layer structure. Alternatively, the audio data frames within the predetermined range may first be spliced with the target frame, features may be extracted from the spliced audio data, and the extracted features may then be supplied to the input layer, which forwards them to the hidden layer structure.
The hidden layer structure may adopt an FSMN structure. The FSMN differs from conventional DNN layers in that a memory module is arranged between adjacent hidden layers, and the memory module is used to store the historical information and future information that are useful for judging the current target frame. The output of the memory module serves as the input of the next hidden layer, and it may include the output of the current hidden layer, the outputs of the hidden layer up to a predetermined look-back order, and the outputs of the hidden layer up to a predetermined look-ahead order.
Fig. 5 is a schematic diagram illustrating the structure of an FSMN.
As shown in fig. 5, the FSMN differs from a conventional DNN layer in that a memory module B is added: part of the past and future information is stored in memory module B, processed there, and then passed on to the next hidden layer, which gives the network the ability to handle long-term information. To reduce the amount of computation, the output of the preceding hidden layer can first be projected into a module A whose dimension is set smaller than that of the hidden layer; this is equivalent to splitting the parameter matrix between the preceding hidden layer and module B into two parts, and with a reasonably sized module A the computation can be reduced without loss of performance. The computational expressions of the FSMN layer are shown below.
$$h_t^{l+1} = f\left(U^l \tilde{h}_t^l + b^{l+1}\right) \qquad (1)$$

$$\tilde{h}_t^l = h_t^l + \sum_{i=1}^{N_1} a_i^l \odot h_{t-s_1 \cdot i}^l + \sum_{j=1}^{N_2} c_j^l \odot h_{t+s_2 \cdot j}^l \qquad (2)$$

$$h_t^l = f\left(W^l h_t^{l-1} + b^l\right) \qquad (3)$$

where $h_t^{l+1}$ denotes the input of the $(l+1)$-th hidden layer, obtained through the nonlinear transformation of the activation function $f$; $U^l$ denotes a weight matrix; $\tilde{h}_t^l$ denotes the output of the memory module; $b^{l+1}$ denotes a bias; $h_t^l$ denotes the output of the $l$-th hidden layer; $h_t^{l-1}$ denotes the input of the $l$-th hidden layer; $W^l$ denotes a weight matrix; $b^l$ denotes a bias; $t$ denotes the current time; $s_1$ and $s_2$ denote the encoding stride factors for historical and future times, respectively; $N_1$ and $N_2$ denote the look-back order and the look-ahead order, respectively; and $a_i^l$ and $c_j^l$ are the encoding coefficients of the memory module.
According to the above formulas, the output of the memory module is the sum of the output of the current hidden layer, the hidden-layer outputs up to the predetermined look-back order, and the hidden-layer outputs up to the predetermined look-ahead order. The term $\sum_{i=1}^{N_1} a_i^l \odot h_{t-s_1 \cdot i}^l$ can be regarded as the hidden-layer outputs up to the predetermined look-back order, i.e., the result of element-wise multiplication of the hidden-layer outputs at the different look-back orders before the current time $t$ (with encoding stride $s_1$) by the corresponding encoding coefficients. The term $\sum_{j=1}^{N_2} c_j^l \odot h_{t+s_2 \cdot j}^l$ can be regarded as the hidden-layer outputs up to the predetermined look-ahead order, i.e., the result of element-wise multiplication of the hidden-layer outputs at the different look-ahead orders after the current time $t$ (with encoding stride $s_2$) by the corresponding encoding coefficients.
The difference in computation between the FSMN and the DNN comes from equation (2). Concrete calculations show that, for similar network structures (the same or similar numbers of layers and nodes per layer), the number of floating-point operations of the FSMN is close to that of the DNN, whereas the computation of an LSTM exceeds twice that of the DNN. The computation introduced by the FSMN is thus much smaller than that introduced by an LSTM of comparable structure, so the model can keep the real-time rate under control while possessing the long-term information modeling capability that a DNN lacks, and its performance is better than that of the LSTM.
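A minimal numpy sketch of the memory module of equation (2) and the layer transition of equation (1); the orders, strides, activation function, and dimensions are illustrative assumptions:

```python
def fsmn_memory(h, t, a, c, s1=1, s2=1):
    """h: (T, D) array of hidden-layer outputs for all frames;
    t: index of the current frame;
    a: (N1, D) look-back encoding coefficients;
    c: (N2, D) look-ahead encoding coefficients.
    Returns the memory-module output for frame t (equation (2))."""
    T, _ = h.shape
    out = h[t].copy()
    for i in range(1, a.shape[0] + 1):      # look-back terms
        idx = t - s1 * i
        if idx >= 0:
            out += a[i - 1] * h[idx]        # element-wise product
    for j in range(1, c.shape[0] + 1):      # look-ahead terms
        idx = t + s2 * j
        if idx < T:
            out += c[j - 1] * h[idx]
    return out

def next_layer_input(memory_out, U, b):
    """Equation (1), with ReLU standing in for the activation f (an assumption)."""
    return np.maximum(U @ memory_out + b, 0.0)
```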
Fig. 6 is a network architecture diagram illustrating acoustic model components according to an embodiment of the present disclosure.
As shown in fig. 6, the network structure of the acoustic model assembly may include an input layer, a hidden layer structure composed of a DNN layer and an FSMN layer, and a plurality of output layers. The DNN layer structure is well known to those skilled in the art, and will not be described here. For the description of the input layer, the FSMN layer, and the plurality of output layers, reference should be made to the above description, and further description is omitted here.
Fig. 7 is a structural framework diagram illustrating a voice wake system in accordance with an embodiment of the present disclosure.
As shown in fig. 7, the voice wake system of the present disclosure mainly includes a detection module 710, an acoustic prediction module 720, and a keyword detection module 730.
The detection module 710 may detect a user's voice input in real time and may frame the detected voice input to obtain multi-frame audio data.
The acoustic prediction module 720 may predict the state recognition result of each frame of audio data in the multi-frame audio data. During prediction, the acoustic prediction module 720 may splice the audio data within a predetermined range around the target frame to be analyzed with the target frame as the input and feed it to a pre-trained acoustic model component, which predicts the state recognition result of at least one frame of audio data among the target frame and the frames within the predetermined range. A single frame of audio data that is located after the target frame and has not yet been predicted is then taken as the next target frame to be analyzed, so that the acoustic prediction module 720 can iteratively process the subsequent target frames using the acoustic model component. For the network structure of the acoustic model component, reference is made to the description above, which is not repeated here.
Based on the state recognition results of the audio data of a plurality of frames in the multi-frame audio data, the keyword detection module 730 may search for a path model matching those results among the plurality of path models. The path models may be divided into keyword models, a filler model, and a silence model. When the state recognition results are found to match a keyword model, it can be determined that the user has issued a wake-up instruction, and the device can then be controlled to start, thereby realizing voice wake-up of the device.
[ Voice wake-up detection device ]
The voice wake-up detection method of the present disclosure may also be implemented as a voice wake-up detection apparatus.
Fig. 8 is a schematic block diagram showing a structure of a voice wake-up detection apparatus according to an embodiment of the present disclosure. The functional module of the voice wake-up detection device may be implemented by hardware, software or a combination of hardware and software for implementing the principles of the present invention. Those skilled in the art will appreciate that the functional modules depicted in fig. 8 may be combined or divided into sub-modules to implement the principles of the invention described above. Accordingly, the description herein may support any possible combination, or division, or even further definition of the functional modules described herein.
The following is a brief description of the functional modules that the voice wake-up detection apparatus may have and the operations that each functional module may perform, and details related thereto may be referred to the above description in connection with fig. 2 to 6, which are not repeated here.
Referring to fig. 8, the voice wake-up detection apparatus 800 includes a state recognition module 810 and a wake-up recognition module 820. The state recognition module 810 is configured to splice the audio data frames within a predetermined range around a target frame in the multi-frame audio data with the target frame as the input and feed it to a pre-trained acoustic model component, where the acoustic model component is a feedforward sequential memory network (FSMN) model component and its output is the state recognition result of at least one frame of audio data among the target frame and the audio data frames within the predetermined range. The state recognition module 810 may take a single frame of audio data that is located after the target frame and has not yet been predicted as the next target frame, and iteratively process the subsequent target frames using the acoustic model component.
The wake-up recognition module 820 may recognize whether the multi-frame audio data is a wake-up instruction based on the state recognition results of the audio data of a plurality of frames in the multi-frame audio data. For example, the wake-up recognition module 820 may compare the state recognition results of the audio data of the plurality of frames with a preset wake-up word to recognize whether the multi-frame audio data is a wake-up instruction. As an example, the wake-up recognition module 820 may search, among a plurality of path models, for a path model that matches the state recognition results of the audio data of the plurality of frames, so as to identify whether the multi-frame audio data is a wake-up instruction, where different path models correspond to different recognition results. The path models may include a wake-up instruction model, a filler model, and a silence model.
In the present disclosure, the audio data frame within the predetermined range may include: an audio data frame located in a first preset range before the target frame in the multi-frame audio data; and/or frames of audio data in a second predetermined range following the target frame in the multi-frame audio data.
As shown in fig. 8, the voice wake-up detection apparatus 800 may optionally further include a detection module 830 and a framing module 840, which are shown in dashed boxes. The detecting module 830 is configured to detect a voice input of a user in real time, and the framing module 840 is configured to frame the detected voice input to obtain multi-frame audio data.
As shown in fig. 4, in the present embodiment, the network structure of the acoustic model component may include: an input layer; a hidden layer structure; and the plurality of output layers are used for respectively predicting the analysis results of the audio data of different frames in the input.
The hidden layer structure may include a plurality of hidden layers, wherein a memory module is provided between at least two adjacent hidden layers for storing the historical information and future information useful for judging the current target frame. The output of the memory module serves as the input of the next hidden layer, and it includes the output of the current hidden layer, the outputs of the hidden layer up to the predetermined look-back order, and the outputs of the hidden layer up to the predetermined look-ahead order.
The computational expression of the hidden layer is shown below.
$$h_t^{l+1} = f\left(U^l \tilde{h}_t^l + b^{l+1}\right)$$

$$\tilde{h}_t^l = h_t^l + \sum_{i=1}^{N_1} a_i^l \odot h_{t-s_1 \cdot i}^l + \sum_{j=1}^{N_2} c_j^l \odot h_{t+s_2 \cdot j}^l$$

$$h_t^l = f\left(W^l h_t^{l-1} + b^l\right)$$

where $h_t^{l+1}$ denotes the input of the $(l+1)$-th hidden layer, obtained through the nonlinear transformation of the activation function $f$; $U^l$ denotes a weight matrix; $\tilde{h}_t^l$ denotes the output of the memory module; $b^{l+1}$ denotes a bias; $h_t^l$ denotes the output of the $l$-th hidden layer; $h_t^{l-1}$ denotes the input of the $l$-th hidden layer; $W^l$ denotes a weight matrix; $b^l$ denotes a bias; $t$ denotes the current time; $s_1$ and $s_2$ denote the encoding stride factors for historical and future times, respectively; $N_1$ and $N_2$ denote the look-back order and the look-ahead order, respectively; and $a_i^l$ and $c_j^l$ are the encoding coefficients of the memory module. The term $\sum_{i=1}^{N_1} a_i^l \odot h_{t-s_1 \cdot i}^l$ can be regarded as the hidden-layer outputs up to the predetermined look-back order, i.e., the result of element-wise multiplication of the hidden-layer outputs at the different look-back orders before the current time $t$ (with encoding stride $s_1$) by the corresponding encoding coefficients; the term $\sum_{j=1}^{N_2} c_j^l \odot h_{t+s_2 \cdot j}^l$ can be regarded as the hidden-layer outputs up to the predetermined look-ahead order, i.e., the result of element-wise multiplication of the hidden-layer outputs at the different look-ahead orders after the current time $t$ (with encoding stride $s_2$) by the corresponding encoding coefficients.
[ computing device ]
FIG. 9 is a schematic diagram of a computing device that may be used to implement the audio analysis and voice wake-up detection methods described above according to an embodiment of the present invention.
Referring to fig. 9, a computing device 900 includes a memory 910 and a processor 920.
Processor 920 may be a multi-core processor or may include multiple processors. In some embodiments, processor 920 may include a general-purpose host processor and one or more special coprocessors such as, for example, a Graphics Processor (GPU), a Digital Signal Processor (DSP), etc. In some embodiments, the processor 920 may be implemented using custom circuitry, for example, an application specific integrated circuit (ASIC, application Specific Integrated Circuit) or a field programmable gate array (FPGA, field Programmable Gate Arrays).
Memory 910 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions required by the processor 920 or other modules of the computer. The persistent storage may be a readable and writable, non-volatile storage device that does not lose stored instructions and data even after the computer is powered down. In some embodiments, the persistent storage is a mass storage device (e.g., a magnetic or optical disk, or flash memory); in other embodiments, it may be a removable storage device (e.g., a diskette or an optical drive). The system memory may be a read-write memory device or a volatile read-write memory device, such as dynamic random access memory, and may store instructions and data needed by some or all of the processors at runtime. Furthermore, memory 910 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory), and magnetic and/or optical disks may also be employed. In some implementations, memory 910 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, a super-density disc, a flash memory card (e.g., an SD card, a mini SD card, a micro-SD card, etc.), or a magnetic floppy disk. Computer-readable storage media do not contain carrier waves or transient electronic signals transmitted wirelessly or over wires.
The memory 910 has executable code stored thereon that, when executed by the processor 920, causes the processor 920 to perform the audio analysis and voice wake-up detection methods described above.
The audio analysis and voice wake-up detection methods, apparatus, and computing devices according to the present invention have been described in detail above with reference to the accompanying drawings.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for performing the steps defined in the above-mentioned method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill in the art would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
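To tie the description together before the claims, the following is a hedged end-to-end sketch of the detection flow: the acoustic model scores the target frame plus one or more following frames in a single pass, the target frame then jumps to the first frame not yet processed, and the accumulated per-frame states are compared against the preset wake-up word. The callable model, the helper match_wake_word, the greedy arg-max state picking, and the simplified sequence comparison are assumptions of this sketch, not the path-model decoding of the embodiment.

```python
import numpy as np

def detect_wake_word(frames, model, context, wake_states):
    """frames: list/array of per-frame feature vectors.
    model(window, target_index) is assumed to return state posteriors for the
    target frame and for the frames it also predicts after the target frame.
    wake_states: preset state sequence corresponding to the wake-up word."""
    T = len(frames)
    states = [None] * T                     # best state id per frame
    t = 0
    while t < T:
        lo, hi = max(0, t - context), min(T, t + context + 1)
        posteriors = model(frames[lo:hi], t - lo)
        # one forward pass yields results for the target frame and later frames,
        # so several frames are filled in at once
        for offset, p in enumerate(posteriors):
            if t + offset < T:
                states[t + offset] = int(np.argmax(p))
        t += max(1, len(posteriors))        # next target = first unprocessed frame
    return match_wake_word(states, wake_states)

def match_wake_word(states, wake_states):
    # collapse consecutive repeats, then look for the wake-word state sequence
    collapsed = [s for i, s in enumerate(states) if i == 0 or s != states[i - 1]]
    n = len(wake_states)
    return any(collapsed[i:i + n] == list(wake_states)
               for i in range(len(collapsed) - n + 1))
```

In the embodiment, the comparison step corresponds to searching among preset path models (the wake instruction, whitening, and mute models of the claims below); the substring-style check here is only a stand-in to keep the sketch short.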

Claims (12)

1. A method for detecting voice wakeup, comprising:
inputting an audio data frame within a preset range near a target frame in multi-frame audio data, together with the target frame, into an acoustic model component, wherein the acoustic model component is a feedforward sequence memory neural network model component, the output of the acoustic model component is a state recognition result of the target frame and a state recognition result of at least one frame of audio data among the audio data frames within the preset range, and the at least one frame of audio data comprises one or more frames located after the target frame;
taking, as the next target frame, the first unprocessed single frame of audio data located after the target frame in the multi-frame audio data, and iteratively using the acoustic model component to process a plurality of subsequent target frames; and
comparing the state recognition results of the audio data of a plurality of frames in the multi-frame audio data with a preset wake-up word to recognize whether the multi-frame audio data is a wake-up instruction or not.
2. The voice wakeup detection method according to claim 1, wherein the audio data frames within the preset range include:
an audio data frame located within a first preset range before the target frame in the multi-frame audio data; and/or
an audio data frame located within a second preset range after the target frame in the multi-frame audio data.
3. The voice wakeup detection method according to claim 1, further comprising:
detecting voice input of a user in real time; and
and framing the detected voice input to obtain the multi-frame audio data.
4. The voice wake-up detection method according to claim 1, wherein the step of comparing the state recognition result of the audio data of the plurality of frames in the multi-frame audio data with a preset wake-up word comprises:
searching, from a plurality of preset path models, for a path model that matches the state recognition results of the audio data of the plurality of frames, so as to recognize whether the multi-frame audio data is a wake-up instruction, wherein different path models correspond to different recognition results.
5. The voice wakeup detection method of claim 4, wherein the path model includes:
a wake instruction model;
a whitening model; and
and (5) a mute model.
6. The method of claim 1, wherein the acoustic model component comprises:
an input layer;
a hidden layer structure; and
and the output layers are used for respectively predicting the analysis results of the audio data of different frames in the input.
7. The method for detecting voice wakeup according to claim 6, wherein,
the hidden layer structure comprises a plurality of hidden layers, wherein a memory module is arranged between at least two adjacent hidden layers and is used for storing history information and future information which are useful for judging the current target frame.
8. The method for detecting voice wakeup according to claim 7, wherein,
the output of the memory module is used as the input of the next hidden layer,
the output of the memory module comprises the output of the current hidden layer, the output of the hidden layer at the preset lookback order, and the output of the hidden layer at the preset look-ahead order.
9. The method for detecting voice wakeup according to claim 8, wherein,
$$x_t^{\,l+1} = f\left(U^{l}\,\tilde{h}_t^{\,l} + v^{\,l+1}\right)$$
$$h_t^{\,l} = f\left(W^{l}\,x_t^{\,l} + b^{l}\right)$$
$$\tilde{h}_t^{\,l} = h_t^{\,l} + \sum_{i=1}^{N_1} a_i^{\,l}\odot h_{t-s_1\cdot i}^{\,l} + \sum_{j=1}^{N_2} c_j^{\,l}\odot h_{t+s_2\cdot j}^{\,l}$$
wherein $x_t^{\,l+1}$ represents the input of the (l+1)-th hidden layer, obtained by a nonlinear transformation with the activation function $f$, $U^{l}$ represents the weight, $\tilde{h}_t^{\,l}$ represents the output of the memory module, $v^{\,l+1}$ represents the offset, $h_t^{\,l}$ represents the output of the l-th hidden layer, $x_t^{\,l}$ represents the input of the l-th hidden layer, $W^{l}$ represents the weights, $b^{l}$ represents the bias, $t$ represents the current time, $s_1$ and $s_2$ represent the coding stride factors of the historical and future times respectively, $N_1$ and $N_2$ represent the lookback (review) order and the look-ahead order respectively, and $a_i^{\,l}$ and $c_j^{\,l}$ are the coding coefficients of the memory module.
10. A voice wake-up detection apparatus, comprising:
the state recognition module is used for inputting an audio data frame within a preset range near a target frame in multi-frame audio data, together with the target frame, into an acoustic model component, wherein the acoustic model component is a feedforward sequence memory neural network model component, the output of the acoustic model component is a state recognition result of the target frame and a state recognition result of at least one frame of audio data among the audio data frames within the preset range, and the at least one frame of audio data comprises one or more frames located after the target frame; the state recognition module takes, as the next target frame to be analyzed, the first unpredicted single frame of audio data located after the target frame in the multi-frame audio data, and iteratively uses the acoustic model component to process a plurality of target frames; and
the wake-up recognition module is used for comparing the state recognition results of the audio data of a plurality of frames in the multi-frame audio data with a preset wake-up word, so as to recognize whether the multi-frame audio data is a wake-up instruction.
11. A computing device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor causes the processor to perform the method of any of claims 1-9.
12. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method of any of claims 1 to 9.
CN201810637168.1A 2018-06-20 2018-06-20 Voice wakeup detection method, device, equipment and storage medium Active CN110619871B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810637168.1A CN110619871B (en) 2018-06-20 2018-06-20 Voice wakeup detection method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810637168.1A CN110619871B (en) 2018-06-20 2018-06-20 Voice wakeup detection method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110619871A CN110619871A (en) 2019-12-27
CN110619871B true CN110619871B (en) 2023-06-30

Family

ID=68920779

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810637168.1A Active CN110619871B (en) 2018-06-20 2018-06-20 Voice wakeup detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110619871B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11741943B2 (en) * 2020-04-27 2023-08-29 SoundHound, Inc Method and system for acoustic model conditioning on non-phoneme information features
CN111640440B (en) * 2020-04-30 2022-12-30 华为技术有限公司 Audio stream decoding method, device, storage medium and equipment
CN111755029B (en) * 2020-05-27 2023-08-25 北京大米科技有限公司 Voice processing method, device, storage medium and electronic equipment
CN111653270B (en) * 2020-08-05 2020-11-20 腾讯科技(深圳)有限公司 Voice processing method and device, computer readable storage medium and electronic equipment
CN111798859A (en) * 2020-08-27 2020-10-20 北京世纪好未来教育科技有限公司 Data processing method and device, computer equipment and storage medium
CN112882394A (en) * 2021-01-12 2021-06-01 北京小米松果电子有限公司 Device control method, control apparatus, and readable storage medium
CN115101063B (en) * 2022-08-23 2023-01-06 深圳市友杰智新科技有限公司 Low-computation-power voice recognition method, device, equipment and medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015059946A1 (en) * 2013-10-22 2015-04-30 日本電気株式会社 Speech detection device, speech detection method, and program
US9818409B2 (en) * 2015-06-19 2017-11-14 Google Inc. Context-dependent modeling of phonemes
CN106940998B (en) * 2015-12-31 2021-04-16 阿里巴巴集团控股有限公司 Execution method and device for setting operation
CN105845128B (en) * 2016-04-06 2020-01-03 中国科学技术大学 Voice recognition efficiency optimization method based on dynamic pruning beam width prediction
CN107767861B (en) * 2016-08-22 2021-07-02 科大讯飞股份有限公司 Voice awakening method and system and intelligent terminal
CN106782536B (en) * 2016-12-26 2020-02-28 北京云知声信息技术有限公司 Voice awakening method and device
CN107358951A (en) * 2017-06-29 2017-11-17 阿里巴巴集团控股有限公司 A kind of voice awakening method, device and electronic equipment
CN107680597B (en) * 2017-10-23 2019-07-09 平安科技(深圳)有限公司 Audio recognition method, device, equipment and computer readable storage medium
CN107871506A (en) * 2017-11-15 2018-04-03 北京云知声信息技术有限公司 The awakening method and device of speech identifying function

Also Published As

Publication number Publication date
CN110619871A (en) 2019-12-27

Similar Documents

Publication Publication Date Title
CN110619871B (en) Voice wakeup detection method, device, equipment and storage medium
CN108010515B (en) Voice endpoint detection and awakening method and device
US10650807B2 (en) Method and system of neural network keyphrase detection
US11664020B2 (en) Speech recognition method and apparatus
US20230410796A1 (en) Encoder-decoder models for sequence to sequence mapping
US20230409102A1 (en) Low-power keyword spotting system
JP6758406B2 (en) Wide and deep machine learning model
KR102622357B1 (en) End-to-end streaming keyword spotting
CN109065044B (en) Awakening word recognition method and device, electronic equipment and computer readable storage medium
US20190258660A1 (en) System and method for summarizing a multimedia content item
US20110137653A1 (en) System and method for restricting large language models
CN110622176A (en) Video partitioning
CN110097870B (en) Voice processing method, device, equipment and storage medium
US20230162724A9 (en) Keyword spotting apparatus, method, and computer-readable recording medium thereof
CN110070859B (en) Voice recognition method and device
CN112652306B (en) Voice wakeup method, voice wakeup device, computer equipment and storage medium
CN115132209B (en) Speech recognition method, apparatus, device and medium
Morioka et al. Multiscale recurrent neural network based language model.
JP7044856B2 (en) Speech recognition model learning methods and systems with enhanced consistency normalization
WO2020214254A1 (en) Layer trajectory long short-term memory with future context
CN114841325A (en) Data processing method and medium of neural network model and electronic device
CN111582456B (en) Method, apparatus, device and medium for generating network model information
US20220050971A1 (en) System and Method for Generating Responses for Conversational Agents
CN114416863A (en) Method, apparatus, and medium for performing model-based parallel distributed reasoning
US11900921B1 (en) Multi-device speech processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant