CN111508493A - Voice wake-up method and device, electronic equipment and storage medium - Google Patents

Voice wake-up method and device, electronic equipment and storage medium

Info

Publication number
CN111508493A
Authority
CN
China
Prior art keywords
output
voice
matching
probability
terminal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010312299.XA
Other languages
Chinese (zh)
Other versions
CN111508493B (en)
Inventor
宋天龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202010312299.XA priority Critical patent/CN111508493B/en
Publication of CN111508493A publication Critical patent/CN111508493A/en
Application granted granted Critical
Publication of CN111508493B publication Critical patent/CN111508493B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225 Feedback of the input speech

Abstract

The application discloses a voice wake-up method and device, an electronic device and a storage medium, relating to the technical field of voice processing. The method includes: acquiring input voice collected by an audio collector; matching the input voice based on a first voice matching model to obtain a first probability output, where the first probability output indicates the probability that the input voice contains a specified text; acquiring at least one probability output produced by the first voice matching model before the current first probability output as a second probability output; fusing the first probability output and the second probability output to obtain an updated first probability output; taking the updated first probability output as a first matching result of the first voice matching model matching the input voice; and waking up the terminal if the first matching result indicates that the input voice contains the specified text. By fusing the historical output with the current output, the method and device improve keyword recognition accuracy and reduce the false wake-up rate.

Description

Voice wake-up method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of voice processing technologies, and in particular, to a voice wake-up method and apparatus, an electronic device, and a storage medium.
Background
With the rapid development of voice processing technology, voice conversation functions are now present in terminals in people's daily lives, and a user can control a terminal by inputting specific voice, for example to wake up and light the screen, wake up and unlock, or wake up and start the voice conversation function. A terminal may receive multiple voices at the same time, and in order to distinguish the voice with which the user intends to control the terminal, it is common to detect whether the voice contains a wake-up word and to wake up only if it does. In actual use, however, the terminal is often woken up by mistake even when the user has not spoken the wake-up word; that is, the current false wake-up rate of voice wake-up is high.
Disclosure of Invention
The embodiment of the application provides a voice awakening method and device, electronic equipment and a storage medium, and can reduce the false awakening rate of voice awakening at a terminal.
In a first aspect, an embodiment of the present application provides a voice wake-up method, which is applied to a terminal, where the terminal is provided with an audio collector, and the method includes: acquiring input voice collected by the audio collector; matching the input speech based on a first speech matching model, resulting in a first probability output indicating a probability that the input speech includes the specified text; obtaining at least one probability output of the first speech matching model output before the current first probability output as a second probability output; fusing the first probability output with the second probability output to obtain an updated first probability output; outputting the updated first probability as a first matching result of the first speech matching model matching the input speech; and if the first matching result indicates that the input voice contains the specified text, awakening the terminal.
In a second aspect, an embodiment of the present application provides a voice wake-up device, which is applied to a terminal, the terminal is provided with an audio collector, and the device includes: the voice acquisition module is used for acquiring the input voice acquired by the audio acquisition device; a first output module, configured to match the input speech based on the first speech matching model, resulting in a first probability output, where the first probability output is used to indicate a probability that the input speech includes the specified text; a second output module, configured to obtain at least one probability output of the first speech matching model output before the current first probability output, as a second probability output; an output update module for fusing the first probability output with the second probability output to obtain an updated first probability output; a result obtaining module, configured to output the updated first probability as a first matching result of the first speech matching model matching the input speech; and the terminal awakening module is used for awakening the terminal if the first matching result indicates that the input voice contains the specified text.
In a third aspect, an embodiment of the present application provides an electronic device, including: a memory; one or more processors coupled with the memory; one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more application programs being configured to perform the voice wake-up method provided by the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a program code is stored in the computer-readable storage medium, and the program code may be called by a processor to execute the voice wake-up method provided in the first aspect.
According to the voice wake-up method and device, electronic device and storage medium provided by the embodiments of the present application, the input voice collected by the audio collector is acquired, the input voice is matched based on the first voice matching model to obtain a first probability output indicating the probability that the input voice contains the specified text, at least one probability output produced by the first voice matching model before the current first probability output is acquired as a second probability output, the first probability output and the second probability output are fused to obtain an updated first probability output, the updated first probability output is taken as a first matching result of the first voice matching model matching the input voice, and the terminal is woken up if the first matching result indicates that the input voice contains the specified text. In this way, the current first probability output of the first voice matching model is fused with the second probability output from its historical outputs to obtain the first matching result indicating whether the input voice contains the specified text, which improves keyword recognition accuracy, effectively suppresses jitter in keyword detection, and reduces the false wake-up rate.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 shows an application scenario diagram of a voice wakeup method according to an embodiment of the present application.
Fig. 2 shows a flowchart of a voice wake-up method according to an embodiment of the present application.
Fig. 3 shows a flowchart of a voice wake-up method according to another embodiment of the present application.
Fig. 4 is a schematic diagram illustrating a MFCC feature extraction process according to an exemplary embodiment of the present application.
Fig. 5 shows a schematic structural diagram of a convolutional neural network provided in an exemplary embodiment of the present application.
Fig. 6 shows a flowchart of step S230 in fig. 3 according to an exemplary embodiment of the present application.
Fig. 7 is a diagram illustrating an attention weight extraction process according to an exemplary embodiment of the present application.
Fig. 8 illustrates a flowchart of step S231 in fig. 6 according to an exemplary embodiment of the present application.
Fig. 9 shows a schematic diagram of a pooling process involved in an exemplary embodiment of the present application.
FIG. 10 shows a schematic diagram of an attention-scaling process provided by an exemplary embodiment of the present application.
Fig. 11 illustrates a flowchart of step S250 in fig. 3 according to an exemplary embodiment of the present application.
Fig. 12 illustrates a flowchart of step S254 in fig. 11 according to an exemplary embodiment of the present application.
FIG. 13 is a diagram illustrating a history fusion process according to an exemplary embodiment of the present application.
Fig. 14 shows a flowchart of a voice wake-up method according to another embodiment of the present application.
Fig. 15 is a flowchart illustrating a voice wake-up method according to still another embodiment of the present application.
Fig. 16 shows a flowchart of step S490 in fig. 15 according to an exemplary embodiment of the present application.
Fig. 17 shows a block diagram of a voice wake-up apparatus provided in an embodiment of the present application.
Fig. 18 shows a block diagram of an electronic device according to an embodiment of the present application.
Fig. 19 illustrates a storage unit for storing or carrying program codes for implementing the voice wake-up method according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
The existing voice wake-up method is generally implemented based on voice recognition of a wake-up word. In the existing voice wake-up scheme, when a terminal is woken up by voice, the voice input by the user is usually recognized to detect whether it contains the wake-up word, and if it does, wake-up operations such as unlocking and lighting the screen can be executed. For example, if the terminal pre-stores the wake-up word "small europe", then when the user says "small europe", the terminal obtains the voice input, and if it recognizes the wake-up word contained in the voice input, it performs a screen-on operation.
However, the inventor has found that, in actual use, even with voice wake-up as described above, false wake-up often occurs: the terminal is woken up although the voice input by the user does not contain the wake-up word. This causes unnecessary waste of terminal resources and power consumption, may trigger erroneous operations, and degrades the user experience.
To address the above problem, embodiments of the present application provide a voice wake-up method, an apparatus, an electronic device and a computer-readable storage medium. Input voice collected by an audio collector is acquired, the input voice is matched based on a first voice matching model to obtain a first probability output indicating whether the input voice contains a specified text, at least one probability output produced by the first voice matching model before the current first probability output is acquired as a second probability output, the first probability output and the second probability output are fused to obtain an updated first probability output, the updated first probability output is taken as a first matching result of the first voice matching model matching the input voice, and the terminal is woken up if the first matching result indicates that the input voice contains the specified text. In this way, the current first probability output of the first voice matching model is fused with the second probability output from its historical outputs to obtain the first matching result indicating whether the input voice contains the specified text, which improves keyword recognition accuracy, effectively suppresses jitter in keyword detection, and reduces the false wake-up rate.
For convenience of detailed description, an application scenario to which the embodiments of the present application are applied is described below with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is a schematic view illustrating an application scenario of a voice wake-up method provided in an embodiment of the present application, where the application scenario includes a voice wake-up system 10 provided in the embodiment of the present application. The voice wake-up system 10 includes: a terminal 100 and a server 200.
The terminal 100 may be, but is not limited to, a mobile phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a personal computer, or a wearable electronic device.
In this embodiment, the terminal 100 is provided with an audio collector, such as a microphone, which can collect voice through the audio collector.
The server 200 may be a traditional server, a cloud server, a server cluster composed of a plurality of servers, or a cloud computing service center.
In some possible embodiments, the device for processing the input voice may be disposed in the server 200, and after the terminal 100 acquires the input voice, the input voice may be sent to the server 200, and the server 200 processes the input voice and returns a processing result to the terminal 100, so that the terminal 100 may perform a subsequent operation according to the processing result.
The device for processing the input voice may be a voice matching device.
In some possible embodiments, the means for processing the input speech may further comprise a voiceprint recognition means for voiceprint recognition of the input speech.
As an embodiment, the voice matching device may be disposed in the server 200 and the voiceprint recognition device in the terminal 100; the server 200 may return the voice matching result to the terminal 100, the terminal 100 determines based on the voice matching result whether to perform voiceprint recognition, and performs voiceprint recognition and subsequent operations when voiceprint recognition is required.
In another embodiment, the positions of the voice matching device and the voiceprint recognition device may be interchanged, that is, the voice matching device may be installed in the terminal 100 and the voiceprint recognition device in the server 200. After the terminal 100 performs voice matching based on the voice matching device, if the input voice matches, the input voice may be transmitted to the server 200, and the server 200 is instructed to perform voiceprint recognition on the input voice based on the voiceprint recognition device and to return the voiceprint recognition result to the terminal 100, so that the terminal 100 can determine whether to wake up the terminal based on the voiceprint recognition result.
As still another embodiment, both the voice matching device and the voiceprint recognition device may be provided to the server 200, and the server 200 may return a voiceprint recognition result to the terminal 100 so that the terminal 100 may determine whether to wake up the terminal based on the voiceprint recognition result.
In other possible embodiments, the device for processing the input voice may also be disposed on the terminal 100, so that the terminal 100 does not need to rely on establishing communication with the server 200, and can also process the input voice to obtain the processing result, and then the voice wake-up system 10 may only include the terminal 100.
The voice wake-up method, apparatus, electronic device and storage medium provided by the embodiments of the present application will be described in detail below through specific embodiments.
Referring to fig. 2, fig. 2 is a schematic flowchart illustrating a voice wake-up method according to an embodiment of the present application, which can be applied to the terminal. The flow shown in fig. 2 will be described in detail below. The voice wake-up method may include:
step S110: and acquiring input voice acquired by the audio acquisition device.
The terminal can be provided with an audio collector and can also be connected with an external audio collector, and the connection here can be wireless connection or wired connection, which is not limited herein. In some embodiments, if the connection is Wireless, the terminal may be provided with a Wireless communication module, such as a Wireless Fidelity (WiFi) module, a Bluetooth (Bluetooth) module, and the like, and may obtain the input voice collected by the audio collector based on the Wireless communication module.
In some embodiments, the terminal may collect sound through an audio collector, such as a microphone, to obtain the input voice collected by the audio collector. Because picking up sound with the audio collector consumes little power, the audio collector can stay on at all times and keep picking up sound. In some embodiments, the audio collector may buffer the collected audio periodically and send the buffered audio to the processor for processing.
Step S120: and matching the input voice based on the first voice matching model to obtain a first probability output.
Wherein the first probability output is indicative of a probability that the input speech includes the specified text.
The first voice matching model can be obtained by training on first training data. The first training data may include a plurality of positive sample voices and a plurality of negative sample voices, where a positive sample voice contains the specified text and a negative sample voice does not. Matching the input voice with the first voice matching model therefore makes it possible to judge whether the input voice contains the specified text; this matching verification of the input voice yields a first probability output, which indicates the probability that the input voice contains the specified text.
The specified text may be preset by a program or customized by the user, and is not limited herein. For example, the specified text may be "small europe", and the like, without limitation. In one example, the positive sample speech may be the speech corresponding to "small europe, how is the weather today", and the negative sample speech may be the speech corresponding to "how is the weather". In addition, in some embodiments, the specified text may also be referred to as a wake-up word, which is not limited in the embodiments of this application.
In some embodiments, the user may set a specific text in the terminal in advance, for example, the specific text may be input on a wakeup word setting page of the terminal, and may be input a specific voice corresponding to the specific text, or may be input only the text content of the specific text.
In one embodiment, the user may input a specified voice corresponding to the specified text, so that the terminal acquires the specified voice based on the wakeup word setting page to train the voice wakeup algorithm.
In a specific example, the user may enter the wake-word setting page through the following series of operations to set the wake-up word: Settings - Security - Smart Unlock - Set digital password - Wake word setting. The terminal may then display the wake-word setting page and prompt the user to enter the wake-up word; the user speaks the wake-up word, e.g., "step by step", and the terminal may acquire the corresponding voice data as training data to train the voice wake-up algorithm.
In addition, in some embodiments, in order to improve the recognition accuracy of the voice wake-up algorithm, the terminal may prompt the user to repeatedly input a wake-up word for multiple times, and finally, send the voice data input for multiple times as training data to the voice wake-up algorithm for training, and prompt the user when the training is completed. After the training is finished, whether the input voice contains the specified text can be detected by using a voice awakening algorithm.
In some embodiments, the first voice matching model may be constructed based on a neural network, which may be, but is not limited to, a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), or the like; this is not limited in this embodiment.
In some embodiments, when matching the input speech based on the first speech matching model, the input speech may be preprocessed to obtain multi-frame speech segments, and the multi-frame speech segments may be matched based on the first speech matching model.
In an example, the input speech may be divided into frames according to a preset length to obtain multi-frame speech segments, where the length of each frame of speech segment may be smaller than or equal to the preset length. The preset length may be determined as needed or customized, for example 0.5 s, so that preprocessing the input speech yields speech segments of at most 0.5 s each. The preprocessed multi-frame speech segments are input into the first speech matching model in turn to obtain a plurality of probability outputs, where the current speech segment corresponds to the first probability output.
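Purely for illustration, the following minimal Python sketch shows how such framing could be done; the 16 kHz sampling rate, the 0.5 s preset length and the function name are assumptions, not values fixed by this embodiment.

```python
import numpy as np

def split_into_segments(waveform, sample_rate=16000, preset_seconds=0.5):
    """Split an utterance into segments no longer than the preset length.

    Each segment is fed to the first speech matching model in turn; the last
    segment may be shorter than the preset length, matching the description
    that every frame is smaller than or equal to the preset length.
    """
    segment_len = int(sample_rate * preset_seconds)
    return [waveform[i:i + segment_len]
            for i in range(0, len(waveform), segment_len)]

# Example: a 1.7 s utterance becomes segments of 0.5, 0.5, 0.5 and 0.2 s.
speech = np.random.randn(int(1.7 * 16000)).astype(np.float32)
print([round(len(s) / 16000, 2) for s in split_into_segments(speech)])
```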
Step S130: at least one probability output of the first speech matching model that is output before the current first probability output is obtained as a second probability output.
In some embodiments, the terminal may store a first probability output for each output of the first speech matching model. Then after obtaining the current first probability output, the probability output of the previous M times of output can be obtained as the second probability output, and the size of the second probability output is M × C. Wherein, M is greater than or equal to 1, it can be understood that the specific value thereof can be determined according to actual needs, which is not limited in this embodiment. In the foregoing example, the second probability output may be a probability output corresponding to at least one speech segment preceding the current speech segment.
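A possible way to keep such a history is sketched below; the deque-based buffer, M = 3 and C = 2 (two output classes) are assumptions made only for illustration.

```python
from collections import deque
import numpy as np

M, C = 3, 2                      # assumed: keep 3 past outputs, 2 output classes
history = deque(maxlen=M)        # holds the previous probability outputs

def take_second_output(first_prob_output):
    """Return the stored outputs (size up to M x C) as the second probability
    output, then record the current first probability output for later use."""
    second_prob_output = np.stack(history) if history else np.empty((0, C))
    history.append(np.asarray(first_prob_output))
    return second_prob_output
```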
Step S140: and fusing the first probability output and the second probability output to obtain an updated first probability output.
The length of the input speech segment corresponding to each probability output is generally smaller than the length of the keyword. For example, the duration of the speech corresponding to the user-specified text, i.e., the keyword length, may be between 1 s and 2 s, while the speech actually fed into the first speech matching model is a speech segment obtained by framing the input speech according to a preset length smaller than the keyword length, for example about 0.5 s. To recognize one input speech, the input speech is therefore split into multi-frame speech segments that are fed into the first speech matching model to obtain multiple probability outputs: the current probability output is the first probability output and the historical probability outputs are the second probability output. After the first probability output and the second probability output are obtained, fusing them allows a joint judgment over the multi-frame outputs that the first speech matching model produced for the input speech, yielding an updated first probability output. In this way the current output (i.e., the currently obtained first probability output) takes the historical output (i.e., the second probability output) into account, and updating the first probability output by combining the two addresses keyword recognition in continuous speech.
It should be noted that, in this embodiment, specific numerical values of the preset length and the keyword length are not limited, and only the preset length is smaller than the keyword length.
In some embodiments, the manner of fusing the first probability output and the second probability output includes, but is not limited to, taking a maximum value, a minimum value, an average value, and the like of the first probability output and the second probability output, and the averaging may further include a weighted average and the like, which is not limited in this embodiment.
In other embodiments, the first probability output may also be subjected to feature extraction, and the feature extraction result is compared with a preset value, and if the feature extraction result is higher than the preset value, the first probability output is used as an updated first probability output, and if the feature extraction result is lower than or equal to the preset value, the updated first probability output is determined according to the second probability output. The preset value may be determined according to actual needs, for example, may be obtained by training based on the first training data, or may be user-defined. Therefore, the problem of keyword detection jumping can be effectively solved, and the false awakening rate is reduced. Specific implementation manners can be seen in the following examples, which are not described herein again.
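As a sketch only, the function below combines the simple fusion strategies listed above (maximum, minimum, mean, weighted mean) with the threshold-gated variant just described; the function name, the gate value, the use of the raw maximum score as the "feature extraction result", and the mean-of-history fallback are assumptions.

```python
import numpy as np

def fuse_outputs(first_out, second_out, mode="mean", weights=None, gate=0.5):
    """Fuse the current probability output with the historical ones.

    first_out:  current first probability output, shape (C,)
    second_out: historical outputs (second probability output), shape (M, C)
    """
    stacked = np.vstack([second_out, first_out[None, :]])    # (M + 1, C)
    if mode == "max":
        return stacked.max(axis=0)
    if mode == "min":
        return stacked.min(axis=0)
    if mode == "mean":
        return stacked.mean(axis=0)
    if mode == "weighted":
        w = np.asarray(weights, dtype=float)[:, None]        # one weight per row
        return (stacked * w).sum(axis=0) / w.sum()
    if mode == "gated":
        # Keep the current output only if its score clears the preset value,
        # otherwise fall back to the historical outputs.
        return first_out if first_out.max() > gate else second_out.mean(axis=0)
    raise ValueError(f"unknown fusion mode: {mode}")
```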
Step S150: and outputting according to the updated first probability to obtain a first matching result of the first voice matching model for matching the input voice.
Obtaining an updated first probability output, which is a scalar, through the foregoing steps, and in some embodiments, if the updated first probability output is greater than a preset output threshold, determining that the input speech includes the specified text and obtaining a corresponding first matching result, that is, the first matching result indicates that the input speech includes the specified text; if the updated first probability output is smaller than or equal to the preset output threshold, it is determined that the input voice does not contain the designated text and a corresponding first matching result is obtained, that is, the first matching result indicates that the input voice does not contain the designated text, and the first voice matching model can wait for the subsequent input voice to start new verification.
The preset output threshold value can be determined according to actual needs, can be preset for a program, can be customized for a user, and is not limited herein.
Step S160: and if the first matching result indicates that the input voice contains the specified text, waking up the terminal.
In some embodiments, if the first matching result indicates that the input voice contains a specified text, the terminal may be awakened to perform a predetermined operation. As an embodiment, the terminal may pre-store a mapping relationship table between a current state of the terminal and a preset operation, where the current state of the terminal includes, but is not limited to, a screen state (whether to turn off a screen or not, whether to lock a screen), a currently running application, a current time, and the like, and is not limited herein. If the first matching result indicates that the input voice contains the specified text, the current state of the terminal can be obtained, so that the corresponding preset operation can be determined according to the current state of the terminal, and the terminal can be awakened to execute the preset operation. The preset operation may include, but is not limited to, a screen-up operation, an unlocking operation, and a voice assistant activation operation, which is not limited in this embodiment.
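The mapping table described above could look like the following sketch; the state values, operation names and default choice are hypothetical and only illustrate the idea of selecting a preset operation from the terminal's current state.

```python
# Hypothetical mapping from (screen state, lock state) to a preset operation.
WAKE_ACTIONS = {
    ("screen_off", "locked"): "light_screen_and_show_unlock",
    ("screen_on", "locked"): "unlock",
    ("screen_on", "unlocked"): "start_voice_assistant",
}

def wake_terminal(screen_state, lock_state):
    """Pick the preset operation for the terminal's current state and run it."""
    action = WAKE_ACTIONS.get((screen_state, lock_state),
                              "light_screen_and_show_unlock")
    print(f"waking terminal: {action}")

wake_terminal("screen_off", "locked")
```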
In some embodiments, if the voiceprint recognition is verified, the terminal may be awakened to switch from the screen-off state to a non-screen-off state, where the non-screen-off state may include a to-be-unlocked state in which the screen is lit and the unlocking interface is displayed, and may further include an unlocked state in which the screen is lit and the unlocking interface is not displayed.
In other embodiments, if the voiceprint identification passes the verification, the terminal can be awakened and the unlocking operation can be executed, so that the user can directly unlock the terminal through voice, and accurate, safe and convenient unlocking can be realized based on the method. For example, after the user says "small europe and small europe" to the terminal and the voiceprint identification passes the verification, the terminal screen may be lighted up and an unlocked interface may be displayed, an interface before the screen is locked last time may be displayed, and a desktop may also be displayed, which is not limited herein.
It is understood that the above is only an example, and the method provided by the present embodiment is not limited to the above scenario, but is not exhaustive here for reasons of space.
Existing related technologies mainly realize voice wake-up by recognizing isolated words, i.e., each piece of audio contains only one wake-up word, such as "little cloth" or "small europe", and the input speech fed into the algorithm has to be cut accurately. Such technologies therefore have a poor keyword recognition effect on continuous speech such as "small europe, how is the weather today": the keyword "small europe" cannot be recognized accurately, nor can it be separated from "how is the weather today", so wake-up cannot be realized and the natural-language request "how is the weather today" cannot be processed after wake-up to trigger the corresponding operation.
In the voice wake-up method provided by this embodiment, the current first probability output of the first voice matching model is obtained, and at least one probability output produced by the first voice matching model before the current first probability output is obtained as the second probability output. The length of the input speech segment corresponding to each probability output is generally smaller than the keyword length: for example, the duration of the speech corresponding to the user-specified text, i.e., the keyword length, may be between 1 s and 2 s, while the speech actually fed into the first speech matching model is a speech segment obtained by framing the input speech according to a preset length smaller than the keyword length, for example about 0.5 s. To recognize one input speech, the input speech is split into multiple frames that are fed into the first speech matching model, producing multi-frame results, where the current result is the first probability output and the historical results are the second probability output. By fusing the first probability output with the second probability output from the historical outputs, the multi-frame results that the first speech matching model produced for the input speech can be jointly judged to obtain the first matching result that finally indicates whether the input speech contains the specified text. This effectively suppresses keyword-detection jitter and false wake-up, solves the problem of continuous-speech keyword recognition, improves keyword recognition accuracy, and reduces the false wake-up rate.
Referring to fig. 3, fig. 3 is a flowchart illustrating a voice wake-up method according to another embodiment of the present application, which can be applied to the terminal, and the voice wake-up method includes:
step S210: and acquiring input voice acquired by the audio acquisition device.
Step S220: and extracting acoustic features of the input voice, and performing convolution operation on the acoustic features through the first voice matching model to obtain the convolution neural network output.
The first voice matching model can be constructed on the basis of a convolutional neural network, if the terminal is in a screen-off state, the terminal can extract acoustic features of input voice, and convolution operation is performed on the acoustic features through the first voice matching model to obtain convolutional neural network output.
In one embodiment, the first speech matching model may first perform feature extraction on the input speech, i.e., feature generation and feature dimensionality reduction, to obtain an acoustic feature of the input speech, where the acoustic feature may be a Mel Frequency Cepstral Coefficient (MFCC).
Referring to fig. 4, a schematic diagram of a MFCC feature extraction process according to an exemplary embodiment of the present application is shown. As shown in fig. 4, the input speech passes through the preprocessing module 401, the windowing module 402, the fourier transform module 403, and the MFCC extraction module 404 in sequence, and MFCC features corresponding to the input speech can be obtained as acoustic features thereof.
The preprocessing module 401 may be a high-pass filter; optionally, its mathematical expression may be H(z) = 1 - a·z^(-1), where H(z) represents the filtering applied to the input speech data and a is a correction coefficient, which may generally be 0.95 to 0.97.
Further, the windowing module 402 may be configured to smooth the filtered speech data and the edges of each frame signal. Optionally, the windowing module 402 may employ a Hamming window function for edge smoothing, whose expression may be

w(n) = 0.54 - 0.46·cos(2πn / (M - 1)), n = 0, 1, 2, ..., M - 1,

where n is an integer and M is the number of points of the Fourier transform; optionally, M may be 512.
Further, the spectrum corresponding to the smoothed speech data can be obtained by the Fourier transform module 403, and Mel filtering is then performed by the MFCC extraction module 404 to convert the spectrum into a Mel spectrum that matches human auditory perception. Optionally, the function adopted by the Mel filtering may be

F_mel(f) = 2595 · log10(1 + f / 700),

where F_mel(f) is the Mel-scale spectrum and f is a frequency point after the Fourier transform.

Optionally, after the Mel spectrum is obtained through the above processing, the logarithm of the obtained F_mel(f) is taken first, Discrete Cosine Transform (DCT) processing is then performed, and the finally obtained DCT coefficients are taken as the extracted MFCC features. The MFCC features of the input speech can thus be extracted as its acoustic features.
It should be noted that the above parameter is only an example, and in other examples, other parameters may also be selected, which is not limited in this embodiment.
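For illustration only, the sketch below walks through the pre-emphasis, windowing, Fourier transform, Mel filtering, logarithm and DCT steps described above for one short analysis frame; the frame length, the number of Mel filters and retained coefficients, and the simplified filter-bank construction are assumptions rather than values taken from this embodiment.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_frame(frame, sample_rate=16000, n_fft=512, n_mels=26, n_mfcc=13, a=0.97):
    """Compute MFCC-like coefficients for one analysis frame (<= n_fft samples)."""
    # Pre-emphasis, following H(z) = 1 - a*z^(-1)
    emphasized = np.append(frame[0], frame[1:] - a * frame[:-1])
    # Hamming window and magnitude spectrum (M = n_fft Fourier points)
    windowed = emphasized * np.hamming(len(emphasized))
    spectrum = np.abs(np.fft.rfft(windowed, n_fft))
    # Triangular Mel filter bank built from F_mel(f) = 2595 * log10(1 + f/700)
    mel_max = 2595 * np.log10(1 + (sample_rate / 2) / 700)
    hz_points = 700 * (10 ** (np.linspace(0, mel_max, n_mels + 2) / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    # Log of the Mel-filtered spectrum, then DCT; the DCT coefficients are the features
    log_mel = np.log(fbank @ spectrum + 1e-8)
    return dct(log_mel, norm="ortho")[:n_mfcc]

# Example: one 25 ms frame (400 samples at 16 kHz) of white noise.
print(mfcc_frame(np.random.randn(400)).shape)  # (13,)
```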
After the acoustic features of the input speech are extracted, a convolution operation can be performed on the acoustic features through the first speech matching model to obtain the convolutional neural network output. As an example, in the first speech matching model, the MFCC extraction module 404 may be followed by a convolutional neural network that performs a convolution operation on the acoustic features output by the MFCC extraction module 404 to obtain the convolutional neural network output.
In some embodiments, the first speech matching model may include N groups of sequentially connected convolutional layers, batch normalization (BN) layers and linear activation layers, where each group of a sequentially connected convolutional layer, BN layer and linear activation layer is taken as one convolution block. Extracting the acoustic features of the input speech and performing the convolution operation on the acoustic features through the first speech matching model to obtain the convolutional neural network output may then be implemented as follows: the extracted acoustic features of the input speech are used as the input of the 1st convolution block, the N convolution blocks sequentially process the output of the previous block, and the output of the Nth convolution block is taken as the convolutional neural network output, where the input of the ith convolution block is obtained by fusing the output of the (i-1)th convolution block with the input of the (N-i+1)th convolution block, N is a natural number, N ≥ 2, and i = 2, ..., N.
In one specific example, the structure of the convolutional neural network in the first speech matching model can be as shown in fig. 5, the convolutional neural network comprises n groups of sequentially connected convolutional blocks 500, and each convolutional block 500 comprises a sequentially connected convolutional layer 501, a batch normalization layer 502 and a linear activation layer 503. The acoustic features extracted by the above method can be used as the input of the first group of convolution blocks, and the convolution neural network output can be obtained by the n groups of convolution blocks 500.
The convolutional layer 501 is a convolutional neural network layer, i.e., a neural network layer that uses convolution as its main calculation. Optionally, the data size of the acoustic features is C × R × 1, where C is the number of feature columns, R is the number of feature rows and the number of channels is 1; the extracted acoustic features are input into the convolutional layer 501 in turn to compute local features. Optionally, the calculation formula is as follows:

y = I ∗ W + bias    formula (1)

In formula (1), I denotes the input, W denotes the weights corresponding to the convolution, ∗ denotes the convolution operation, and bias is a bias term; the result obtained through the convolutional layer calculation is a 3-dimensional feature of size c × r × 1.
The batch normalization layer 502 is a network layer that can effectively and adaptively normalize the output of each layer. Optionally, the calculation formulas are as follows:

μ = E[x]    formula (2)
σ = sqrt(Var[x])    formula (3)
x̂ = (x - μ) / σ    formula (4)
y = γ·x̂ + β    formula (5)

In formulas (2) to (5), x is the output of the previous layer, β and γ are adaptive parameters, and k denotes the batch size over which the statistics are computed. The mean of the previous layer's output is calculated based on formula (2), its standard deviation based on formula (3), the normalization is performed based on formula (4), and the normalized data are reconstructed based on formula (5) to obtain y. Inputting x into the batch normalization layer 502 to compute the variance and mean yields the adaptive parameters γ and β, and the computed adaptive parameters γ and β are used in the model inference process to realize batch normalization.
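A small NumPy sketch of formulas (2) to (5), written in the standard batch-normalization form; the epsilon term and the scalar defaults for gamma and beta are assumptions added for numerical stability and illustration.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """x has shape (batch_size k, features); statistics are taken over the batch."""
    mu = x.mean(axis=0)                       # formula (2): batch mean
    sigma = np.sqrt(x.var(axis=0) + eps)      # formula (3): batch standard deviation
    x_hat = (x - mu) / sigma                  # formula (4): normalization
    return gamma * x_hat + beta               # formula (5): reconstruction with gamma, beta
```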
The linear activation layer 503 may be used to linearly boost the output features; optionally, the calculation formula is as follows:

y = f(x) = max(λ·x, 0)    formula (6)

Only the part of the features with positive values is output, and the positive-valued features are multiplied by the factor λ as a linear enhancement.
The U-shaped residual structure module 504 is a layered structure that separates and recombines the features of each layer: the first group (i = 1) is feature-fused with the final group (i = N), the second group (i = 2) with the penultimate group (i = N - 1), and so on, so that the input of the ith convolution block is obtained by fusing the output of the (i-1)th convolution block with the input of the (N-i+1)th convolution block, where N is a natural number, N ≥ 2, and i = 2, ..., N. Based on the U-shaped residual structure module 504, the whole feature information stream can therefore be retained in the computation, and the low-level and high-level features produced during inference are fused at multiple scales, so the network can be designed deeper while the expressive power of the output features is guaranteed, further improving the accuracy of keyword detection.
Referring to fig. 5, the convolutional neural network deepens the model in its longitudinal dimension by repeatedly applying the convolutional layer 501, the batch normalization layer 502, the linear activation layer 503 and the U-shaped residual structure 504, so that the model features can be abstracted and extracted multiple times for effective classification, which improves the accuracy of keyword detection. At the same time, the dimensionality of the model output is continually reduced, alleviating the difficulty of training an overly deep network; after stacking these blocks several times, the final output of the U-shaped-residual convolutional neural network, i.e., the convolutional neural network output, is obtained.
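The PyTorch sketch below shows one possible reading of the structure in fig. 5: N conv/batch-norm/ReLU blocks whose inputs are additively fused with the inputs their U-shaped partner blocks received earlier. The constant channel count, the additive fusion, the kernel size, the block count and the example input shape are assumptions made so that the skip connections line up; the actual model reduces dimensionality, which would require extra projections.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """One group from fig. 5: convolutional layer -> batch normalization -> linear activation."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)

class UResidualCNN(nn.Module):
    """N convolution blocks; in the second half, the input of block i is fused
    with the input that block N-i+1 received earlier (U-shaped skips)."""
    def __init__(self, channels=16, n_blocks=6):
        super().__init__()
        self.blocks = nn.ModuleList(ConvBlock(channels) for _ in range(n_blocks))

    def forward(self, x):
        n = len(self.blocks)
        block_inputs = []                              # input seen by each block
        for i, block in enumerate(self.blocks, start=1):
            partner = n - i + 1                        # U-shaped partner block
            if partner < i:                            # its input already exists
                x = x + block_inputs[partner - 1]      # fuse by addition (assumed)
            block_inputs.append(x)
            x = block(x)
        return x                                       # convolutional neural network output

# Example with an assumed 1 x 16 x 13 x 50 feature tensor.
print(UResidualCNN()(torch.randn(1, 16, 13, 50)).shape)
```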
At present, reducing the false wake-up rate requires improving the wake-up algorithm, making it more accurate and more complex, and running it continuously on the terminal processor, which imposes a heavy power-consumption burden on the terminal. For a terminal used while plugged in (for example, a smart speaker) the impact may be small, but for a terminal not plugged in during use (for example, a mobile phone or tablet computer) it accelerates battery drain and shortens standby time. For this reason, this embodiment continually deepens the longitudinal dimension of the model through the above operations, abstracting and extracting the model features multiple times for effective classification and improving keyword-detection accuracy, while the U-shaped residual structure 504 continually reduces the dimensionality of the model output and controls the model size. In this way, the first speech matching model not only becomes more accurate at keyword recognition, but can also run on a low-power module with lower computing performance but lower power consumption, so that while the false wake-up rate is reduced, the power the terminal needs for voice wake-up on the low-power module is also reduced, which helps it respond to the user's wake-up in all-day scenarios.
Step S230: And matching the output of the convolutional neural network with the acoustic characteristics corresponding to the specified text to obtain a first probability output.
In some embodiments, the convolutional neural network output may be matched with the acoustic features corresponding to the specified text to determine, from the convolutional neural network output, whether the input speech contains the specified text. For example, the convolutional neural network output may be fed into a classifier, such as a Softmax classifier, to obtain the probability that the input speech contains the specified text, which is taken as the first probability output. In other examples, other classifiers may also be employed, without limitation.
In other embodiments, step S230 may also include steps S231 to S233, so as to introduce an attention mechanism on the basis of the convolutional neural network, so as to perform attention optimization on the convolutional neural network, so as to deal with the problem that the model precision is lost when the sequence is too long, and improve the accuracy of detecting the keyword. Specifically, referring to fig. 6, fig. 6 is a schematic flowchart illustrating a flow of step S230 in fig. 3 in an exemplary embodiment of the present application, where step S230 includes:
step S231: and extracting attention weight of the output of the convolutional neural network according to channels to obtain an attention weight vector corresponding to the output of the convolutional neural network.
In some embodiments, when performing attention weight extraction on the convolutional neural network output by channel, the convolutional neural network output may be passed sequentially through a pooling layer, a convolutional layer, a fully connected (FC) layer and a nonlinear activation layer to obtain the attention weight vector corresponding to the convolutional neural network output.
In an exemplary embodiment, the flow of the aforementioned attention weight extraction may be as shown in fig. 7. In order to explain the flow shown in fig. 7 in detail, please refer to fig. 8, and fig. 8 shows a schematic flow chart of step S231 in fig. 6 according to an exemplary embodiment of the present application, in this embodiment, step S231 may include:
step S2311: and sorting the characteristic values of the characteristics of each channel output by the convolutional neural network from big to small through a pooling layer, and extracting the characteristic values of a plurality of bits in front of each channel as the characteristic value of the characteristics of each channel after pooling to obtain the pooled characteristics.
The pooling layer can be used to reduce the dimensionality of the extracted features: on the one hand, this reduces the number of features, simplifies the computational complexity of the network and avoids overfitting to a certain extent; on the other hand, the salient features are retained.
In this embodiment, the pooling layer may employ TopN pooling on its input features, i.e., an operation that extracts the first N maxima of a vector. Specifically, as shown in fig. 7, if the data size of the convolutional neural network output is C × H × W, that is, the input of the pooling layer is a feature of size C × H × W, where C is the number of input channels, H the input height and W the input width, then TopN pooling yields a feature of size C × N × 1, i.e., the pooled feature.
In one embodiment, the TopN pooling process can be seen in fig. 9. As shown in fig. 9, the input of the pooling layer is C × H × W, and the feature of each channel c (c ∈ {1, ..., C}) has size H × W; the feature values of each channel are sorted in descending order, the first N values are extracted as the pooled values of that channel, and this operation is performed on each channel in turn, giving an output of size C × N × 1, i.e., the pooled feature.
In addition, in some other embodiments, the Pooling layer may employ Max Pooling (Max Pooling), Mean Pooling (Mean Pooling), and the like, and is not limited herein. Taking the maximum pooling as an example, the convolutional neural network output can be divided into a plurality of regions according to the characteristics of each channel, the maximum value of each region is taken as the region output, and finally the output consisting of the maximum values of the regions is obtained.
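A minimal sketch of the TopN pooling shown in fig. 9 (PyTorch is used here only for convenience, and N = 4 is an arbitrary illustrative choice):

```python
import torch

def topn_pool(features, n=4):
    """Per channel, sort the H*W values in descending order and keep the first N,
    turning a C x H x W feature into the C x N x 1 pooled feature."""
    c = features.shape[0]
    flat = features.reshape(c, -1)                 # C x (H*W)
    topn = torch.topk(flat, k=n, dim=1).values     # C x N, sorted descending
    return topn.unsqueeze(-1)                      # C x N x 1

print(topn_pool(torch.randn(8, 5, 7)).shape)       # torch.Size([8, 4, 1])
```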
Step S2312: and performing feature extraction on the pooled features through the convolutional layer to obtain a one-dimensional vector.
The pooled features obtained by the pooling layer are input into the convolutional layer for feature extraction, and a one-dimensional vector with a data size of (C/N) × 1 is obtained by the calculation.
Step S2313: And passing the one-dimensional vector through the fully connected layer and the nonlinear activation layer in sequence to obtain the attention weight vector corresponding to the output of the convolutional neural network.
The fully connected layer is a neural network layer that uses weights as its calculation method. The one-dimensional vector obtained through the pooling layer and the convolutional layer is input into the fully connected layer to compute local features; optionally, the calculation formula may be:

y = I·W + bias    formula (7)

In formula (7), I denotes the input, W denotes the weights, and bias is a bias term; a feature of size C1 × 1 is obtained through the fully connected layer calculation.
In some embodiments, the convolutional layer may be followed by a sequential joining of the two fully-connected layers, as shown in fig. 7. In other possible embodiments, one or more than two fully-connected layers may also be connected, and are not limited herein.
In one embodiment, the nonlinear activation layer may be implemented with a nonlinear activation function such as the Sigmoid function or the Tanh function. Optionally, taking the Sigmoid function as an example, the calculation formula may be:

y = sigmoid(x) = 1 / (1 + e^(-x))    formula (8)

After the nonlinear activation layer, a one-dimensional vector of size C1 is thus obtained and used as the attention weight vector corresponding to the convolutional neural network output.
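Putting steps S2311 to S2313 together, the sketch below extracts a per-channel attention weight vector via TopN pooling, a convolutional layer, fully connected layers and a Sigmoid activation; the layer sizes, the hidden width and the use of a 1-D convolution over the pooled values are assumptions, since the text does not fix them.

```python
import torch
import torch.nn as nn

class AttentionWeightExtractor(nn.Module):
    """Pooling -> convolution -> fully connected -> nonlinear activation (fig. 7)."""
    def __init__(self, channels=16, topn=4, hidden=32):
        super().__init__()
        self.topn = topn
        self.conv = nn.Conv1d(channels, channels, kernel_size=topn)  # squeezes C x N to C x 1
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),                          # nonlinear activation layer
        )

    def forward(self, cnn_out):                    # cnn_out: B x C x H x W
        pooled = torch.topk(cnn_out.flatten(2), self.topn, dim=2).values  # B x C x N
        squeezed = self.conv(pooled).squeeze(-1)                          # B x C
        return self.fc(squeezed)                   # attention weight vector, B x C

print(AttentionWeightExtractor()(torch.randn(1, 16, 13, 50)).shape)       # torch.Size([1, 16])
```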
Step S232: and carrying out weighting processing on the output of the convolutional neural network according to the attention weight vector to obtain the attention output characteristics.
As shown in fig. 7, after obtaining the attention weight vector corresponding to the output of the convolutional neural network, attention scaling may be performed on the obtained attention weight vector and the input features before the attention weight extraction (i.e., the output of the convolutional neural network), that is, the output of the convolutional neural network may be weighted according to the attention weight vector to obtain the attention output features.
In an exemplary embodiment, please refer to fig. 10, which shows a schematic diagram of the attention scaling process provided in an exemplary embodiment of the present application, and as shown in fig. 10, the algorithm of step S232 is abstracted into a module, i.e., the attention scaling module, and then the input of the attention scaling module is the convolutional neural network output (with size C H W) and the attention weight vector (with size C1).
In one embodiment, the attention weight vector may be updated by a degree quantization structure to obtain an attention update weight, where the update may be based on a predetermined formula. Wherein the predetermined formula may be at least one of the following formulas (9) to (13), and of course, may not be limited to the following formulas:
a_t = g_BO(h_t) = b_t    formula (9)
a_t = g_L(h_t) = w_t^T·h_t + b_t    formula (10)
a_t = g_SL(h_t) = w^T·h_t + b    formula (11)
a_t = g_NL(h_t) = V_t^T·tanh(w_t^T·h_t + b_t)    formula (12)
a_t = g_SNL(h_t) = V^T·tanh(w^T·h_t + b)    formula (13)
Formulas (9) to (13) can all reach convergence through end-to-end training and each has its own advantages for models with different feature distributions. Here h_t denotes the attention weight vector and a_t the attention update weight, where t ∈ (1, T); the attention update weight a_t corresponding to the attention weight vector h_t can be obtained through the above formulas. Optionally, the attention update weights a_t are averaged bitwise to obtain a feature of size C1 × 1, the attention scaling weight a_t' is obtained through feature mapping, and a_t' is normalized to obtain the vector p_t. The convolutional neural network output j_t and the vector p_t are then weighted and accumulated by channel to obtain the attention output feature.
The size of the attention output feature is C × H × W. Therefore, attention optimization is carried out on the convolutional neural network based on an attention mechanism, so that the output of the optimized convolutional neural network can be fused with low-dimensional features and high-dimensional features, and the first voice matching model has better generalization capability in various scenes.
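A simplified reading of the attention-scaling module is sketched below: the attention weights are normalized and used to scale the convolutional output channel by channel. The softmax normalization stands in for the feature mapping and normalization described above, and formulas (9) to (13) for refining the weights are omitted.

```python
import torch

def attention_scale(cnn_out, attn_weights):
    """cnn_out: B x C x H x W; attn_weights: B x C. Returns the attention output
    features of the same size as cnn_out, weighted channel-wise."""
    b, c, h, w = cnn_out.shape
    p = torch.softmax(attn_weights, dim=1)        # normalized per-channel weights (p_t)
    return cnn_out * p.view(b, c, 1, 1)           # channel-wise weighting of j_t
```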
Step S233: and matching the attention output characteristics with the acoustic characteristics corresponding to the specified text to acquire first probability output.
In some embodiments, the attention output features may be matched with acoustic features corresponding to the specified text to feature map the attention output features with output categories corresponding to the specified text to obtain the first probability output. For example, if the designated text is "small europe", the output type may be "small europe", and the first probability output, which is the probability of whether the attention output feature can be matched with the designated text, may be obtained by performing feature mapping between the attention output feature and the output type.
The first probability output may reflect the degree to which the attention output feature matches the specified text; a higher probability generally indicates a higher degree of matching.
As an embodiment, the attention output features may first be reduced in dimensionality by global pooling, i.e., the attention output features of size C × H × W are pooled over the height and width. Optionally, global max pooling may be employed: each pooling window is pooled by taking the maximum value within it, where i ∈ H × W indexes the positions in a window and β_i is the size of the pooling window. From this, an output feature of size C1 × 1 can be calculated for each channel.
Further, in order to feature-map the output features to the output categories, the output features may be subjected to global normalization. A vector k_t is thereby obtained as a probability estimate for the corresponding output category, i.e., the first probability output, where k_t ∈ [0, 1].
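As a hedged illustration of the pooling and normalization just described, the sketch below assumes a PyTorch-style head: the exact feature-mapping and normalization formulas in this publication are given as images and are not reproduced here, so the linear layer, the two-class setup and the class index are illustrative assumptions only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeywordHead(nn.Module):
    """Sketch of step S233: global max pooling over H and W, feature mapping
    to the output categories, then normalization so k_t lies in [0, 1]."""
    def __init__(self, channels, num_classes=2):
        super().__init__()
        # feature mapping from per-channel features to output categories
        # (e.g. "contains the specified text" vs. "does not")
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, attn_feat):
        # attn_feat: (N, C, H, W) attention output features
        pooled = F.adaptive_max_pool2d(attn_feat, 1).flatten(1)  # (N, C), one value per channel
        probs = F.softmax(self.fc(pooled), dim=1)                # global normalization
        return probs[:, 1]                                       # k_t: probability of the keyword class
```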
In some embodiments, if k_t is greater than the preset result threshold, it can be determined that the input voice contains the specified text; if k_t is less than or equal to the preset result threshold, it can be determined that the input voice does not contain the specified text.
The preset result threshold can be set according to actual needs. In one embodiment, the Equal Error Rate (EER) on the training data set, i.e., the operating point at which the False Acceptance Rate (FAR) equals the False Rejection Rate (FRR), is calculated, and the threshold at that point is used as the preset result threshold. This balances the false wake-up rate and the false rejection rate of the first speech matching model, trading recall off against precision, so that the first speech matching model can be used to capture voice wake-up words in all-day scenarios; with a relatively high recall and a relatively low precision, as many potential user wake-up scenes as possible can be effectively recognized and the missed-detection rate can be effectively reduced.
In other embodiments, the preset result threshold may also be calculated based on other manners, or may also be user-defined, which is not limited in this embodiment.
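For illustration, one way to pick such a threshold from a labeled validation set is sketched below; this is a minimal NumPy example assuming scores k_t and binary labels, and it simply searches for the point where FAR and FRR are closest, which is one reasonable reading of the EER criterion above.

```python
import numpy as np

def eer_threshold(scores, labels):
    """Sketch: pick the threshold where the false acceptance rate (FAR)
    and the false rejection rate (FRR) are closest (the EER point)."""
    thresholds = np.unique(scores)
    best_thr, best_gap = thresholds[0], float("inf")
    for thr in thresholds:
        accept = scores > thr
        far = np.mean(accept[labels == 0]) if np.any(labels == 0) else 0.0   # negatives accepted
        frr = np.mean(~accept[labels == 1]) if np.any(labels == 1) else 0.0  # positives rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_thr, best_gap = thr, gap
    return best_thr
```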
Step S240: at least one probability output of the first speech matching model that is output before the current first probability output is obtained as a second probability output.
Step S250: and fusing the first probability output and the second probability output to obtain an updated first probability output.
In some embodiments, step S250 may include steps S251 to S254, specifically, referring to fig. 11, fig. 11 shows a schematic flow chart of step S250 in fig. 3 according to an exemplary embodiment of the present application, and in this embodiment, step S250 may include:
step S251: and performing feature extraction on the second probability output to obtain a first historical feature and a second historical feature.
In some embodiments, a recurrent neural network may be used to perform feature extraction on the second probability output, so as to obtain the first historical feature and the second historical feature.
As an embodiment, when the input sequence is long, a Long Short-Term Memory (LSTM) network or a Gated Recurrent Unit (GRU) may be used to perform feature extraction on the second probability output, which is not limited in this embodiment.
In addition, in some embodiments, in order to better understand the context environment and eliminate ambiguity, a Bidirectional Recurrent Neural Network (Bi-RNN) may be specifically used to perform feature extraction on the second probability output, so that context dependencies in two directions can be learned to perform more effective feature extraction and processing on the sequence information features. Specifically, the second probability output is respectively input to two bidirectional RNNs to obtain a first history feature and a second history feature, where the bidirectional RNNs include a plurality of nodes, and the number of the nodes is not limited in this embodiment and may be determined according to actual needs.
In some examples, the first speech matching model may include a first bidirectional recurrent neural network layer and a second bidirectional recurrent neural network layer; in this case, feature extraction is performed on the second probability output through these two layers to obtain the first historical feature and the second historical feature, respectively.
In some embodiments, the network parameters of the first and second bidirectional recurrent neural network layers are different and can be set according to actual requirements.
In other embodiments, other neural networks may also be used to perform feature extraction on the second probability output to obtain the first historical feature and the second historical feature, which is not limited in this embodiment.
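As a hedged sketch of the bidirectional recurrent option above, the following assumes PyTorch bidirectional GRUs stand in for the two bidirectional recurrent neural network layers; the class and parameter names are illustrative and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class HistoryFeatureExtractor(nn.Module):
    """Sketch: two separate bidirectional GRUs extract the first and second
    historical features from the second probability output (cf. step S251)."""
    def __init__(self, in_dim, hidden):
        super().__init__()
        self.birnn1 = nn.GRU(in_dim, hidden, bidirectional=True, batch_first=True)
        self.birnn2 = nn.GRU(in_dim, hidden, bidirectional=True, batch_first=True)

    def forward(self, second_prob):
        # second_prob: (N, T, in_dim) — the historical probability outputs
        feat1, _ = self.birnn1(second_prob)   # first historical feature
        feat2, _ = self.birnn2(second_prob)   # second historical feature
        return feat1, feat2
```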
Step S252: and fusing the first probability output and the first historical characteristic to obtain a historical attention weight vector corresponding to the second probability output.
In some embodiments, the size of the first probability output is the same as the size of the first historical feature. The first probability output and the first historical feature may be multiplied point by point to obtain a one-dimensional vector feature of size C, which is then input to a normalization layer for normalization. In one example, the normalization layer may use a Softmax function, i.e., the normalized value may be calculated as
h_st = exp(c_t) / Σ_j exp(c_j)
where c_t denotes the t-th value in the one-dimensional vector feature, h_st denotes its normalized value, and the sum runs over the C values of the vector. After the fused feature of the first probability output and the first historical feature is normalized by the normalization layer in this way, the vector formed by the normalized values is obtained as the historical attention weight vector corresponding to the second probability output.
Step S253: and carrying out weighting processing on the second historical characteristics according to the historical attention weight vector to obtain historical fusion output.
After the historical attention weight vector is obtained, the second historical feature can be weighted according to the historical attention weight vector to obtain historical fusion output. As an embodiment, the historical attention weight vector and the second historical feature may be multiplied point by point to obtain an optimized historical fusion output based on the attention mechanism.
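A minimal sketch of steps S252–S253, assuming PyTorch tensors of matching sizes; the function name and shapes are illustrative assumptions, not part of this application.

```python
import torch
import torch.nn.functional as F

def history_fusion_output(first_prob, feat1, feat2):
    """Sketch: point-wise multiply the current output with the first historical
    feature, softmax-normalize to obtain the historical attention weight
    vector, then weight the second historical feature with it."""
    # first_prob and feat1 are assumed to have the same size (N, C)
    fused = first_prob * feat1          # point-wise fusion, one-dimensional vector feature
    h = F.softmax(fused, dim=-1)        # historical attention weight vector
    return h * feat2                    # attention-weighted history fusion output
```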
Step S254: the first probability output is fused with the history fusion output to obtain an updated first probability output.
In some embodiments, the step S254 may specifically include steps S2541 to S2543, specifically, referring to fig. 12, fig. 12 shows a flowchart of the step S254 in fig. 11 provided in an exemplary embodiment of the present application, and in this embodiment, the step S254 may include:
step S2541: and performing feature extraction on the first probability output to obtain an output coefficient corresponding to the first probability output.
In one embodiment, the first probability output may be subjected to feature extraction via a fully-connected layer and a nonlinear activation layer to obtain an output coefficient G corresponding to the first probability output.
Step S2542: and if the output coefficient is larger than the preset result threshold, the first probability output is used as the updated first probability output.
Step S2543: and if the output coefficient is smaller than or equal to the preset result threshold, outputting the history fusion output as the updated first probability output.
The calculation of the preset result threshold can be seen in the foregoing embodiments and is not repeated here. If thre denotes the preset result threshold, the update formula of the output coefficient G may be as follows:
G = 1, if G > thre; G = 0, if G ≤ thre    formula (15)
In this case, if the history fusion output is denoted memory and the first probability output is denoted input, the updated first probability output result may be calculated as follows:
result = G × input + (1 − G) × memory    formula (16)
Combining the above formulas (15) and (16): if the output coefficient G is greater than the preset result threshold thre, then G = 1 and the updated first probability output result is the first probability output input; if the output coefficient G is less than or equal to the preset result threshold thre, then G = 0 and the updated first probability output result is the history fusion output memory.
Through the above steps, fusion processing between the historical memory result (the history fusion output) and the current output result (the first probability output) is achieved: when the output coefficient is greater than the preset result threshold, the current output result is used as the updated first probability output; when the output coefficient is less than or equal to the preset result threshold, the attention-optimized history fusion output of the historical memory result is used as the updated first probability output.
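The gating in formulas (15)–(16) can be transcribed almost literally; the sketch below is only an illustration of that arithmetic, with hypothetical argument names.

```python
def fuse_with_history(input_score, memory_score, gate, thre):
    """Sketch of formulas (15)-(16): a hard gate selects either the current
    output (input) or the attention-optimized history (memory)."""
    g = 1.0 if gate > thre else 0.0                       # formula (15)
    return g * input_score + (1.0 - g) * memory_score     # formula (16)
```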
In the related art, voice wake-up is mainly achieved by recognizing isolated words, i.e., each audio segment contains only one wake-up word (such as "Xiao Bu"), so the input voice fed into the algorithm needs to be accurately segmented. As a result, the related art performs poorly at keyword recognition in continuous speech, such as "Xiao Ou Xiao Ou, how is the weather today". By fusing the historical memory result with the current output result, the first voice matching model can effectively suppress keyword-detection jitter and false wake-ups, solving the problem of continuous-speech keyword recognition.
In an exemplary embodiment, the process of fusing the history memory result (i.e., the second probability output) and the current output result (i.e., the first probability output) may refer to fig. 13, fig. 13 shows a schematic diagram of the history fusion process according to an exemplary embodiment of the present application, and the foregoing description can be seen in the principle and data flow related in the diagram, which is not repeated herein.
In addition, in other exemplary embodiments, the output coefficient may not be updated according to the above equation (15).
Step S260: the updated first probability is output as a first matching result of the first speech matching model matching the input speech.
Step S270: and if the first matching result indicates that the input voice contains the specified text, waking up the terminal.
It should be noted that, for parts not described in detail in this embodiment, reference may be made to the foregoing embodiments, and details are not described herein again.
According to the voice wake-up method provided by this embodiment, the first voice matching model is constructed based on a convolutional neural network, a U-shaped residual structure is introduced on that basis, and the model features are abstracted and extracted multiple times for effective classification, which improves the accuracy of keyword detection while continuously reducing the dimensionality of the model output. In addition, attention optimization is applied to the model, alleviating the loss of model precision when the sequence is too long. Furthermore, the historical memory result and the current output result are fused, which effectively suppresses keyword-detection jitter and false wake-ups, solves the problem of continuous keyword recognition, and improves the wake-up accuracy.
In some embodiments, in order to further reduce the false wake-up rate, the operation of obtaining the first matching result based on the first voice matching model may serve as a primary check; when the first matching result indicates that the input voice contains the specified text, a secondary check is performed based on a second voice matching model to obtain a second matching result, and the terminal is woken up only when the second matching result also indicates that the input voice contains the specified text. Specifically, referring to fig. 14, fig. 14 shows a schematic flow chart of a voice wake-up method according to another embodiment of the present application. In this embodiment, the method may include:
step S310: and acquiring input voice acquired by the audio acquisition device.
Step S320: and matching the input voice based on the first voice matching model to obtain a first probability output.
Step S330: at least one probability output of the first speech matching model that is output before the current first probability output is obtained as a second probability output.
Step S340: and fusing the first probability output and the second probability output to obtain an updated first probability output.
Step S350: and outputting according to the updated first probability to obtain a first matching result of the first voice matching model for matching the input voice.
Step S360: and if the first matching result indicates that the input voice contains the specified text, matching the input voice based on the second voice matching model to obtain a second matching result.
The second speech matching model may be obtained by training on second training data, where the second training data may include multiple positive sample speeches and multiple negative sample speeches; the second training data may be the same as or different from the first training data, which is not limited in this embodiment. Because the second voice matching model is trained on positive sample speeches containing the specified text and negative sample speeches not containing the specified text, it can, like the first voice matching model, match the input voice and determine whether the input voice contains the specified text; when the input voice contains the specified text, the obtained second matching result indicates that the input voice contains the specified text.
In some embodiments, the second speech matching model may be constructed by a Neural Network, and the Neural Network may be, but not limited to, a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), and the like, which is not limited in this embodiment.
The matching rules of the first voice matching model and the second voice matching model are different, wherein the matching rules are algorithms for judging whether the input voice contains the specified text, and therefore the matching rules are different and represent that the algorithms of the first voice matching model and the second voice matching model are different. In one embodiment, the first and second speech matching models may be constructed based on the same neural network, but the number of network layers or the network depth may be different, for example, they may be constructed based on CNN, and the number of convolutional layers may be different. As another embodiment, the first and second speech matching models may also be constructed based on completely different neural networks, or may also be constructed based on partially different neural networks, which is not limited in this embodiment. Therefore, after the input voice is checked for the first time through the first voice matching model, the input voice needs to be checked for the second time through the second voice matching model with different matching rules, so that the input voice entering voiceprint recognition can meet at least two matching rules through the voice matching check, the accuracy of keyword recognition can be improved, and the false awakening rate is reduced.
In this embodiment, the complexity of the first speech matching model is lower than that of the second speech matching model. Therefore, the lower-complexity first voice matching model performs the check and matching first, and the higher-complexity second voice matching model performs the check and matching only when that check passes; that is, a simple algorithm performs the primary check and a complex algorithm performs the secondary check. The complexity may refer to network complexity, the number of network layers, and the like; for example, the second voice matching model may be deeper than the first voice matching model and have more network layers.
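The cascade just described can be summarized in a few lines; the sketch below is illustrative only, and first_model, second_model and contains_text are hypothetical callables standing in for the two matching models and the "contains the specified text" decision.

```python
def cascaded_wake_check(input_speech, first_model, second_model, contains_text):
    """Sketch of steps S360-S370: the low-complexity model screens,
    the high-complexity model confirms; wake only if both agree."""
    first_result = first_model(input_speech)          # primary check
    if not contains_text(first_result):
        return False                                  # keep listening, do not wake
    second_result = second_model(input_speech)        # secondary check
    return contains_text(second_result)               # wake the terminal only on a double pass
```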
In some embodiments, in order to enable the second speech matching model to achieve keyword recognition with higher accuracy, the convergence condition used to determine whether training of the second speech matching model is complete may be stricter than that of the first speech matching model; in addition, the second training data may be larger in amount than the first training data and cover more complex scenes.
In some embodiments, the algorithms of the first and second voice matching models can be stored locally in the terminal and run locally, so that the terminal can directly run the first and second voice matching models locally without depending on a network environment and considering the consumption of communication time, which is beneficial to improving the efficiency of voice awakening. In other embodiments, at least one model may also be stored in the server, which is not limited in this embodiment.
Further, in some embodiments, the terminal may include a first chip and a second chip, the first voice matching model is run on the first chip, and the second voice matching model is run on the second chip, wherein the power consumption of the first chip is lower than the power consumption of the second chip. Therefore, the first voice matching model with low complexity runs on a chip with low power consumption, so that the terminal cannot cause too high power consumption even if the terminal is in a screen-off state and continuously works, thereby supporting long-time monitoring of input voice and primary verification, and realizing voice awakening of the terminal with low power consumption.
In an exemplary embodiment, when the terminal does not collect a voice signal via the audio collector, the second chip may be in a sleep state. When the terminal collects a voice signal via the audio collector and obtains the corresponding input voice, if the input voice does not pass the primary verification of the first voice matching model, the second chip may remain in the sleep state; if the input voice passes the primary verification, the first chip may send an interrupt signal to switch the second chip from the sleep state to the working state and transmit the voice data containing the specified text to the second chip, at which point the first chip may switch from the monitoring state to the sleep state, and the second voice matching model running on the second chip performs the secondary verification on the voice data containing the specified text to obtain the second matching result.
In another exemplary embodiment, the first chip may also be in a working state all the time, the audio collector continuously monitors and collects the voice signal, and if the voice signal is monitored, the corresponding input voice may be obtained and sent to the first voice matching model to perform a first-level verification on the input voice; if the voice signal is not monitored or the voice signal is monitored but does not pass the primary verification, the audio collector still continues to monitor and collect the voice signal and sends the voice signal to the first voice matching model for primary verification; if the voice signal is monitored and the input voice corresponding to the voice signal passes the primary verification, the input voice containing the appointed text can be transmitted to the second voice matching model, meanwhile, the audio collector can be controlled to stop collecting the audio signal, the input voice is subjected to secondary verification based on the second voice matching model, and a second matching result is obtained.
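The two-chip flow above can be pictured as a simple loop; this is only a hedged sketch, and audio.listen(), wake_arm(), sleep_arm() and contains_text() are hypothetical helpers for the audio collector, the interrupt-driven chip hand-off and the matching decision.

```python
def low_power_wake_loop(audio, dsp_model, arm_model, contains_text, wake_arm, sleep_arm):
    """Sketch: the low-power chip keeps listening and only wakes the
    high-power chip when the primary check passes."""
    for speech in audio.listen():                     # continuous low-power monitoring
        if contains_text(dsp_model(speech)):          # primary check on the first chip
            wake_arm()                                # interrupt: first chip -> second chip
            if contains_text(arm_model(speech)):      # secondary check on the second chip
                return True                           # wake the terminal
            sleep_arm()                               # failed: second chip back to sleep
    return False
```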
In an exemplary embodiment, the first chip may be a Digital Signal Processor (DSP), and the second chip may be a RISC microprocessor, such as an ARM (Advanced RISC Machine) chip.
In some examples, the ARM chip is a chip commonly used in terminals and generally works as the main processor when the terminal is awake. Its computing performance is high and it can run algorithms of higher complexity, but it requires higher power consumption and memory occupancy in the working state. If the ARM chip were kept in the working state while the terminal is in the screen-off state, power consumption and memory usage could become excessive; yet more accurate wake-up requires more complex algorithms. Therefore, this embodiment adds a first chip, such as a DSP chip, and runs on it a first speech matching model whose complexity is lower than that of the second speech matching model, so that the user's wake-up words are captured at low power consumption, and potential user wake-up scenes are effectively identified with higher recall and lower precision relative to the second voice matching model.
In some embodiments, because the first speech matching model is based on a convolutional neural network with a U-shaped residual structure and applies attention optimization to the convolutional neural network output, the depth of the model can be increased as much as possible within the hardware limits, improving the accuracy of keyword recognition while keeping the model small enough to run on a low-power chip.
In one example, the audio collector may be integrated with the first chip, so that low power consumption audio collection may be achieved, which is beneficial for continuously monitoring the surrounding audio.
It should be noted that, after the first voice matching model performs primary verification on the input voice, the whole input voice can be transmitted to the second voice matching model, and a voice segment only containing the specified text can also be intercepted from the input voice so as to transmit the voice segment to the second voice matching model, so that the recognition of the second voice matching model on other voice segments not containing the specified text can be omitted, and the keyword recognition efficiency for the specified text is improved.
In some embodiments, if the terminal is in the non-screen-off state, the input voice is matched based on the second voice matching model to obtain the second matching result. In this way, when input voice is acquired, different voice wake-up schemes can be adopted depending on whether the terminal screen is off: when the terminal is in the non-screen-off state, the input voice is matched directly based on the second voice matching model to obtain the second matching result.
In some embodiments, the complexity of the second speech matching model is higher than that of the first speech matching model, and the accuracy of keyword recognition is higher than that of the first speech matching model, so that when the terminal is in a non-screen-off state, the input speech is directly sent to the second speech matching model of the second chip for verification, and the recognition efficiency can be improved while the recognition accuracy is still high.
In addition, in some embodiments, the second voice matching model runs on the second chip with higher power consumption, such as an ARM chip. When the terminal is in the non-screen-off state, the second chip is usually already in the working state, so the second voice matching model running on the second chip can be used directly for a single verification without involving the first voice matching model, which improves recognition efficiency. Moreover, since the recognition accuracy achievable by the second voice matching model is higher than that of the first voice matching model, verifying directly on the second chip also avoids the power consumption caused by running the first chip. In some examples, the first chip may be controlled to be in the sleep state when the terminal is in the non-screen-off state to reduce power consumption as much as possible.
Step S370: and if the second matching result indicates that the input voice contains the specified text, waking up the terminal.
It should be noted that, for parts not described in detail in this embodiment, reference may be made to the foregoing embodiments, and details are not described herein again.
In some embodiments, in order to reduce the power consumption of the terminal and prolong the standby time of the terminal while reducing the false wake-up rate, when the terminal is in a screen-off state, a first voice matching model performs first-stage verification on input voice, and a second voice matching model performs second-stage verification after the first-stage verification is passed; and when the terminal is not in the screen-off state, the first-level verification is not carried out, and the second-level verification is directly carried out on the input voice through the second voice matching model. Specifically, referring to fig. 15, fig. 15 is a schematic flowchart illustrating a voice wakeup method according to another embodiment of the present application, where in this embodiment, the method may include:
step S410: and acquiring input voice acquired by the audio acquisition device.
Step S420: and detecting whether the terminal is in a screen-off state.
After the terminal acquires the input voice collected by the audio collector and before processing the input voice, it can detect whether the terminal is in the screen-off state; if the terminal is in the screen-off state, step S430 can be executed. The screen-off state refers to the state in which the backlight is turned off and the screen is dark; when the terminal is in the screen-off state, the terminal can be regarded as being in a standby state. In some embodiments, the screen-off state may also be referred to as "screen off"; this embodiment has given a definition of the screen-off state, and its specific naming is not limited.
As an embodiment, the terminal may obtain the current screen state of the terminal by calling the screen state detection interface. The screen state comprises a screen extinguishing state and a non-screen extinguishing state, and the power consumption of the screen extinguishing state is lower than that of the non-screen extinguishing state.
In some examples, the detection may be made by calling a screen state detection interface. For example, if the terminal runs an Android (Android) system, whether the terminal is in a screen-off state can be determined according to the returned identifier by calling isScreenOn of PowerManager, for example, if the returned identifier is "false", the terminal can be determined to be in the screen-off state; if the returned flag is "true", it may be determined that the terminal is in a non-screen-off state.
In some embodiments, if the terminal is in the non-screen-off state, the voice matching can be performed only once on the input voice, so as to improve the voice wake-up efficiency. The detailed description of the embodiments can be seen in the following examples, which are not repeated herein. In this embodiment, after detecting whether the terminal is in the screen-off state, the method further includes:
if the terminal is in the screen-off state, step S430 may be executed;
if the terminal is not in the screen-off state, step S480 may be executed.
Step S430: and matching the input voice based on the first voice matching model to obtain a first probability output.
Step S440: at least one probability output of the first speech matching model that is output before the current first probability output is obtained as a second probability output.
Step S450: and fusing the first probability output and the second probability output to obtain an updated first probability output.
Step S460: and outputting according to the updated first probability to obtain a first matching result of the first voice matching model for matching the input voice.
Step S470: and judging whether the first matching result indicates that the input voice contains the specified text.
In this embodiment, if the first matching result indicates that the input speech includes the specified text, step S480 may be executed; if the first matching result indicates that the input speech does not include the specified text, the process returns to step S410 to continue to collect the input speech.
In addition, in some possible embodiments, the method may also be ended, which is not limited herein.
Step S480: and matching the input voice based on the second voice matching model to obtain a second matching result.
Step S490: and if the second matching result indicates that the input voice contains the specified text, waking up the terminal.
If the second matching result indicates that the input voice contains the specified text, the input voice passes the secondary verification, the first voice matching model and the second voice matching model both recognize the specified text from the input voice, and the terminal can be awakened at the moment.
As an implementation, the second voice matching model can run on the second chip, so that a larger and deeper model can be adopted to achieve higher-accuracy recognition on hardware that supports complex algorithms.
In some embodiments, if the second match result indicates that the input speech does not contain the specified text, the secondary verification of the input speech fails.
As another embodiment, when performing the secondary verification on the input voice based on the second voice matching model, the first chip and the second chip are both in the working state, and if the secondary verification fails, the second chip may be controlled to switch from the working state to the dormant state, so as to reduce power consumption.
It should be noted that, for parts not described in detail in this embodiment, reference may be made to the foregoing embodiments, and details are not described herein again.
According to the voice wake-up method provided by this embodiment, when input voice is acquired, it is detected whether the terminal is in the screen-off state; if so, a primary verification is performed on the input voice based on the first voice matching model, a secondary verification is performed based on the second voice matching model after the primary verification passes, and the terminal is woken up after the secondary verification passes. Therefore, when the terminal is in the screen-off state, two checks are implemented by the first and second voice matching models, which have different matching rules, so that input voice can successfully wake up the terminal only after passing both checks, greatly reducing the false wake-up rate. Moreover, since waking the terminal from the screen-off state consumes considerable power, performing the two checks while the screen is off reduces the terminal's power consumption as well as its false wake-up rate.
In addition, in some embodiments, no matter whether the first voice matching model is used for primary verification or the second voice matching model is used for secondary verification, or the first voice matching model is used for primary verification first and then the second voice matching model is used for secondary verification after the first voice matching model passes the primary verification, when the input voice is judged to contain the specified text, the voice containing the specified text can be subjected to voiceprint recognition, so that the use safety of the terminal is further improved, and unnecessary power consumption or security threats to the terminal caused by the fact that other people awaken the terminal randomly is avoided.
In an embodiment, taking voiceprint recognition after passing the secondary verification as an example, an operation method of voiceprint recognition is described, specifically, please refer to fig. 16, fig. 16 shows a flowchart of step S490 in fig. 15 according to an exemplary embodiment of the present application, in this embodiment, step S490 may include:
step 491: and if the second matching result indicates that the input voice contains the specified text, performing voiceprint recognition on the input voice.
If the second matching result indicates that the input voice contains the specified text, the input voice passes the secondary verification, the first voice matching model and the second voice matching model both recognize the specified text from the input voice, and at the moment, the input voice can be sent to a voiceprint recognition algorithm so as to perform voiceprint recognition on the input voice.
In some embodiments, the terminal may store a voiceprint template in advance, and the number of the voiceprint templates may be multiple, for example, a voiceprint template of user a, a voiceprint template of user B, and the like may be stored. The voiceprint template is used to match the voiceprint characteristics of the input speech. And when the second matching result indicates that the input voice contains the specified text, extracting the voiceprint features in the input voice, matching the voiceprint features in the voiceprint template, if the voiceprint template matched with the voiceprint features exists, judging that the voiceprint recognition passes the verification, and if the voiceprint template matched with the voiceprint features does not exist, judging that the voiceprint recognition does not pass the verification.
In some embodiments, the voiceprint template may be stored through the aforementioned wake-up word setting page, and specifically, a user may enter a voice containing a specified text through the wake-up word setting page, and extract a voiceprint feature from the input voice by the terminal as a voiceprint template corresponding to the specified text, and store the voiceprint template for voiceprint verification during the biometric identification.
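For illustration, matching an input voiceprint feature against the stored templates can be sketched as a similarity search; this is only an assumed example, with cosine similarity and the 0.7 threshold chosen for the sketch rather than specified by this application.

```python
import numpy as np

def match_voiceprint(input_embedding, templates, sim_threshold=0.7):
    """Sketch: compare the voiceprint feature of the input speech against
    pre-stored voiceprint templates; return the matched user or None."""
    for user, template in templates.items():
        sim = np.dot(input_embedding, template) / (
            np.linalg.norm(input_embedding) * np.linalg.norm(template) + 1e-8)
        if sim > sim_threshold:
            return user          # matched template -> voiceprint verification passes
    return None                  # no match -> voiceprint verification fails
```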
In some embodiments, a voiceprint recognition algorithm can be run on the second chip with the second speech matching model.
Step S492: and if the voiceprint identification is verified, awakening the terminal.
If the voiceprint recognition passes the verification, the input voice not only passes the verification of two different matching rules, but also the voiceprint features in the input voice pass the verification, at the moment, the voice awakening is judged to be successful, the terminal is awakened, and the false awakening rate can be greatly reduced.
If the voiceprint identification passes the verification, the terminal can be awakened to execute the preset operation. As an implementation manner, the terminal may store a mapping relationship table between the voiceprint template and the preset operation in advance, and according to the voiceprint template matched with the voiceprint feature of the input voice, the corresponding preset operation may be determined, and then the terminal may be awakened to execute the preset operation. The preset operation may include, but is not limited to, a screen-on operation, an unlocking operation, and the like, which is not limited in this embodiment.
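The mapping relationship table mentioned above can be as simple as a lookup from the matched template to its bound operation; the entries below are purely illustrative placeholders.

```python
# Hypothetical mapping between voiceprint templates and preset operations
PRESET_OPERATIONS = {
    "user_a_template": "light_screen",
    "user_b_template": "unlock",
}

def operation_for(matched_template):
    """Look up the preset operation bound to the matched voiceprint template."""
    return PRESET_OPERATIONS.get(matched_template)
```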
In some embodiments, if the voiceprint recognition is verified, the terminal may be awakened to switch from the screen-off state to a non-screen-off state, where the non-screen-off state may include a to-be-unlocked state in which the screen is lit and the unlocking interface is displayed, and may further include an unlocked state in which the screen is lit and the unlocking interface is not displayed.
In other embodiments, if the voiceprint recognition passes the verification, the terminal can be woken up and the unlocking operation executed, so that the user can unlock the terminal directly by voice, enabling accurate, secure and convenient unlocking. For example, after the user says "Xiao Ou Xiao Ou" to the terminal and the voiceprint recognition passes the verification, the terminal screen may be lit up and an unlocked interface displayed; the interface shown before the screen was last locked may be displayed, or the desktop may be displayed, which is not limited herein.
In still other embodiments, if the voiceprint recognition is verified, the environment information may also be obtained, whether the current scene is the designated payment scene is determined, and if the current scene is the designated payment scene, the terminal may be awakened and the payment operation corresponding to the designated payment scene may be completed. In one example, the terminal can pre-input the bus card information, when a user takes a bus, the user can enable the terminal to be close to the card swiping machine and speak the awakening word 'Xiao Ou', the terminal can obtain the NFC signal sent by the card swiping machine at the moment to determine that the current scene is the bus payment scene, and awakens the terminal, payment is completed based on the pre-input bus card information, and therefore convenient and safe payment can be achieved.
It is understood that the above is only an example, and the method provided by the present embodiment is not limited to the above scenario, but is not exhaustive here for reasons of space. In addition, the operations performed by the wake-up terminal may be equally applied to any of the foregoing embodiments.
In some embodiments, a specific implementation manner of step S492 may be to wake up the terminal to execute the target command if the voiceprint recognition is verified, wherein the target command is bound to the voiceprint template matching the voiceprint feature of the input voice. Specifically, the terminal may pre-store a mapping relationship table between the voiceprint template and the control instruction, and if the voiceprint recognition passes the verification, the terminal may obtain the corresponding control instruction as the target instruction according to the voiceprint template matched with the voiceprint feature of the input voice, and wake up the terminal to execute the target instruction. The control instruction may be an unlocking operation, a voice assistant activation, a payment operation, and the like, which is not limited in this embodiment.
In some embodiments, the mapping relationship table may further store a screen state of the terminal, a current display interface, or other current terminal information, and store the screen state, the current display interface, or other current terminal information in correspondence with the voiceprint template and the control instruction, so that when the input voice is recognized through a voiceprint, the corresponding control instruction may be determined according to a voiceprint feature of the input voice and the current terminal information.
In some embodiments, the non-screen-off state may include a to-be-unlocked state in which the screen is lit and the unlocking interface is displayed, and may also include an unlocked state in which the screen is lit and the unlocking interface is not displayed. Based on the foregoing embodiment, if the terminal is in the to-be-unlocked state, the terminal may be woken up to perform an unlocking operation, activate a voice assistant, and the like when the voiceprint recognition passes verification. If the terminal is in the unlocked state, or the terminal currently displays an interface of an application with a payment function, the terminal may be woken up to perform a payment operation and the like when the voiceprint recognition passes verification; that is, identity authentication based on voiceprint recognition is realized by this scheme.
It is understood that the above is only an example, and the method provided by the present embodiment is not limited to the above scenario, but is not exhaustive here for reasons of space.
In addition, in some embodiments, if the terminal is in the non-screen-off state, it may be detected whether a voiceprint template is pre-stored in the terminal. If no voiceprint template is pre-stored, the input voice may be recognized, the text content corresponding to the input voice determined as the specified text, and the voiceprint feature of the input voice extracted as a voiceprint template; the specified text and the voiceprint template corresponding to the input voice are stored in the terminal, and the first and second voice matching models and the voiceprint recognition algorithm are trained based on the specified text and the voiceprint template. In this way, the next time input voice is obtained, it can be verified whether the input voice contains the specified text, and when it does, the voiceprint feature corresponding to the input voice can be verified against the pre-stored voiceprint template. Therefore, when no voiceprint template is pre-stored in the terminal, keyword detection and storage can be performed on the voice input by the user, so that the voice wake-up method provided by this embodiment of the application can be carried out next time.
As an embodiment, if the voiceprint template is not pre-stored, the audio collector may continuously collect the voice signal, so that the user may repeatedly speak the wakeup word to be stored for a plurality of times for subsequent keyword detection and storage. It can be understood that the more the repetition times are, the more accurate the stored voiceprint template is, the better the subsequent training effect on the first and second speech matching models and the voiceprint recognition algorithm is, and the more stable the recognition is.
In some embodiments, if the voiceprint recognition fails to be verified or the secondary verification fails, the audio collector may be controlled to continue monitoring and collecting the voice signal, and obtain the corresponding input voice and send the corresponding input voice to the first voice matching model.
In other embodiments, if the voiceprint recognition fails to be verified or the secondary verification fails, the second chip may be further controlled to switch from the working state to the sleep state, so as to reduce the larger power consumption introduced by the second chip, and the first chip may be controlled to switch from the sleep state to the monitoring state, and continue to monitor and collect the voice signal.
Therefore, according to the voice wake-up method provided by this embodiment, when input voice is acquired, it is detected whether the terminal is in the screen-off state; if so, a primary verification is performed on the input voice based on the first voice matching model, a secondary verification is performed based on the second voice matching model after the primary verification passes, voiceprint recognition is performed on the input voice after the secondary verification also passes, and the terminal is woken up after the voiceprint recognition passes. Thus, when the terminal is in the screen-off state, two checks are implemented based on the first and second voice matching models with different matching rules, followed by voiceprint recognition, so that input voice can successfully wake up the terminal only after passing both matching rules and the voiceprint recognition, greatly reducing the false wake-up rate. Moreover, since waking the terminal from the screen-off state consumes considerable power, performing the two checks while the screen is off reduces the terminal's power consumption as well as its false wake-up rate.
In a specific example, a low-power first chip runs the first voice matching model to capture input voice in all-day scenarios and perform the primary verification on it; after the primary verification passes, the input voice is transmitted to the second chip running the second voice matching model, which performs the secondary verification; voiceprint recognition is performed on the voice containing the specified text only after the secondary verification passes, and the terminal is woken up after the voiceprint recognition passes. In this way, an all-day, low-power primary verification with relatively high recall can be realized, the secondary verification can be performed with higher precision than the primary verification so that the terminal can judge more accurately whether the input voice contains the specified text, and performing voiceprint recognition after the secondary verification passes prevents the terminal from being woken up arbitrarily by others, improving the security of terminal use.
Referring to fig. 17, a block diagram of a voice wake-up apparatus 1700 according to an embodiment of the present application is shown, where the voice wake-up apparatus 1700 is applicable to the terminal, and the voice wake-up apparatus 1700 includes: the voice obtaining module 1710, the first output module 1720, the second output module 1730, the output updating module 1740, the result obtaining module 1750, and the terminal awakening module 1760, specifically:
the voice acquisition module is used for acquiring the input voice acquired by the audio acquisition device;
a first output module, configured to match the input speech based on a first speech matching model, resulting in a first probability output, where the first probability output is used to indicate a probability that the input speech includes the specified text;
a second output module, configured to obtain at least one probability output of the first speech matching model output before the current first probability output, as a second probability output;
an output update module for fusing the first probability output with the second probability output to obtain an updated first probability output;
a result obtaining module, configured to output the updated first probability as a first matching result of the first speech matching model matching the input speech;
and a terminal wake-up module, configured to wake up the terminal if the first matching result indicates that the input voice contains the specified text.
Further, the output update module includes: history extraction submodule, first history fusion submodule, attention processing submodule and second history fusion submodule, wherein:
the history extraction submodule is used for extracting the characteristics of the second probability output to obtain a first history characteristic and a second history characteristic;
a first history fusion submodule, configured to fuse the first probability output with the first history feature to obtain a history attention weight vector corresponding to the second probability output;
the attention processing submodule is used for carrying out weighting processing on the second historical characteristic according to the historical attention weight vector to obtain historical fusion output;
a second history fusion submodule, configured to fuse the first probability output with the history fusion output to obtain an updated first probability output.
Further, the first speech matching model comprises a first bidirectional recurrent neural network layer and a second bidirectional recurrent neural network layer, and the history extraction sub-module comprises: a first history extraction unit and a second history extraction unit, wherein:
a first history extraction unit, configured to perform feature extraction on the historical attention output through the first bidirectional recurrent neural network layer to obtain a first history feature;
and the second history extraction unit is used for performing feature extraction on the history attention output through the second bidirectional recurrent neural network layer to obtain a second history feature.
Further, the second history fusion sub-module comprises: coefficient extraction unit, first output unit and second output unit, wherein:
a coefficient extraction unit, configured to perform feature extraction on the first probability output to obtain an output coefficient corresponding to the first probability output;
a first output unit, configured to use the first probability output as the updated first probability output if the output coefficient is greater than a preset result threshold;
and a second output unit, configured to use the history fusion output as the updated first probability output if the output coefficient is less than or equal to the preset result threshold.
Further, the first speech matching model is a convolutional neural network model, and the first output module includes: an acoustic feature extraction submodule and a first probability output submodule, wherein:
the acoustic feature extraction submodule is used for extracting acoustic features of the input voice, and performing convolution operation on the acoustic features through the first voice matching model to obtain convolution neural network output;
and the first probability output submodule is used for matching the output of the convolutional neural network with the acoustic characteristics corresponding to the specified text to acquire a first probability output.
Further, the first probability output submodule includes: weight extraction unit, weighting processing unit and probability output unit, wherein:
the weight extraction unit is used for extracting attention weight of the output of the convolutional neural network according to channels to obtain an attention weight vector corresponding to the output of the convolutional neural network;
the weighting processing unit is used for carrying out weighting processing on the output of the convolutional neural network according to the attention weight vector to obtain attention output characteristics;
and the probability output unit is used for matching the attention output characteristics with the acoustic characteristics corresponding to the specified text to acquire first probability output.
Further, the terminal wake-up module includes: a secondary verification sub-module and a secondary wake-up sub-module, wherein:
the secondary verification sub-module is used for matching the input voice based on a second voice matching model to obtain a second matching result if the first matching result indicates that the input voice contains the specified text, wherein the matching rules of the first voice matching model and the second voice matching model are different;
and the secondary awakening sub-module is used for awakening the terminal if the second matching result indicates that the input voice contains the specified text.
Further, the terminal comprises a first chip and a second chip, the first voice matching model runs on the first chip, the second voice matching model runs on the second chip, and the power consumption of the first chip is lower than that of the second chip.
Further, the first chip is a DSP chip, and the second chip is an ARM chip.
Further, the secondary wake-up sub-module includes: voiceprint recognition unit and voiceprint wake-up unit, wherein:
a voiceprint recognition unit, configured to perform voiceprint recognition on the input speech if the second matching result indicates that the input speech includes the specified text;
and the voiceprint awakening unit is used for awakening the terminal if the voiceprint identification passes the verification.
Further, the first output module includes: a screen-off detection sub-module and a first output sub-module, wherein:
the screen-off detection submodule is used for detecting whether the terminal is in a screen-off state;
and the first output sub-module is used for matching the input voice based on a first voice matching model to obtain a first probability output.
Further, after detecting whether the terminal is in the screen-off state, the voice wake-up apparatus 1700 further includes: a non-screen-off matching module and a non-screen-off wake-up module, wherein:
the non-screen-off matching module is used for matching the input voice based on the second voice matching model to obtain a second matching result if the terminal is not in a screen-off state;
and the non-screen-off awakening module is used for awakening the terminal if the second matching result indicates that the input voice contains the specified text.
The voice wake-up device provided in the embodiment of the present application is used to implement the corresponding voice wake-up method in the foregoing method embodiments, and has the beneficial effects of the corresponding method embodiments, which are not described herein again.
In the several embodiments provided in the present application, the coupling between the modules may be electrical, mechanical or other type of coupling.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
Referring to fig. 18, a block diagram of an electronic device according to an embodiment of the present application is shown. The electronic device 1800 may be a smartphone, a tablet computer, an electronic book, a notebook computer, a personal computer, or the like, capable of running an application. The electronic device 1800 in the present application may include one or more of the following components: a processor 1810, memory 1820, and one or more applications, wherein the one or more applications may be stored in the memory 1820 and configured to be executed by the one or more processors 1810, the one or more programs configured to perform a method as described in the aforementioned method embodiments.
The processor 1810 may be implemented in at least one hardware form of a DSP chip, an ARM chip, a Field-Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), and the like. The processor 1810 may integrate one or a combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like, where the CPU mainly handles the operating system, user interface, applications, and so on; the GPU is responsible for rendering and drawing display content; and the modem is used for handling wireless communication. It is understood that the modem may also not be integrated into the processor 1810 and may instead be implemented separately by a communication chip.
The memory 1820 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory 1820 may be used to store instructions, programs, code sets, or instruction sets. The memory 1820 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for implementing at least one function (e.g., a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the foregoing method embodiments, and the like. The stored data area may store data created during use of the electronic device 1800 (e.g., phone book, audio and video data, chat log data), and so forth.
In some embodiments, the electronic device 1800 is provided with an audio collector, which can be used to collect voice signals and transmit them to the processor 1810 for processing and also to the memory 1820 for data storage.
In some embodiments, the audio collector may be disposed within the processor 1810, for example, the processor 1810 may include a first chip and a second chip, and the audio collector may be integrated with the first chip. As an example, the first chip may be a DSP chip and the second chip may be an ARM chip.
Referring to fig. 19, a block diagram of a computer-readable storage medium according to an embodiment of the present application is shown. The computer-readable storage medium 1900 has program code stored therein, which can be called by a processor to execute the method described in the above embodiments.
The computer-readable storage medium 1900 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable and programmable read only memory), an EPROM, a hard disk, or a ROM. Alternatively, the computer-readable storage medium 1900 includes a non-volatile computer-readable storage medium. The computer-readable storage medium 1900 has storage space for program code 1910 that performs any of the method steps described above. The program code can be read from or written to one or more computer program products. The program code 1910 may be compressed, for example, in a suitable form.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit them; although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; and such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (15)

1. A voice wake-up method, applied to a terminal provided with an audio collector, the method comprising:
acquiring input voice collected by the audio collector;
matching the input voice based on a first voice matching model to obtain a first probability output, wherein the first probability output is used for indicating the probability that the input voice contains a specified text;
acquiring at least one probability output that the first voice matching model outputs before the current first probability output, as a second probability output;
fusing the first probability output with the second probability output to obtain an updated first probability output;
taking the updated first probability output as a first matching result of the first voice matching model matching the input voice;
and if the first matching result indicates that the input voice contains the specified text, awakening the terminal.
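For illustration only and not part of the claims: a minimal Python sketch, under assumed names, thresholds and a toy probability stream, of the overall loop in claim 1 that keeps a short history of the first model's probability outputs and fuses the current output with that history before deciding whether to awaken the terminal. A simple average stands in for the fusion; claims 2-4 describe the attention-based fusion actually proposed.

    from collections import deque

    HISTORY_LEN = 3        # how many earlier outputs are kept as the "second probability output"
    WAKE_THRESHOLD = 0.8   # assumed decision threshold

    history = deque(maxlen=HISTORY_LEN)

    def fuse(current, past):
        # Placeholder fusion: average the current output with the mean of the history.
        if not past:
            return current
        return 0.5 * current + 0.5 * (sum(past) / len(past))

    def step(first_prob):
        # One matching step: fuse the current first probability output with the stored
        # history, append the raw output for later frames, and report whether the
        # fused score crosses the wake-up threshold.
        updated = fuse(first_prob, list(history))
        history.append(first_prob)
        return updated > WAKE_THRESHOLD

    # Toy stream of per-frame keyword probabilities: the isolated spike at 0.95 is
    # suppressed by the history, while the sustained run at the end triggers a wake-up.
    for p in [0.1, 0.95, 0.1, 0.9, 0.95, 0.97]:
        if step(p):
            print("awaken the terminal")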
2. The method of claim 1, wherein the fusing the first probability output with the second probability output to obtain an updated first probability output comprises:
performing feature extraction on the second probability output to obtain a first historical feature and a second historical feature;
fusing the first probability output with the first historical feature to obtain a historical attention weight vector corresponding to the second probability output;
weighting the second historical feature according to the historical attention weight vector to obtain a history fused output;
and fusing the first probability output with the history fused output to obtain the updated first probability output.
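For illustration only and not part of the claims: a minimal numpy sketch of the fusion described in claim 2, where the random projection matrices W1, W2 and Wq stand in for learned feature extractors and all shapes are assumptions (claim 3 uses two bidirectional recurrent neural network layers instead).

    import numpy as np

    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()

    def fuse_with_history(first_prob, second_prob, W1, W2, Wq):
        # Feature extraction on the historical outputs: first and second historical features.
        hist_feat_1 = second_prob @ W1            # shape (T, d)
        hist_feat_2 = second_prob @ W2            # shape (T, d)
        # Fuse the current output with the first historical feature -> attention weights.
        query = first_prob @ Wq                   # shape (d,)
        weights = softmax(hist_feat_1 @ query)    # one weight per historical output
        # Weight the second historical feature by the attention vector -> history fused output.
        history_fused = weights @ hist_feat_2     # shape (d,)
        # Fuse the current output with the history fused output (claim 4 gates this step).
        return 0.5 * first_prob + 0.5 * history_fused

    # Toy usage with hypothetical sizes: 3 historical outputs, output/feature dimension 4.
    rng = np.random.default_rng(0)
    first, second = rng.random(4), rng.random((3, 4))
    W1, W2, Wq = rng.random((4, 4)), rng.random((4, 4)), rng.random((4, 4))
    print(fuse_with_history(first, second, W1, W2, Wq))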
3. The method of claim 2, wherein the first voice matching model comprises a first bidirectional recurrent neural network layer and a second bidirectional recurrent neural network layer, and wherein the performing feature extraction on the second probability output to obtain a first historical feature and a second historical feature comprises:
performing feature extraction on the second probability output through the first bidirectional recurrent neural network layer to obtain the first historical feature;
and performing feature extraction on the second probability output through the second bidirectional recurrent neural network layer to obtain the second historical feature.
4. The method of claim 2, wherein the fusing the first probability output with the history fused output to obtain the updated first probability output comprises:
performing feature extraction on the first probability output to obtain an output coefficient corresponding to the first probability output;
if the output coefficient is greater than a preset result threshold, taking the first probability output as the updated first probability output;
and if the output coefficient is smaller than or equal to the preset result threshold, taking the history fused output as the updated first probability output.
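For illustration only and not part of the claims: a minimal sketch of the gate in claim 4, where the output coefficient is modeled as a sigmoid of a linear projection of the first probability output; the projection weights w_gate and the preset result threshold are hypothetical.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gated_update(first_prob, history_fused, w_gate, result_threshold=0.5):
        # Derive a scalar output coefficient from the current first probability output,
        # then pick either the current output or the history fused output.
        coefficient = sigmoid(float(first_prob @ w_gate))
        if coefficient > result_threshold:
            return first_prob          # keep the current first probability output
        return history_fused           # otherwise fall back to the history fused output

    # Toy usage with a hypothetical 4-dimensional output.
    rng = np.random.default_rng(1)
    print(gated_update(rng.random(4), rng.random(4), rng.random(4) - 0.5))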
5. The method of claim 1, wherein the first voice matching model is a convolutional neural network model, and wherein the matching the input voice based on the first voice matching model to obtain a first probability output comprises:
extracting acoustic features of the input voice, and performing a convolution operation on the acoustic features through the first voice matching model to obtain a convolutional neural network output;
and matching the convolutional neural network output with acoustic features corresponding to the specified text to obtain the first probability output.
6. The method of claim 5, wherein the matching the convolutional neural network output with acoustic features corresponding to the specified text to obtain the first probability output comprises:
extracting attention weights from the convolutional neural network output channel by channel to obtain an attention weight vector corresponding to the convolutional neural network output;
weighting the convolutional neural network output according to the attention weight vector to obtain an attention output feature;
and matching the attention output feature with the acoustic features corresponding to the specified text to obtain the first probability output.
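For illustration only and not part of the claims: a minimal sketch of the channel-wise attention in claim 6, assuming mean pooling per channel followed by a softmax over the pooled values; the patent does not fix the exact form of the weight extraction.

    import numpy as np

    def channel_attention(cnn_out):
        # cnn_out: (channels, time, frequency) feature map from the convolution in claim 5.
        squeezed = cnn_out.mean(axis=(1, 2))                   # one scalar per channel
        weights = np.exp(squeezed) / np.exp(squeezed).sum()    # attention weight vector
        return cnn_out * weights[:, None, None]                # attention output features

    # Toy usage with a hypothetical 8-channel feature map over 20 frames and 40 bins;
    # the attended features would then be matched against the keyword's acoustic features.
    rng = np.random.default_rng(2)
    print(channel_attention(rng.random((8, 20, 40))).shape)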
7. The method according to claim 1, wherein the waking up the terminal if the first matching result indicates that the input voice contains the specified text comprises:
if the first matching result indicates that the input voice contains the specified text, matching the input voice based on a second voice matching model to obtain a second matching result, wherein the matching rules of the first voice matching model and the second voice matching model are different;
and if the second matching result indicates that the input voice contains the specified text, awakening the terminal.
8. The method of claim 7, wherein the terminal comprises a first chip and a second chip, wherein the first voice matching model runs on the first chip and the second voice matching model runs on the second chip, and wherein the power consumption of the first chip is lower than the power consumption of the second chip.
9. The method of claim 8, wherein the first chip is a DSP chip and the second chip is an ARM chip.
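For illustration only and not part of the claims: a minimal sketch of the cascade in claims 7-9, where a lightweight first model (for example, one running on the low-power DSP chip) screens the input and only accepted frames reach the heavier second model (for example, one running on the ARM chip); the model callables and the threshold are assumptions.

    def two_stage_wakeup(frame, first_model, second_model, threshold=0.8):
        # The lightweight first model screens every frame; only frames it accepts are
        # handed to the heavier second model, whose decision triggers the wake-up.
        if first_model(frame) <= threshold:
            return False               # first stage rejects: the second model never runs
        return second_model(frame)     # second stage confirms or rejects the wake-up

    # Toy usage with stand-in models.
    print(two_stage_wakeup("frame", first_model=lambda f: 0.9, second_model=lambda f: True))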
10. The method according to claim 7, wherein the waking up the terminal if the second matching result indicates that the input voice contains the specified text comprises:
if the second matching result indicates that the input voice contains the specified text, performing voiceprint recognition on the input voice;
and if the voiceprint recognition is passed, awakening the terminal.
11. The method according to any one of claims 7-10, wherein the matching the input voice based on a first voice matching model to obtain a first probability output comprises:
detecting whether the terminal is in a screen-off state;
and if the terminal is in the screen-off state, matching the input voice based on the first voice matching model to obtain the first probability output.
12. The method of claim 11, wherein after detecting whether the terminal is in a screen-off state, the method further comprises:
if the terminal is not in the screen-off state, matching the input voice based on the second voice matching model to obtain a second matching result;
and if the second matching result indicates that the input voice contains the specified text, awakening the terminal.
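For illustration only and not part of the claims: a minimal sketch of the routing in claims 11 and 12, under the assumption that the screen-off state decides whether the low-power first stage runs before the second voice matching model; all names are illustrative.

    def route_matching(frame, screen_off, first_model, second_model, threshold=0.8):
        # Screen off: run the low-power first model as a gate before the second model.
        # Screen on: the main processor is already active, so go straight to the second model.
        if screen_off:
            if first_model(frame) <= threshold:
                return False           # first stage rejects, the terminal stays asleep
            return second_model(frame)
        return second_model(frame)

    # Toy usage.
    print(route_matching("frame", screen_off=True,
                         first_model=lambda f: 0.95, second_model=lambda f: True))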
13. A voice wake-up device, applied to a terminal provided with an audio collector, the device comprising:
a voice acquisition module, configured to acquire input voice collected by the audio collector;
a first output module, configured to match the input voice based on a first voice matching model to obtain a first probability output, wherein the first probability output is used for indicating the probability that the input voice contains a specified text;
a second output module, configured to acquire at least one probability output that the first voice matching model outputs before the current first probability output, as a second probability output;
an output update module, configured to fuse the first probability output with the second probability output to obtain an updated first probability output;
a result obtaining module, configured to take the updated first probability output as a first matching result of the first voice matching model matching the input voice;
and a terminal awakening module, configured to awaken the terminal if the first matching result indicates that the input voice contains the specified text.
14. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications being configured to perform the method of any one of claims 1-12.
15. A computer-readable storage medium having program code stored therein, the program code being invoked by a processor to perform the method of any of claims 1-12.
CN202010312299.XA 2020-04-20 2020-04-20 Voice wake-up method and device, electronic equipment and storage medium Active CN111508493B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010312299.XA CN111508493B (en) 2020-04-20 2020-04-20 Voice wake-up method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111508493A true CN111508493A (en) 2020-08-07
CN111508493B CN111508493B (en) 2022-11-15

Family

ID=71877613

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010312299.XA Active CN111508493B (en) 2020-04-20 2020-04-20 Voice wake-up method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111508493B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10365887B1 (en) * 2016-03-25 2019-07-30 Amazon Technologies, Inc. Generating commands based on location and wakeword
US20180158449A1 (en) * 2016-12-02 2018-06-07 Baidu Online Network Technology (Beijing) Co., Ltd. Method and device for waking up via speech based on artificial intelligence
CN108122556A (en) * 2017-08-08 2018-06-05 问众智能信息科技(北京)有限公司 Reduce the method and device that driver's voice wakes up instruction word false triggering
WO2019149108A1 (en) * 2018-01-31 2019-08-08 腾讯科技(深圳)有限公司 Identification method and device for voice keywords, computer-readable storage medium, and computer device
CN109243446A (en) * 2018-10-01 2019-01-18 厦门快商通信息技术有限公司 A kind of voice awakening method based on RNN network
CN109448719A (en) * 2018-12-11 2019-03-08 网易(杭州)网络有限公司 Establishment of Neural Model method and voice awakening method, device, medium and equipment
CN109817200A (en) * 2019-01-30 2019-05-28 北京声智科技有限公司 The optimization device and method that voice wakes up
CN110570858A (en) * 2019-09-19 2019-12-13 芋头科技(杭州)有限公司 Voice awakening method and device, intelligent sound box and computer readable storage medium
CN110570861A (en) * 2019-09-24 2019-12-13 Oppo广东移动通信有限公司 method and device for voice wake-up, terminal equipment and readable storage medium
CN110718223A (en) * 2019-10-28 2020-01-21 百度在线网络技术(北京)有限公司 Method, apparatus, device and medium for voice interaction control
CN110767218A (en) * 2019-10-31 2020-02-07 南京励智心理大数据产业研究院有限公司 End-to-end speech recognition method, system, device and storage medium thereof

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112151015A (en) * 2020-09-03 2020-12-29 腾讯科技(深圳)有限公司 Keyword detection method and device, electronic equipment and storage medium
CN112151015B (en) * 2020-09-03 2024-03-12 腾讯科技(深圳)有限公司 Keyword detection method, keyword detection device, electronic equipment and storage medium
CN112530421A (en) * 2020-11-03 2021-03-19 科大讯飞股份有限公司 Voice recognition method, electronic equipment and storage device
CN112509596A (en) * 2020-11-19 2021-03-16 北京小米移动软件有限公司 Wake-up control method and device, storage medium and terminal
CN113035231A (en) * 2021-03-18 2021-06-25 三星(中国)半导体有限公司 Keyword detection method and device
CN113035231B (en) * 2021-03-18 2024-01-09 三星(中国)半导体有限公司 Keyword detection method and device

Also Published As

Publication number Publication date
CN111508493B (en) 2022-11-15

Similar Documents

Publication Publication Date Title
CN111508493B (en) Voice wake-up method and device, electronic equipment and storage medium
CN110534099B (en) Voice wake-up processing method and device, storage medium and electronic equipment
WO2021093449A1 (en) Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
WO2020083110A1 (en) Speech recognition and speech recognition model training method and apparatus
CN111880856B (en) Voice wakeup method and device, electronic equipment and storage medium
CN110570873B (en) Voiceprint wake-up method and device, computer equipment and storage medium
WO2021136054A1 (en) Voice wake-up method, apparatus and device, and storage medium
CN110148405B (en) Voice instruction processing method and device, electronic equipment and storage medium
CN107767863A (en) voice awakening method, system and intelligent terminal
CN111327949B (en) Video time sequence action detection method, device, equipment and storage medium
WO2022048319A1 (en) Switching method and apparatus for multiple user accounts, electronic device, and storage medium
CN110570840A (en) Intelligent device awakening method and device based on artificial intelligence
CN108595406B (en) User state reminding method and device, electronic equipment and storage medium
CN114333774B (en) Speech recognition method, device, computer equipment and storage medium
CN111326146A (en) Method and device for acquiring voice awakening template, electronic equipment and computer readable storage medium
CN115798459B (en) Audio processing method and device, storage medium and electronic equipment
CN115881124A (en) Voice wake-up recognition method, device and storage medium
CN111048068B (en) Voice wake-up method, device and system and electronic equipment
CN110598762A (en) Audio-based trip mode detection method and device and mobile terminal
CN116129942A (en) Voice interaction device and voice interaction method
CN113851113A (en) Model training method and device and voice awakening method and device
CN112580472A (en) Rapid and lightweight face recognition method and device, machine readable medium and equipment
CN114817456B (en) Keyword detection method, keyword detection device, computer equipment and storage medium
CN116705013B (en) Voice wake-up word detection method and device, storage medium and electronic equipment
WO2023168713A1 (en) Interactive speech signal processing method, related device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant