CN112669818A - Voice wake-up method and device, readable storage medium and electronic equipment - Google Patents



Publication number
CN112669818A
Authority
CN
China
Prior art keywords: voice, awakening, wake, feature, probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011453041.8A
Other languages: Chinese (zh)
Other versions: CN112669818B (en)
Inventor
单长浩
Current Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Original Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority to CN202011453041.8A
Publication of CN112669818A
Application granted
Publication of CN112669818B
Legal status: Active
Anticipated expiration


Abstract

Disclosed are a voice wake-up method and apparatus, a computer-readable storage medium, and an electronic device. The method comprises: determining at least one first voice feature corresponding to voice data through a first feature extraction network; determining a phoneme probability distribution corresponding to each of the at least one first voice feature through a first wake-up model; determining an attention feature corresponding to the at least one first voice feature through a second wake-up model; and determining a wake-up determination result according to the phoneme probability distributions corresponding to the at least one first voice feature and the attention feature corresponding to the at least one first voice feature. In this technical scheme, the wake-up determination result is determined by combining the phoneme probability sequence output by the first wake-up model with the attention feature output by the second wake-up model, which improves the accuracy of the wake-up determination result and reduces the false wake-up rate.

Description

Voice wake-up method and device, readable storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of voice recognition technologies, and in particular, to a voice wake-up method and apparatus, a readable storage medium, and an electronic device.
Background
With the development of voice recognition technology, voice wake-up (that is, a user wakes up an intelligent terminal by speaking a wake-up word, so that the intelligent terminal enters a state of waiting for a voice instruction or directly executes a predetermined voice instruction) is becoming more and more popular.
At present, when voice awakening is realized, a voice awakening system based on an attention mechanism is mainly trained, and a voice awakening function is realized by utilizing the trained voice awakening system.
However, the attention mechanism in such a voice wake-up system tends to be overconfident in its learned knowledge, which results in relatively low system performance and a relatively high false wake-up rate.
Disclosure of Invention
The present disclosure is proposed to solve the above technical problems. The embodiment of the disclosure provides a voice awakening method and device, a computer readable storage medium and an electronic device, wherein an awakening judgment result is determined by combining a phoneme probability sequence output by a first awakening model and an attention feature output by a second awakening model, so that the accuracy of the awakening judgment result is improved, and the false awakening rate is reduced.
According to an aspect of the present disclosure, there is provided a voice wake-up method, including:
determining at least one first voice feature corresponding to the voice data through a first feature extraction network;
determining a phoneme probability distribution corresponding to each of the at least one first voice feature through a first wake-up model;
determining attention characteristics corresponding to the at least one first voice characteristic through a second awakening model;
and determining an awakening judgment result according to the phoneme probability distribution corresponding to the at least one first voice characteristic and the attention characteristic corresponding to the at least one first voice characteristic.
According to a second aspect of the present disclosure, there is provided a voice wake-up apparatus comprising:
the feature extraction module is used for determining at least one first voice feature corresponding to the voice data through a first feature extraction network;
the first processing module is used for determining phoneme probability distribution corresponding to the at least one first voice feature through a first awakening model;
the second processing module is used for determining attention characteristics corresponding to the at least one first voice characteristic through a second awakening model;
and the awakening module is used for determining an awakening judgment result according to the phoneme probability distribution corresponding to the at least one first voice characteristic and the attention characteristic corresponding to the at least one first voice characteristic.
According to a third aspect of the present disclosure, a computer-readable storage medium is provided, which stores a computer program for executing the above-mentioned voice wake-up method.
According to a fourth aspect of the present disclosure, there is provided an electronic apparatus comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is used for reading the executable instruction from the memory and executing the instruction to realize the voice wake-up method.
Compared with the prior art, the voice awakening method, the voice awakening device, the computer readable storage medium and the electronic equipment provided by the disclosure at least have the following beneficial effects:
according to the embodiment of the disclosure, the awakening judgment result is determined by comprehensively considering the phoneme probability distribution output by the first awakening model and the attention feature output by the second awakening model, so that the accuracy of the awakening judgment result is improved, and the false awakening rate is reduced.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in more detail embodiments of the present disclosure with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure and not to limit the disclosure. In the drawings, like reference numbers generally represent like parts or steps.
Fig. 1 is a flowchart illustrating a voice wake-up method according to an exemplary embodiment of the disclosure.
Fig. 2 is a flowchart illustrating step 103 in a voice wake-up method according to an exemplary embodiment of the disclosure.
Fig. 3 is a flowchart illustrating step 104 of a voice wake-up method according to an exemplary embodiment of the disclosure.
Fig. 4 is a flowchart illustrating step 1043 in a voice wake-up method according to an exemplary embodiment of the disclosure.
Fig. 5 is a schematic structural diagram of a voice wake-up apparatus according to an exemplary embodiment of the present disclosure.
Fig. 6 is a schematic structural diagram of a voice wake-up apparatus according to an exemplary embodiment of the present disclosure.
Fig. 7 is a schematic structural diagram of a wake-up unit 5043 in a second schematic structural diagram of a voice wake-up apparatus according to an exemplary embodiment of the present disclosure.
Fig. 8 is a block diagram of an electronic device provided in an exemplary embodiment of the present disclosure.
Detailed Description
Hereinafter, example embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure and not all embodiments of the present disclosure, with the understanding that the present disclosure is not limited to the example embodiments described herein.
Summary of the application
Voice wake-up means that a user wakes up an intelligent terminal by speaking a wake-up word, so that the intelligent terminal enters a state of waiting for a voice instruction or directly executes a predetermined voice instruction. With the development of voice recognition technology, the voice wake-up function is becoming more and more popular. Current voice wake-up models include wake-up models based on a voice recognition framework, wake-up models based on an acoustic model, and wake-up models based on an attention mechanism. Wake-up models based on a voice recognition framework or on an acoustic model are simple to train and computationally light, but their model performance is poor. Wake-up models based on an attention mechanism are likewise simple to train and computationally light, but because the attention mechanism is overconfident in its learned knowledge, their handling of false wake-ups is poor and the false wake-up rate is relatively high.
According to the embodiment of the disclosure, the phoneme probability distribution of each voice feature is output through the first awakening model, the attention feature of a plurality of voice features based on the attention probability distribution is output through the second awakening model, and the phoneme probability distribution and the attention probability distribution are comprehensively considered, so that the awakening judgment result can be more accurately determined, and the false awakening rate is reduced.
Exemplary method
Fig. 1 is a flowchart illustrating a voice wake-up method according to an exemplary embodiment of the disclosure.
The embodiment can be applied to electronic equipment, and specifically can be applied to smart devices, servers or general computers, wherein the smart devices include but are not limited to devices with voice wake-up functions such as mobile phones, sound boxes, automobiles, robots, wearable devices, smart home appliances and the like.
As shown in fig. 1, a voice wake-up method provided in an exemplary embodiment of the present disclosure at least includes the following steps:
step 101, determining at least one first voice feature corresponding to the voice data through a first feature extraction network.
The voice data can be understood as data obtained by preprocessing the original voice data collected by a sound collection device, which removes invalid and redundant voice signals that may exist in the original voice data and improves the efficiency of subsequent processing. The preprocessing includes but is not limited to speech noise reduction, reverberation elimination, speech enhancement, windowing and framing, and feature extraction (extracting the effective information in the voice data); the sound collection device is a device with a sound collection function, such as a microphone. The electronic device may obtain the voice data through a sound collection device configured inside it; alternatively, an external device may collect the original voice data or voice data and send it to the electronic device, or the electronic device may obtain the original voice data or voice data from a connected external storage device, which may include a floppy disk, a removable hard disk, a USB disk, and the like, without limitation here. It should be noted that, as one possible case, each first voice feature may be understood as a single-frame speech feature; as another possible case, each first voice feature is obtained by feature extraction over multiple frames of voice data, thereby using the context information of the frame-level voice data. In practical applications, the duration between the start time and the end time corresponding to a first voice feature is usually between 20 and 30 milliseconds, i.e., it is a frame-level speech feature. Here, the voice data is essentially a multidimensional vector.
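As an illustration of the frame-level view described above, the sketch below splits raw audio samples into overlapping frames. The 25 ms window, 10 ms hop, and 16 kHz sample rate are common defaults used here as assumptions; the text only states that each first voice feature typically spans 20 to 30 milliseconds.

```python
def frame_signal(samples, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split raw audio samples into overlapping frames.

    frame_ms / hop_ms are hypothetical defaults; the description only says
    each first voice feature usually spans 20-30 milliseconds.
    """
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    hop_len = int(sample_rate * hop_ms / 1000)       # samples between frame starts
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop_len)]
```

With these defaults, one second of 16 kHz audio yields 98 frames of 400 samples each; each frame would then be turned into one multidimensional feature vector.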
The first voice feature may be understood as a multidimensional vector obtained by the first feature extraction network performing feature extraction on the voice data, where the multidimensional vector is used to characterize (i.e., represent) the first voice feature. For example, the first voice feature may be an Fbank feature, an MFCC feature, or a PCEN feature; the methods for extracting Fbank, MFCC, and PCEN features are known in the art and are not described in detail here. It should be understood that Fbank, MFCC, and PCEN features are only examples; in practical applications the first voice feature may be determined in combination with actual requirements. The first voice feature is used to characterize phoneme-level information.
The first feature extraction network may be understood as a model whose input is one multidimensional vector and whose output is another multidimensional vector, implementing feature extraction on the information corresponding to the input multidimensional vector. Those skilled in the art will understand that the embodiments of the present disclosure do not limit the internal structure of the first feature extraction network, which may be a recurrent neural network, a long short-term memory network, or another neural network model.
Specifically, the acquired voice data is input into the first feature extraction network, and feature extraction is performed on the voice data to obtain a plurality of first voice features.
Step 102, determining a phoneme probability distribution corresponding to each of the at least one first voice feature through a first wake-up model.
The first wake-up model may be understood as a model whose input is a plurality of first voice features and whose output is the phoneme probability distribution corresponding to each first voice feature; for example, it may be a Deep Neural Network (DNN) with multiple hidden layers. In some possible cases, the first wake-up model has no feature extraction function of its own and only operates on the plurality of first voice features to obtain the phoneme probability distributions corresponding to them. It should be understood that the first wake-up model calculates a phoneme probability distribution for each first voice feature separately.
As a feasible implementation manner, when determining a phoneme probability distribution corresponding to a certain speech feature, a plurality of first speech features before the first speech feature and a plurality of first speech features after the first speech feature need to be input into the first wake-up model, and a context corresponding to the first speech feature is considered, so that it is ensured that the phoneme probability distribution corresponding to the first speech feature output by the first wake-up model has a relatively high reference value. As another possible implementation manner, when determining a phoneme probability distribution corresponding to a certain speech feature, the speech feature is input into the first wake-up model, and the first wake-up model outputs the phoneme probability distribution corresponding to the speech feature.
For each first voice feature, the phoneme probability distribution indicates the probability that the first voice feature matches each of a preset number of example phonemes. The example phonemes are all the phonemes that can currently be enumerated; taking Chinese as an example, the example phonemes may be the initials and finals, 83 in total. The matching probability value indicates the likelihood that the first voice feature matches the corresponding example phoneme. On the one hand, the phoneme probability distribution may be characterized by a one-hot encoding vector, which represents the matching relationship between the first voice feature and all the example phonemes. For example, if the number of example phonemes is n and a certain first voice feature matches the third example phoneme in the sequence of n example phonemes, the one-hot vector corresponding to that first voice feature is (0, 0, 1, 0, ..., 0), where the 1 is followed by n-3 zeros. On the other hand, the phoneme probability distribution may consist of the matching probability values between a certain first voice feature and each example phoneme: again taking the number of example phonemes as n, the matching probability value between the first voice feature and each example phoneme is calculated, yielding n matching probability values that are concatenated to form the phoneme probability distribution. It should be noted that the embodiments of the present disclosure do not specifically limit how the phoneme probability distribution is expressed.
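The two representations described above can be sketched as follows. The phoneme count of 83 follows the Mandarin example; the function names are illustrative, and the use of a softmax to produce the soft distribution is an assumption, since the text does not specify how the per-phoneme probabilities are normalized.

```python
import numpy as np

N_PHONEMES = 83  # Mandarin initials and finals, per the example above

def one_hot_distribution(matched_index, n=N_PHONEMES):
    """First representation: a one-hot vector marking the single matched phoneme."""
    dist = np.zeros(n)
    dist[matched_index] = 1.0
    return dist

def soft_distribution(scores):
    """Second representation: a matching probability for every example phoneme,
    obtained here by a softmax over raw per-phoneme scores (an assumption)."""
    scores = np.asarray(scores, dtype=float)
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()
```

Either representation yields one vector of length n per first voice feature; a sequence of such vectors forms the phoneme probability distribution sequence used later for decoding.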
Specifically, after a plurality of first voice features are obtained, the plurality of first voice features are input into a first awakening model, the first awakening model performs voice recognition, and phoneme probability distribution corresponding to each first voice feature is output.
And 103, determining attention characteristics corresponding to the at least one first voice characteristic through the second awakening model.
It should be noted that, when the voice duration corresponding to the plurality of first voice features satisfies a preset threshold, the second wake-up model processes the plurality of first voice features to determine the attention feature corresponding to them; in other words, the plurality of first voice features are input into the second wake-up model together to obtain the attention feature. The voice duration satisfying the preset threshold means that it is not smaller than the preset threshold, so that the plurality of first voice features can contain the wake-up word. The preset threshold is not limited in the embodiments of the present disclosure and needs to be determined in combination with the actual situation; optionally, it may be selected from 1 to 2 seconds, for example 1.5 seconds. The voice duration may be understood as the duration between the start time and the end time corresponding to the plurality of first voice features, where the start time is the earliest time among the plurality of first voice features and the end time is the latest time among them. For example, suppose the number of first voice features is m, and the i-th first voice feature starts at time t_is and ends at time t_ie. The earliest time of the m first voice features is determined from t_1s to t_ms, and the latest time is determined from t_1e to t_me. Assuming the earliest time is t_1s and the latest time is t_me, the voice duration corresponding to the m first voice features is the time difference between t_1s and t_me.
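The duration computation and threshold check just described can be sketched as follows; the 1.5 s default mirrors the optional example above, and the function names are illustrative.

```python
def speech_duration(intervals):
    """intervals: list of (start_s, end_s) pairs, one per first voice feature.
    The voice duration is the latest end time minus the earliest start time."""
    earliest = min(start for start, _ in intervals)
    latest = max(end for _, end in intervals)
    return latest - earliest

def long_enough_for_attention(intervals, threshold_s=1.5):
    """True when the features span enough speech to contain the wake-up word
    (threshold_s follows the optional 1.5 s example in the text)."""
    return speech_duration(intervals) >= threshold_s
```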
Optionally, the speech durations corresponding to the plurality of first speech features input by the first wake-up model and the second wake-up model are the same, however, the phoneme probability distribution corresponding to a certain speech feature is output by the first wake-up model, and the attention feature corresponding to the plurality of speech features is output by the second wake-up model.
The second wake-up model may be understood as a model in which the input is a plurality of first speech features and the output is attention features, and the second wake-up model is an attention-based model, and it should be understood that this disclosure is not intended to limit the internal structure of the second wake-up model, and any model in which a plurality of speech features are input and attention features corresponding to the plurality of speech features are output is applicable to this disclosure, and alternatively, the second wake-up model may be an attention-based encoding-decoding model.
The plurality of first voice features correspond to one attention feature. The attention feature indicates the degree of influence of the plurality of first voice features on the wake-up word, and is the characterization vector of the whole word or whole sentence corresponding to the plurality of first voice features.
Specifically, when the voice durations corresponding to the first voice features meet a preset threshold, the first voice features are input into the second wake-up model, and attention features corresponding to the first voice features are obtained.
Optionally, the first feature extraction network and the first wake-up model are trained together, so that the first feature extraction network can obtain frame-level speech features; the second wake-up model is then trained according to the network structure and network parameters of the first feature extraction network, which ensures the performance of the second wake-up model while meeting the memory and computation constraints of the electronic device.
Here, training the first feature extraction network and the first wake-up model together can be understood as training a preset model to obtain a trained model, taking the feature extraction network within the trained model as the first feature extraction network, while the first wake-up model is the part of the trained model that takes the output of the first feature extraction network as input and the phoneme probability distribution as output. Training the second wake-up model according to the network structure and network parameters of the first feature extraction network can be understood as performing model training with the output of the first feature extraction network as the model input, and taking the trained model as the second wake-up model.
Specifically, during training, the collected sample voice data is used as input and the target phoneme probability distribution corresponding to the sample voice data is used as supervision data, and model training is performed to obtain the first feature extraction network and the first wake-up model. Then, the sample voice data is input into the first feature extraction network to obtain multi-frame sample voice features; multi-frame sample voice features that contain the wake-up word (positive samples) or do not contain it (negative samples) are used as input, the wake-up word is used as supervision data, and model training is performed to obtain the second wake-up model. The model used in training may be a neural network, including those existing and those developed in the future; examples of available existing neural network models include, but are not limited to, Back Propagation (BP) neural networks, Radial Basis Function (RBF) neural networks, Convolutional Neural Networks (CNN), and the like.
It should be noted that the first speech feature is speech information used for determining a phoneme probability distribution, and is relatively abstract information, and it is difficult to train a model by a supervised method, so that the first feature extraction network and the first wake-up model are trained together.
And 104, determining an awakening judgment result according to the phoneme probability distribution corresponding to the at least one first voice characteristic and the attention characteristic corresponding to the at least one first voice characteristic.
Specifically, the phoneme probability distribution corresponding to each first speech feature is decoded, so that the probability that the plurality of first speech features contain the wakeup word is determined, and the wakeup probability based on the phoneme is obtained. The attention feature is decoded, the probability that the first voice features contain the awakening words is determined, the awakening probability based on attention is obtained, and the awakening judgment result is determined through comparison of the awakening probability based on the phonemes and the awakening probability based on the attention. The awakening judgment result comprehensively considers the phoneme probability distribution corresponding to the first voice characteristics and the attention characteristics corresponding to the first voice characteristics, so that the accuracy is relatively high. And the awakening judgment result is used for determining whether to awaken or not.
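A minimal sketch of the final comparison step, assuming a simple rule in which both branch probabilities must clear a threshold; the description says the wake-up determination result comes from comparing the phoneme-based and attention-based wake-up probabilities but does not fix a particular fusion rule, so both the rule and the names here are assumptions.

```python
def wake_decision(phoneme_wake_prob, attention_wake_prob, threshold=0.5):
    """Hypothetical fusion rule: wake up only if both the phoneme-based and
    the attention-based wake-up probabilities reach the threshold."""
    return phoneme_wake_prob >= threshold and attention_wake_prob >= threshold
```

Requiring agreement between the two branches is one way the combined scheme could suppress false wake-ups that a single attention-based model would accept.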
The voice wake-up method provided by the embodiment has at least the following beneficial effects:
according to the embodiment, the awakening judgment result is determined by comprehensively considering the phoneme probability distribution output by the first awakening model and the attention feature output by the second awakening model, meanwhile, the phoneme level information is considered by the first awakening model, and the word level or sentence level information is considered by the second awakening model, so that the model has multi-level information, the accuracy of the awakening judgment result is improved, and the false awakening rate is reduced. Meanwhile, the first awakening model and the second awakening model share the first feature extraction network, and shared phoneme level information is extracted, so that the calculated amount and the model parameter amount are reduced.
Fig. 2 is a flowchart illustrating a step of inputting a plurality of first voice features into a second wake-up model and obtaining attention features corresponding to the plurality of first voice features in the embodiment shown in fig. 1.
As shown in fig. 2, on the basis of the embodiment shown in fig. 1, in an exemplary embodiment of the present disclosure, the step of determining the attention characteristic corresponding to the at least one first speech characteristic through the second wake-up model shown in step 103 may specifically include the following steps:
step 1031, obtaining a second voice feature corresponding to each of the at least one first voice feature through a second feature extraction network in the second wake-up model.
The second feature extraction network may be understood as a model in which one multidimensional vector is input and another multidimensional vector is output, and feature extraction of information corresponding to the input multidimensional vector is implemented, which is equivalent to an encoder.
It should be noted that, going from the first feature extraction network to the second feature extraction network, the extraction moves from frame-level speech information to word-level or sentence-level speech information, so that the second voice feature contains more speech information.
Specifically, the plurality of first voice features are input into the second feature extraction network in the second wake-up model to obtain the second voice features corresponding to them. A second voice feature is the characterization vector of a first voice feature; it represents more speech information and is a vector of higher dimensionality. In practical applications, a weight vector is multiplied with the first voice feature and the result is summed to obtain the second voice feature, where the weight vector is obtained through model training.
Step 1032, obtaining the attention feature corresponding to the at least one first voice feature through a second voice feature corresponding to the at least one first voice feature and an attention mechanism network in the second wake-up model.
For each second voice feature, the attention mechanism network is configured to determine an attention weight (i.e., an attention probability distribution) of that second voice feature with respect to the wake-up word, and the second voice features are weighted and summed with their corresponding attention weights to obtain the attention feature. For example, x_1 ... x_t represent t first voice features, h_1 ... h_t represent the t second voice features obtained after the t first voice features pass through the second feature extraction network, and a_1 ... a_t represent the attention weights (attention probability distribution) obtained for the t second voice features after applying the attention mechanism. The attention feature c_t is h_1 × a_1 + ... + h_t × a_t. The attention weights may be determined by the following formula:
a_t = Softmax(Q K^T)
where Q represents the query vector, K represents the key vector, T denotes the transpose, and Softmax(·) is a function that maps its input to real numbers between 0 and 1. Here, Q and K are obtained from the second voice features by linear transformation: for example, a second voice feature is multiplied by a first weight matrix to obtain the corresponding query vector, and multiplied by a second weight matrix to obtain the corresponding key vector, where the first weight matrix and the second weight matrix are obtained through model training.
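A numpy sketch of the weighted sum and the Softmax(QK^T) weights described above. Taking the last second voice feature h_t as the source of the query, and the square weight-matrix shapes, are assumptions made only for illustration; the description states just that Q and K come from learned linear transformations of the second voice features.

```python
import numpy as np

def softmax(x):
    """Map scores to positive weights that sum to 1."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_feature(H, W_q, W_k):
    """H: (t, d) matrix whose rows are the second voice features h_1..h_t.
    W_q, W_k: learned (d, d) weight matrices (placeholders here).
    Returns the attention weights a_1..a_t and the attention feature c_t."""
    q = H[-1] @ W_q            # query vector from h_t (an assumed choice)
    K = H @ W_k                # one key vector per second voice feature
    a = softmax(K @ q)         # attention weights, a_t = Softmax(Q K^T)
    c = a @ H                  # attention feature: c_t = sum_i a_i * h_i
    return a, c
```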
In this embodiment, second voice features representing word-level or sentence-level information are obtained through the second feature extraction network in the second wake-up model, and the attention feature based on the attention weights corresponding to the plurality of second voice features is determined through the attention mechanism network in the second wake-up model, thereby obtaining the degree of attention given to each second voice feature. The second wake-up model can thus determine an attention feature carrying word-level or sentence-level information from a plurality of first voice features carrying phoneme-level information, which increases the data dimensionality and ensures the accuracy of the wake-up probability subsequently obtained from the attention feature.
Fig. 3 is a flowchart illustrating a step of determining a wake-up determination result according to a phoneme probability distribution corresponding to each of the at least one first speech feature and an attention feature corresponding to the at least one first speech feature in the embodiment shown in fig. 1.
As shown in fig. 3, on the basis of the embodiment shown in fig. 1, in an exemplary embodiment of the present disclosure, the step 104 of determining the awakening determination result according to the phoneme probability distribution corresponding to each of the at least one first speech feature and the attention feature corresponding to the at least one first speech feature may specifically include the following steps:
Step 1041, obtaining a first wake-up probability according to the phoneme probability distribution corresponding to each of the at least one first voice feature.
The first wake-up probability indicates a first likelihood of wake-up: it is an estimate of the probability that the speech data contains the wake-up word, and it typically lies in the range [0, 1].
Specifically, the phoneme probability distributions corresponding to the first speech features are arranged, according to the time order of the first speech features, into a phoneme probability distribution sequence; a target phoneme sequence in the phoneme probability distribution sequence is then determined, and the first wake-up probability is determined based on the target phoneme sequence. In one possible case, the target phoneme sequence is the wake-up phoneme sequence, and the probability of the target phoneme sequence is the first wake-up probability. For example, if the wake-up word is "Xiao Ai Tong Xue", the wake-up phoneme sequence (target phoneme sequence) is "x iao3 ai4 t ong2 x ue2", where the numbers 3, 4, and 2 denote the third, fourth, and second tones of the Mandarin syllables; the probability that this wake-up phoneme sequence occurs in the phoneme probability distribution sequence is determined and taken as the first wake-up probability. In another possible case, for each phoneme probability distribution, the phoneme with the maximum probability value in that distribution is determined, and the target phoneme sequence is composed of those maximum-probability phonemes, one per phoneme probability distribution. In a possible implementation, the similarity between this target phoneme sequence and the wake-up phoneme sequence is calculated, and the similarity is determined as the first wake-up probability.
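The first possible case above, taking the probability that the wake-up phoneme sequence occurs in the phoneme probability distribution sequence as the first wake-up probability, can be sketched as follows. This is an illustration under a strong simplifying assumption (one distribution per wake-up phoneme, i.e. frames already aligned); a real system would search over alignments:

```python
def wake_sequence_probability(phoneme_dists, wake_ids):
    """Probability of the wake-up phoneme sequence under per-frame
    phoneme probability distributions.

    phoneme_dists: one dict {phoneme_id: probability} per frame.
    wake_ids: wake-up phoneme sequence, e.g. ids for "x iao3 ai4 t ong2 x ue2".
    Assumes the frames are aligned one-to-one with the wake-up phonemes,
    so the sequence probability is a simple product of per-frame terms.
    """
    if len(phoneme_dists) != len(wake_ids):
        raise ValueError("this sketch assumes one frame per wake-up phoneme")
    prob = 1.0
    for dist, pid in zip(phoneme_dists, wake_ids):
        prob *= dist.get(pid, 0.0)   # probability of the expected phoneme
    return prob
```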
As another feasible implementation, a word-level acoustic model is constructed, where the word-level acoustic model determines the probability of each word in the speech waveform; a word sequence corresponding to the target phoneme sequence is obtained based on the word-level acoustic model, the similarity between this word sequence and the wake-up word sequence is calculated, and the similarity is determined as the first wake-up probability. In this implementation, a plurality of consecutive first speech features need to be spliced, and the spliced features are input into the word-level acoustic model for recognition.
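The similarity-based variants above (between a target phoneme sequence and the wake-up phoneme sequence, or between a recognized word sequence and the wake-up word sequence) can be sketched with a normalized edit-distance similarity. The greedy per-frame argmax and the Levenshtein similarity below are illustrative choices, not prescribed by the text:

```python
def first_wake_probability(phoneme_dists, wake_ids):
    """Build the target phoneme sequence by taking the highest-probability
    phoneme per frame (merging adjacent repeats), then score its similarity
    to the wake-up phoneme sequence as 1 - normalized Levenshtein distance.

    phoneme_dists: one dict {phoneme_id: probability} per frame.
    wake_ids: the wake-up phoneme sequence.
    """
    target = []
    for dist in phoneme_dists:
        p = max(dist, key=dist.get)          # phoneme with maximum probability
        if not target or target[-1] != p:    # merge adjacent repeats
            target.append(p)
    m, n = len(target), len(wake_ids)
    # standard Levenshtein dynamic program
    d = [[i + j if i * j == 0 else 0 for j in range(n + 1)] for i in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                   # deletion
                          d[i][j - 1] + 1,                   # insertion
                          d[i - 1][j - 1] + (target[i - 1] != wake_ids[j - 1]))
    return 1.0 - d[m][n] / max(m, n, 1)
```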
Step 1042, obtaining a second wake-up probability according to the attention feature corresponding to each of the at least one first voice feature.
The second wake-up probability indicates a second likelihood of wake-up: like the first wake-up probability, it is an estimate of the probability that the speech data contains the wake-up word, and it typically lies in the range [0, 1].
Specifically, the attention feature is mapped, through feature mapping, to a second wake-up probability based on the attention probability distribution. As one example, a probability distribution of the attention feature over a plurality of wake-up words is determined, and the maximum probability over those wake-up words is determined as the second wake-up probability. As another example, a decoder is used to determine the second wake-up probability: the decoding result, i.e. the second wake-up probability, is obtained from intermediate information of the decoding process, the attention feature, and historical decoding results.
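The first example above can be sketched as a linear layer followed by Softmax, taking the maximum as the second wake-up probability. The trained layer weights `w_out` and `b_out` are assumptions here, and the decoder-based alternative is not shown:

```python
import numpy as np

def second_wake_probability(c, w_out, b_out):
    """Map the attention feature c (length d) to a probability distribution
    over candidate wake-up words via an assumed trained linear layer
    (w_out: d x num_words, b_out: num_words), and return the maximum
    probability as the second wake-up probability.
    """
    logits = c @ w_out + b_out
    logits -= logits.max()                          # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()   # distribution over words
    return float(probs.max())
```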
Step 1043, determining a wake-up determination result according to the first wake-up probability and the second wake-up probability.
The wake-up determination result is determined by jointly considering the first wake-up probability and the second wake-up probability. Specifically, when the first wake-up probability satisfies a first preset condition and the second wake-up probability satisfies a second preset condition, the wake-up determination result is determined to be wake-up. The first wake-up probability satisfying the first preset condition includes the first wake-up probability being greater than a first preset value, and the second wake-up probability satisfying the second preset condition includes the second wake-up probability being greater than a second preset value; the first preset value and the second preset value need to be determined according to the actual application.
The wake-up determination result obtained in this embodiment jointly considers the first wake-up probability determined from the phoneme probability distributions and the second wake-up probability determined from the attention feature, so its accuracy is relatively high and the false wake-up rate can be reduced.
Fig. 4 is a flowchart illustrating a step of determining a wake-up determination result according to the first wake-up probability and the second wake-up probability in the embodiment shown in fig. 3.
As shown in fig. 4, on the basis of the embodiment shown in fig. 3, in an exemplary embodiment of the present disclosure, the step 1043 of determining the wake-up determination result according to the first wake-up probability and the second wake-up probability may specifically include the following steps:
step 10431, when the second wake-up probability meets a second preset condition, determining whether the first wake-up probability meets a first preset condition.
Because a wake-up determination made from the second wake-up probability alone has a relatively high false wake-up rate, while one made from the first wake-up probability has a relatively low false wake-up rate, it is first judged whether the second wake-up probability satisfies the second preset condition; if not, the wake-up determination result is not-wake-up, and if so, it is further judged whether the first wake-up probability satisfies the first preset condition, thereby reducing the false wake-up rate.
Step 10432, when the first wake-up probability satisfies a first preset condition, determining that a wake-up determination result is wake-up.
When the first wake-up probability satisfies the first preset condition, the wake-up determination result can be determined to be wake-up. In this way, the wake-up result determined from the second wake-up probability is verified by the first wake-up probability, which ensures the accuracy of the wake-up determination result.
In this embodiment, the wake-up result determined from the second wake-up probability is verified by the first wake-up probability, which ensures the accuracy of the wake-up determination result and reduces the false wake-up rate.
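The cascade in steps 10431 and 10432 amounts to a short-circuit check: reject on the second wake-up probability first, then verify with the first. A minimal sketch, with the two preset values left as placeholder thresholds since the text says they are application-dependent:

```python
def wake_decision(first_prob, second_prob, first_preset=0.5, second_preset=0.5):
    """Cascaded wake-up determination: if the second wake-up probability does
    not exceed the second preset value, the result is no wake-up; otherwise
    the first wake-up probability is checked against the first preset value.
    The 0.5 defaults are placeholders, not values from the text.
    """
    if second_prob <= second_preset:   # second preset condition not met
        return False
    return first_prob > first_preset   # verify with the first wake-up probability
```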
Exemplary devices
Based on the same concept as the embodiment of the method, the embodiment of the disclosure also provides a voice awakening device.
Fig. 5 shows a first schematic structural diagram of a voice wake-up apparatus according to an exemplary embodiment of the present disclosure.
As shown in fig. 5, an exemplary embodiment of the present disclosure provides a voice wake-up apparatus, including:
a feature extraction module 501, configured to determine at least one first voice feature corresponding to the voice data through a first feature extraction network;
a first processing module 502, configured to determine, through a first wake-up model, a phoneme probability distribution corresponding to each of the at least one first speech feature;
a second processing module 503, configured to determine, through a second wake-up model, an attention feature corresponding to the at least one first speech feature;
the waking module 504 is configured to determine a waking judgment result according to the phoneme probability distribution corresponding to the at least one first speech feature and the attention feature corresponding to the at least one first speech feature.
Fig. 6 is a schematic structural diagram of a voice wake-up apparatus according to an exemplary embodiment of the present disclosure.
As shown in fig. 6, in an exemplary embodiment, the second processing module 503 includes:
a feature extraction unit 5031, configured to obtain, through a second feature extraction network in the second wake-up model, second voice features corresponding to the at least one first voice feature respectively;
an attention unit 5032, configured to obtain an attention feature corresponding to the at least one first speech feature through a second speech feature corresponding to each of the at least one first speech feature and an attention mechanism network in the second wake model.
As shown in fig. 6, in an exemplary embodiment, the wake-up module 504 includes:
a first probability determining unit 5041, configured to obtain a first wake-up probability according to a phoneme probability distribution corresponding to each of the at least one first speech feature;
a second probability determining unit 5042, configured to obtain a second wake-up probability according to the attention feature corresponding to the at least one first voice feature;
a wake-up unit 5043, configured to determine a wake-up determination result according to the first wake-up probability and the second wake-up probability.
Fig. 7 is a schematic structural diagram of a wake-up unit 5043 in a second schematic structural diagram of a voice wake-up apparatus according to an exemplary embodiment of the present disclosure.
As shown in fig. 7, in an exemplary embodiment, the wake-up unit 5043 includes:
a determining subunit 50431, configured to determine whether the first wake-up probability satisfies a first preset condition when the second wake-up probability satisfies a second preset condition;
a wake-up subunit 50432, configured to determine that the wake-up determination result is wake-up when the first wake-up probability satisfies a first preset condition.
Exemplary electronic device
FIG. 8 illustrates a block diagram of an electronic device in accordance with an embodiment of the disclosure.
As shown in fig. 8, an electronic device 800 includes one or more processors 801 and memory 802.
The processor 801 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 800 to perform desired functions.
Memory 802 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 801 to implement the voice wake-up methods of the various embodiments of the present disclosure described above and/or other desired functions.
In one example, the electronic device 800 may further include: an input device 803 and an output device 804, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
Of course, for simplicity, only some of the components of the electronic device 800 relevant to the present disclosure are shown in fig. 8, omitting components such as buses, input/output interfaces, and the like. In addition, electronic device 800 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the above-described methods and apparatus, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the voice wake-up method according to various embodiments of the present disclosure described in the "exemplary methods" section of this specification above.
The computer program product may write program code for carrying out operations for embodiments of the present disclosure in any combination of one or more programming languages, including an object-oriented programming language such as Java, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps in the voice wake-up method according to various embodiments of the present disclosure described in the "exemplary methods" section above in this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present disclosure are merely examples and are not limiting, and they should not be considered essential to the embodiments of the present disclosure. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the disclosure is not intended to be limited to the specific details so described.
The block diagrams of devices, apparatuses, and systems referred to in this disclosure are given only as illustrative examples and are not intended to require or imply that connections, arrangements, or configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, and configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The words "or" and "and" as used herein mean, and are used interchangeably with, the word "and/or," unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
It is also noted that in the apparatus, devices and methods of the present disclosure, various components or steps may be decomposed and/or re-combined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (10)

1. A voice wake-up method, comprising:
determining at least one first voice feature corresponding to the voice data through a first feature extraction network;
determining a phoneme probability distribution corresponding to each of the at least one first voice feature through a first wake-up model;
determining attention characteristics corresponding to the at least one first voice characteristic through a second awakening model;
and determining an awakening judgment result according to the phoneme probability distribution corresponding to the at least one first voice characteristic and the attention characteristic corresponding to the at least one first voice characteristic.
2. The method of claim 1, wherein the first feature extraction network and the first wake-up model are trained together;
and the second awakening model is obtained by training based on the network structure and the network parameters in the first feature extraction network.
3. The method of claim 1, wherein the determining the attention feature corresponding to the at least one first speech feature by the second wake-up model comprises:
acquiring a second voice feature corresponding to each of the at least one first voice feature through a second feature extraction network in a second awakening model;
and acquiring the attention feature corresponding to the at least one first voice feature through a second voice feature corresponding to the at least one first voice feature and an attention mechanism network in the second wake-up model.
4. The method according to claim 1, wherein the second wake-up model processes the at least one first voice feature when a voice duration corresponding to the at least one first voice feature satisfies a preset threshold.
5. The method according to claim 1, wherein the determining the wake-up determination result according to the phoneme probability distribution corresponding to each of the at least one first voice feature and the attention feature corresponding to the at least one first voice feature comprises:
acquiring a first awakening probability according to the phoneme probability distribution corresponding to the at least one first voice feature;
acquiring a second awakening probability according to the attention feature corresponding to the at least one first voice feature;
and determining a wakeup judgment result according to the first wakeup probability and the second wakeup probability.
6. The method of claim 5, wherein the determining a wake up determination based on the first wake up probability and the second wake up probability comprises:
and when the first awakening probability meets a first preset condition and the second awakening probability meets a second preset condition, determining that the awakening judgment result is awakening.
7. The method of claim 6, wherein the determining that the wake-up determination result is wake-up when the first wake-up probability satisfies a first preset condition and the second wake-up probability satisfies a second preset condition comprises:
when the second awakening probability meets a second preset condition, judging whether the first awakening probability meets a first preset condition;
and when the first awakening probability meets a first preset condition, determining that the awakening judgment result is awakening.
8. A voice wake-up apparatus comprising:
the feature extraction module is used for determining at least one first voice feature corresponding to the voice data through a first feature extraction network;
the first processing module is used for determining phoneme probability distribution corresponding to the first voice characteristics through a first awakening model;
the second processing module is used for determining attention characteristics corresponding to the at least one first voice characteristic through a second awakening model;
and the awakening module is used for determining an awakening judgment result according to the phoneme probability distribution corresponding to the at least one first voice characteristic and the attention characteristic corresponding to the at least one first voice characteristic.
9. A computer-readable storage medium, which stores a computer program for executing the voice wake-up method according to any one of the preceding claims 1 to 7.
10. An electronic device, the electronic device comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the voice wake-up method according to any one of claims 1 to 7.
CN202011453041.8A 2020-12-08 2020-12-08 Voice wake-up method and device, readable storage medium and electronic equipment Active CN112669818B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011453041.8A CN112669818B (en) 2020-12-08 2020-12-08 Voice wake-up method and device, readable storage medium and electronic equipment


Publications (2)

Publication Number Publication Date
CN112669818A true CN112669818A (en) 2021-04-16
CN112669818B CN112669818B (en) 2022-12-02

Family

ID=75402448


Country Status (1)

Country Link
CN (1) CN112669818B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113838462A (en) * 2021-09-09 2021-12-24 北京捷通华声科技股份有限公司 Voice wake-up method and device, electronic equipment and computer readable storage medium
CN114724544A (en) * 2022-04-13 2022-07-08 北京百度网讯科技有限公司 Voice chip, voice recognition method, device and equipment and intelligent automobile
CN113838462B (en) * 2021-09-09 2024-05-10 北京捷通华声科技股份有限公司 Voice wakeup method, voice wakeup device, electronic equipment and computer readable storage medium

Citations (6)

Publication number Priority date Publication date Assignee Title
CN109273007A (en) * 2018-10-11 2019-01-25 科大讯飞股份有限公司 Voice awakening method and device
CN110047485A (en) * 2019-05-16 2019-07-23 北京地平线机器人技术研发有限公司 Identification wakes up method and apparatus, medium and the equipment of word
CN110415699A (en) * 2019-08-30 2019-11-05 北京声智科技有限公司 A kind of judgment method, device and electronic equipment that voice wakes up
US20200013390A1 (en) * 2017-06-29 2020-01-09 Alibaba Group Holding Limited Speech wakeup method, apparatus, and electronic device
CN111223488A (en) * 2019-12-30 2020-06-02 Oppo广东移动通信有限公司 Voice wake-up method, device, equipment and storage medium
CN111883117A (en) * 2020-07-03 2020-11-03 北京声智科技有限公司 Voice wake-up method and device




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant