CN117174082A - Training and executing method, device, equipment and storage medium of voice wake-up model - Google Patents
Training and executing method, device, equipment and storage medium of voice wake-up model
- Publication number: CN117174082A
- Application number: CN202311213749.XA
- Authority: CN (China)
- Prior art keywords: wake, word, branch, output, stop point
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The application provides a training method, an execution method, an apparatus, a device and a storage medium for a voice wake-up model. The voice wake-up model comprises a neural network structure, a wake-up word classification branch and a start-stop point judgment branch. The training method comprises the following steps: acquiring an audio sample and performing acoustic feature processing on it to obtain an acoustic feature frame; inputting the acoustic feature frame into the neural network structure to generate an output vector, wherein the output vector has a first attribute and/or a second attribute, the first attribute being whether a wake-up word is contained and the second attribute being the start point position and end point position of the wake-up word in the output vector; and training the wake-up word classification branch and the start-stop point judgment branch according to the output vector. The application can improve the detection accuracy of wake-up words.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular, to a training method, an execution method, an apparatus, a device, and a storage medium for a voice wake-up model.
Background
With the development of computer technology, more and more intelligent devices support voice interaction. For example, smart speakers, smart televisions and automobiles are widely equipped with voice interaction functions, allowing users to control and operate the device directly by voice, which is highly convenient. During a voice interaction, a wake-up word is typically detected as the beginning of a round of interaction; once the wake-up word is triggered, the device actively collects the user's voice instructions. To reduce accidental triggering caused by misrecognized wake-up words, many schemes add a second-level wake-up word check in the cloud on top of the primary check on the local device, which requires the wake-up word to be intercepted accurately.
In the conventional technology, common practices for intercepting wake words include:
First, a fixed-length segment of audio preceding the wake-up word trigger position is intercepted. However, because users speak at different speeds, the fixed length is generally set according to slow speech, so a quickly spoken wake-up word may cause a large amount of irrelevant audio to be intercepted, affecting the accuracy of the secondary wake-up verification.
Second, voice activity detection (Voice Activity Detection, VAD) is used to locate the starting point of the wake-up word. However, this method only works in relatively quiet environments; in noisy environments, VAD performance drops rapidly and the starting position of the wake-up word cannot be obtained reliably.
Accordingly, with the rapid development of voice interaction functions, there is a need for a voice wake-up technology that is accurate and capable of efficient operation.
Disclosure of Invention
The application aims to solve at least one of the technical problems in the prior art or related technologies. To this end, the application provides a training method, an execution method, an apparatus, a device and a storage medium for a voice wake-up model.
According to a first aspect of the present application, there is provided a training method of a voice wake model, wherein the voice wake model includes a neural network structure, a wake word classification branch and a start-stop point judgment branch, the training method including:
Acquiring an audio sample, and performing acoustic feature processing on the audio sample to obtain an acoustic feature frame;
inputting the acoustic feature frame into a neural network structure to generate an output vector, wherein the output vector has a first attribute and/or a second attribute, the first attribute is whether a wake-up word is contained or not, and the second attribute is a starting point position and an ending point position of the wake-up word in the output vector; and
training the wake-up word classification branch and the start-stop point judgment branch according to the output vector respectively, wherein the training comprises the following steps:
respectively inputting the output vector to a wake-up word classification branch and a start-stop point judgment branch, outputting the probability that the audio sample contains the wake-up word by the wake-up word classification branch, and outputting the start point position and the end point position of the wake-up word in the audio sample by the start-stop point judgment branch; and
and respectively adjusting the parameters of the wake-up word classification branch and the parameters of the start-stop point judgment branch by using the output result of the wake-up word classification branch and the output result of the start-stop point judgment branch, so as to obtain an updated wake-up word classification branch and an updated start-stop point judgment branch.
As one embodiment of the present application, the wake-up word classification branch includes at least a first fully-connected layer structure connected to the neural network structure. The output vector of the neural network structure carries first tag data for the wake-up word classification branch; the first tag data relates to the first attribute, characterizes whether the output vector contains the wake-up word, and is input to the first fully-connected layer structure as a first supervised learning target.
As one embodiment of the present application, adjusting parameters of the wake word classification branch using output results of the wake word classification branch includes: determining a first loss function of the wake-up word classification branch according to the first attribute of the output vector and the output result of the wake-up word classification branch; and adjusting parameters of the wake word classification branch based on the first loss function.
As one embodiment of the present application, adjusting parameters of the wake word classification branch based on the first loss function includes: and iteratively updating parameters of the wake-up word classification branches according to the calculated value of the first loss function until convergence to obtain updated wake-up word classification branches.
As one embodiment of the application, the first loss function of the wake word classification branch is cross entropy.
As an embodiment of the present application, the calculation formula of the first loss function is:

L₁ = −[ŷ·log(y) + (1 − ŷ)·log(1 − y)]

wherein ŷ is the wake-up word label: ŷ = 1 when the output vector contains the wake-up word, and ŷ = 0 when the output vector does not contain the wake-up word; y is the wake-up word probability output by the wake-up word classification branch, with y in the range [0, 1].
As one embodiment of the present application, the start-stop point judgment branch at least includes a second fully-connected layer structure connected to the neural network structure, and the output vector of the neural network structure includes second tag data for the start-stop point judgment branch, the second tag data relates to a second attribute, and the second tag data is used for characterizing a start point position and an end point position of the wake-up word in the output vector and is input to the second fully-connected layer structure as a second supervised learning target.
As one embodiment of the present application, adjusting the parameters of the start-stop point judgment branch using the output result of the start-stop point judgment branch includes: determining a second loss function of the start-stop point judgment branch according to the second attribute of the output vector and the output result of the start-stop point judgment branch; and adjusting parameters of the start and stop point judgment branch based on the second loss function.
As one embodiment of the present application, adjusting the parameters of the start-stop point judgment branch based on the second loss function includes: and iteratively updating parameters of the start and stop point judgment branch according to the calculated value of the second loss function until convergence to obtain an updated start and stop point judgment branch.
As an embodiment of the present application, the second loss function of the start-stop point judgment branch is a mean square error between the output start point position and end point position and the actual start point position and end point position.
As an embodiment of the present application, the calculation formula of the second loss function is:

L₂ = (s1 − k1)² + (s2 − k2)²

wherein L₂ is the mean square error value; s1 is the true start point position of the wake-up word in the output vector; s2 is the true end point position of the wake-up word in the output vector; k1 is the start point position output by the start-stop point judgment branch; and k2 is the end point position output by the start-stop point judgment branch.
As an embodiment of the present application, the training method further includes: and adjusting parameters of the neural network structure by utilizing the output result of the wake-up word classification branch and/or the output result of the start-stop point judgment branch so as to obtain an updated neural network structure.
As one embodiment of the present application, performing feature processing on the audio sample includes: performing endpoint detection on the wake-up word in the audio sample, and intercepting data frames according to a fixed-length time window from the start point of the wake-up word to the end point of the wake-up word, so as to obtain a plurality of data frames of the same duration, the plurality of data frames being in the form of one-dimensional data arrays; and extracting audio spectrum features from the one-dimensional arrays and outputting a two-dimensional feature array.
As one embodiment of the application, in response to the intercepted data frames including a plurality of the wake-up words, the data frame including the first wake-up word is retained.
As one embodiment of the present application, inputting the acoustic feature frame into the neural network structure to generate the output vector includes: inputting the two-dimensional feature array into a neural network structure, and generating a one-dimensional vector through a convolution algorithm.
According to a second aspect of the present application, there is also provided a method for executing a voice wake model, including:
Acquiring voice data to be detected;
inputting the voice data to be detected into a voice wake-up model trained by any of the training methods described above;
obtaining the output result of the wake-up word classification branch and the output result of the start-stop point judgment branch; and
and outputting the wake-up result of the voice wake-up model based on the output result of the wake-up word classification branch and/or the output result of the start-stop point judgment branch.
As an embodiment of the present application, the performing method further includes: and sending the output result of the wake-up word classification branch and/or the output result of the start and stop point judgment branch to a voice recognition cloud platform to perform secondary recognition of the wake-up word.
According to a third aspect of the present application, there is also provided a training device for a voice wake model, the voice wake model including a neural network structure, a wake word classification branch and a start-stop point judgment branch, the training device comprising:
the acquisition module is used for acquiring an audio sample and carrying out acoustic feature processing on the audio sample to obtain an acoustic feature frame;
the generating module is used for inputting the acoustic feature frame into the neural network structure to generate an output vector, wherein the output vector has a first attribute and/or a second attribute, the first attribute is whether a wake-up word is contained or not, and the second attribute is a starting point position and an ending point position of the wake-up word in the output vector; and
The training module trains the wake-up word classification branch and the start-stop point judgment branch respectively according to the output vector, including:
respectively inputting the output vector to a wake-up word classification branch and a start-stop point judgment branch, outputting the probability that the output vector contains the wake-up word by the wake-up word classification branch, and outputting the start point position and the end point position of the wake-up word in the output vector by the start-stop point judgment branch; and
and respectively adjusting the parameters of the wake-up word classification branch and the parameters of the start-stop point judgment branch by using the output result of the wake-up word classification branch and the output result of the start-stop point judgment branch so as to obtain an updated voice wake-up model.
According to a fourth aspect of the present application, there is also provided a voice wake-up device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing any of the training methods described above and/or any of the execution methods described above when executing the computer program.
According to a fifth aspect of the present application, there is also provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements any of the training methods described above and/or any of the execution methods described above.
According to a sixth aspect of the present application, there is also provided a computer program which, when executed by a processor, implements any of the training methods described above and/or any of the performing methods described above.
The training method of the voice wake-up model of the application adopts multitask training: a wake-up word classification branch and a start-stop point judgment branch are arranged in parallel at the end of the wake-up word detection network, and the wake-up word classification task and the wake-up word start-stop point regression task are trained simultaneously, so that the model outputs whether the audio sample contains the wake-up word and, at the same time, the start point position and end point position of the wake-up word in the audio sample. The voice wake-up model obtained by this training can therefore obtain the start point and end point of the wake-up word quickly and accurately, which helps to intercept the wake-up word in the audio frame more precisely. In addition, because the classification task and the start-stop point regression task are trained simultaneously, the effect of the neural network structure is enhanced; compared with a single training task, the accuracy of wake-up word detection can be further improved.
In addition, the voice wake-up execution method provided by the embodiment of the application is based on the trained voice wake-up model and can output the start point position and end point position of the wake-up word in the audio sample while outputting whether the audio sample contains the wake-up word, which significantly improves the accuracy of wake-up word detection and greatly improves voice wake-up performance. Because the application can intercept the wake-up word in the audio frame more accurately, it is particularly advantageous in scenarios requiring secondary verification in the cloud.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application; other drawings can be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a schematic structural diagram of a voice wake-up model according to an embodiment of the present application.
Fig. 2 is a flowchart of a training method of a voice wake-up model according to an embodiment of the present application.
Fig. 3 is a flowchart illustrating a training method of wake word classification branches in a voice wake model according to an embodiment of the present application.
Fig. 4 is a flowchart of a training method of a start-stop point judgment branch of a voice wake-up model according to an embodiment of the present application.
Fig. 5 is another flow chart of a training method of a voice wake-up model according to an embodiment of the present application.
Fig. 6 is a flowchart illustrating a method for executing a voice wake-up model according to an embodiment of the present application.
Fig. 7 is a schematic structural diagram of a training device for a voice wake-up model according to an embodiment of the present application.
Fig. 8 is a schematic structural diagram of a voice wake-up device according to an embodiment of the present application.
Detailed Description
The following describes the embodiments of the present application clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
The embodiment of the application provides a training method and an executing method of a voice awakening model, which are applied to the technical fields of machine learning and voice in the artificial intelligence field, and particularly can be applied to scenes needing voice interaction, such as intelligent sound box voice interaction, intelligent vehicle-mounted voice interaction, intelligent television voice interaction and the like, so as to improve the detection accuracy of awakening words.
In order to facilitate understanding of the technical scheme provided by the application, an exemplary application scenario of the embodiment of the application is described below.
The training method of the voice wake-up model provided by the embodiment of the application can be executed by the training device of the voice wake-up model provided by the embodiment of the application, and the training device of the voice wake-up model provided by the embodiment of the application can be a terminal device or a server. The training method of the voice wake-up model provided by the embodiment of the application can be applied to the terminal equipment, for example, the training method can be realized through a processor, an application program or a webpage in the terminal equipment, and the terminal equipment and the server have data communication, so that the embodiment of the application is not limited. The embodiment of the application does not limit the specific type of the terminal equipment, and for example, the terminal equipment can be a smart phone, a personal computer, a tablet personal computer, a wearable device, a vehicle-mounted terminal and the like. The embodiment of the application does not limit the types and the number of the servers, for example, the servers can be single independent servers or can be server clusters, and the embodiment of the application is only taken as an example and is not limited to the example.
After training the voice wake model, the voice wake model may be applied to a terminal device having a voice recognition function requirement to perform the voice wake function. For example, the voice wake-up model may be applied to terminal devices such as a smart speaker, a smart home device, a smart phone, a vehicle-mounted terminal, and a wearable device, which is not limited in the embodiment of the present application.
Fig. 1 is a schematic structural diagram of a voice wake-up model according to an embodiment of the present application. As shown in fig. 1, the voice wake-up model provided by the embodiment of the application includes an acoustic feature processing unit, a neural network structure, a wake-up word classification branch and a start-stop point judgment branch. The acoustic feature processing unit extracts acoustic features from the input audio signal and feeds them into the neural network structure. The output of the neural network structure serves as the input of both the wake-up word classification branch and the start-stop point judgment branch, and is used to simultaneously train the wake-up word / non-wake-up word classification task of the classification branch and the regression task of the judgment branch that calculates the start point and end point of the wake-up word. Advantageously, the two branches share the output of one neural network structure, which helps to save computation cost.
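The two-branch structure can be illustrated with a short sketch. The following is a minimal PyTorch-style model, assuming for illustration a 50×32 MFCC input, a small convolutional backbone and an embedding size of 64; the patent does not fix these layer sizes, so they are placeholders only.

```python
import torch
import torch.nn as nn

class WakeWordModel(nn.Module):
    """Shared backbone with a classification head and a start-stop head."""
    def __init__(self, n_frames=50, n_mfcc=32, embed_dim=64):
        super().__init__()
        # Backbone: conv/pool layers compress the 2-D feature array,
        # then a linear layer produces one shared 1-D embedding.
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(32 * (n_frames // 4) * (n_mfcc // 4), embed_dim),
        )
        # Wake-up word classification branch: probability of the wake word.
        self.cls_head = nn.Sequential(nn.Linear(embed_dim, 1), nn.Sigmoid())
        # Start-stop point judgment branch: (start, end) positions.
        self.reg_head = nn.Linear(embed_dim, 2)

    def forward(self, x):          # x: (batch, 1, n_frames, n_mfcc)
        z = self.backbone(x)       # shared output vector
        return self.cls_head(z), self.reg_head(z)
```

Because both heads read the same embedding z, the backbone is computed once per input, which is the cost saving noted above.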
Fig. 2 is a flowchart of a training method of a voice wake-up model according to an embodiment of the present application.
As shown in fig. 2, the training method of the voice wake model provided by the embodiment of the application includes the following steps:
step S201: and acquiring an audio sample, and performing acoustic feature processing on the audio sample to obtain an acoustic feature frame.
In this step, the audio samples include wake-up word samples and non-wake-up word samples. In some examples, the wake-up word samples may be manually recorded audio containing the wake-up word. The diversity and complexity of the audio samples can be further increased in post-processing, for example by adding noise or changing the position of the wake-up word in the audio, so as to enhance the training effect. The non-wake-up word samples may be audio files that do not contain the wake-up word, such as human speech in everyday scenes (for example audio from films or television series), or may be collected automatically by a computer.
Before training the voice wake model, the type of each audio sample can be marked by a machine recognition or manual recognition mode, namely whether the audio sample contains wake words or not is marked.
Meanwhile, because the training method of this embodiment includes a start-stop point judgment branch, before training the voice wake-up model, the start and stop points of the wake-up word in each wake-up word sample can be accurately marked by machine recognition or manual annotation. The start and stop points of the wake-up word are the starting time point (also called the start point) and the ending time point (also called the end point) of the wake-up word in the audio sample. It will be appreciated that, in some examples, non-wake-up word samples need not be labeled with a start point and an end point.
In this step, performing acoustic feature processing on the audio sample includes: for an input audio sample, intercepting data frames according to a fixed-length time window from the start point of the wake-up word until the end point of the wake-up word is reached, so as to obtain a plurality of data frames of the same duration from the audio sample. The plurality of data frames may take the form of one-dimensional data arrays. In some examples, the start point and end point of the wake-up word may be identified from the audio samples based on voice activity detection (VAD). In some examples, in response to the intercepted data frames including a plurality of the wake-up words, only the data frame including the first wake-up word is retained, thereby avoiding repeated computation, saving computing resources and increasing recognition speed.
Further, acoustic feature extraction is performed on the data frames of the one-dimensional array, and a two-dimensional feature array is output. In some examples, Mel-frequency cepstral coefficients (Mel-Frequency Cepstral Coefficients, MFCC) may be used for acoustic feature extraction and feature enhancement on the one-dimensional data frames, outputting a two-dimensional feature array. Specifically, in MFCC feature extraction, the audio data is first subjected to a fast Fourier transform (FFT), and the spectra of different frequency bands are then filtered and compressed according to the characteristics of human hearing to obtain the MFCC features. In other examples, filter bank features (Filter Bank, FBANK), power-normalized cepstral coefficients (Power-Normalized Cepstral Coefficients, PNCC) and the like may also be used for audio spectrum feature extraction, which is not particularly limited by the present application.
For example, for an input audio sample, starting from the beginning of the audio identified as containing the wake-up word, a data frame is intercepted through a time window every a seconds until the end of the audio containing the wake-up word. Based on this, m data frames of the same length can be intercepted from the audio sample and used, in the form of one-dimensional data arrays, as training samples input to the neural network. Further, MFCC may be used to perform feature extraction and feature enhancement on the input training samples, outputting n×m 32-bit two-dimensional feature arrays as acoustic feature frames for subsequent use.
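As a concrete illustration of this feature-processing step, the sketch below extracts MFCC features with the librosa library; the 16 kHz sample rate and 32 coefficients are assumptions chosen to match the 50×32 arrays of the detailed embodiment below, not values mandated by the application.

```python
import librosa
import numpy as np

def extract_mfcc(wav_path, sr=16000, n_mfcc=32):
    """Turn a one-dimensional audio frame into a 2-D MFCC feature array."""
    y, sr = librosa.load(wav_path, sr=sr)                   # 1-D sample array
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, t)
    return mfcc.T.astype(np.float32)                        # (t, n_mfcc)
```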
Step S202: and inputting the acoustic feature frame into a neural network structure to generate an output vector, wherein the output vector has a first attribute and/or a second attribute, the first attribute is whether a wake-up word is contained or not, and the second attribute is a starting point position and an ending point position of the wake-up word in the output vector.
In this step, the neural network structure may be the backbone structure of the wake-up word recognition network. The backbone structure may be constructed using a convolutional neural network and, in general, may consist of connected convolutional layers and pooling layers. The backbone structure may be a convolutional neural network (convolutional neural network, CNN), a recurrent neural network (recurrent neural network, RNN), an attention network or another type of recognition network structure, to which the present application is not particularly limited.
After the acoustic feature frame is input into the backbone structure, the backbone structure compresses it through a convolution algorithm to generate and output a one-dimensional vector. This one-dimensional vector serves as the common input of the subsequent wake-up word classification branch and start-stop point judgment branch.
It will be appreciated that, as mentioned in step S201, before training the voice wake-up model the audio samples include wake-up word samples and non-wake-up word samples, i.e., each sample is labeled as to whether it contains the wake-up word, and for the wake-up word samples the start point and end point of the wake-up word have been marked. Accordingly, in this step, the output vector of the neural network structure has a first attribute and/or a second attribute, where the first attribute is whether the wake-up word is contained, and the second attribute is the start point position and end point position of the wake-up word in the output vector.
Step S203: and training the wake-up word classification branch and the start-stop point judgment branch according to the output vector.
In this step, the wake-up word classification branch is used to train the classification task of judging whether a sample contains the wake-up word. During training, the wake-up word classification branch uses all audio samples. Specifically, its input is the one-dimensional vector data obtained after processing by the neural network structure, covering both wake-up word and non-wake-up word samples, and it outputs the probability that the sample is a wake-up word and the probability that it is not. It will be appreciated that the two probabilities sum to 1.
The start-stop point judgment branch is used to run the regression task of calculating the start point and end point of the wake-up word and to output the calculated start and stop points. During training, only samples containing the wake-up word are input to the start-stop point judgment branch. Specifically, its input is the one-dimensional vector data obtained after processing by the neural network structure, and it outputs the start point position and end point position of the wake-up word in the sample.
With continued reference to fig. 2, step S203 includes the steps of:
step S2031: respectively inputting output vectors of the neural network structure into a wake-up word classification branch and a start-stop point judgment branch, outputting the probability that an audio sample contains the wake-up word by the wake-up word classification branch, and outputting the start point position and the end point position of the wake-up word in the audio sample by the start-stop point judgment branch; and
step S2032: and respectively adjusting the parameters of the wake-up word classification branch and the parameters of the start-stop point judgment branch by using the output result of the wake-up word classification branch and the output result of the start-stop point judgment branch, so as to obtain an updated wake-up word classification branch and an updated start-stop point judgment branch.
In steps S2031 and S2032, taking the wake-up word classification branch as an example: in some examples it includes at least a first fully-connected layer structure connected to the neural network structure, and the output vector of the neural network structure includes first tag data for the wake-up word classification branch. The first tag data relates to the first attribute, characterizes whether the output vector contains the wake-up word, and is input to the first fully-connected layer structure as a first supervised learning target.
In some examples, referring to fig. 3, which is a flowchart of a training method of a wake word classification branch of a voice wake model according to an embodiment of the present application, the adjusting parameters of the wake word classification branch by using the output result of the wake word classification branch in step S2032 includes:
step S301: and determining a first loss function of the wake-up word classification branch according to the first attribute of the output vector and the output result of the wake-up word classification branch.
In this step, the first loss function of the wake word classification branch is cross entropy.
Step S302: parameters of the wake word classification branch are adjusted based on the first penalty function.
In this step, the parameters of the wake-up word classification branch are iteratively updated according to the calculated value of the first loss function until convergence is reached, so as to obtain an updated wake-up word classification branch.
In one example, the first loss function is calculated as:

L₁ = −[ŷ·log(y) + (1 − ŷ)·log(1 − y)]

wherein ŷ is the wake-up word label: ŷ = 1 when the output vector contains the wake-up word, and ŷ = 0 when it does not; y is the wake-up word probability output by the wake-up word classification branch, in the range [0, 1].
Based on this, the wake-up word probability calculated and output can be compared with the label indicating whether the audio sample contains the wake-up word. On the one hand this can be used to evaluate the training effect; on the other hand it can be used to adjust parameters to optimize the wake-up word classification branch, the branch after parameter optimization being taken as the updated wake-up word classification branch. In this embodiment, the parameters may be adjusted so that the wake-up word probability output by the classification branch approaches or equals the labeled wake-up word tag.
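A minimal sketch of this cross-entropy computation, written to mirror the formula above (a hand-rolled equivalent of torch.nn.functional.binary_cross_entropy):

```python
import torch

def wake_word_loss(y, y_hat):
    """L1 = -(ŷ·log y + (1-ŷ)·log(1-y)), averaged over the batch.

    y:     predicted wake-word probabilities in [0, 1]
    y_hat: labels, 1.0 if the sample contains the wake word else 0.0
    """
    eps = 1e-7
    y = y.clamp(eps, 1 - eps)        # keep log() finite at 0 and 1
    return -(y_hat * torch.log(y) + (1 - y_hat) * torch.log(1 - y)).mean()
```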
In steps S2031 and S2032, taking the start-stop point judgment branch as an example: it includes at least a second fully-connected layer structure connected to the neural network structure, and the output vector of the neural network structure includes second tag data for the start-stop point judgment branch. The second tag data relates to the second attribute, characterizes the start point position and end point position of the wake-up word in the output vector, and is input to the second fully-connected layer structure as a second supervised learning target.
In some examples, referring to fig. 4, which is a flowchart of a training method of a start-stop point judgment branch of a voice wake-up model according to an embodiment of the present application, the adjusting parameters of the start-stop point judgment branch by using an output result of the start-stop point judgment branch in step S2032 includes:
step S401: and determining a second loss function of the start-stop point judgment branch according to the second attribute of the output vector and the output result of the start-stop point judgment branch.
In this step, the second loss function of the start-stop point judgment branch is the mean square error between the output start point position and end point position and the actual start point position and end point position.
Step S402: and adjusting parameters of the start and stop point judgment branch based on the second loss function.
In this step, the parameters of the start-stop point judgment branch are iteratively updated according to the calculated value of the second loss function until convergence is reached, so as to obtain an updated start-stop point judgment branch.
In one example, the second loss function is calculated as:

L₂ = (s1 − k1)² + (s2 − k2)²

wherein L₂ is the mean square error value; s1 is the true start point position of the wake-up word in the output vector; s2 is the true end point position of the wake-up word in the output vector; k1 is the start point position output by the start-stop point judgment branch; and k2 is the end point position output by the start-stop point judgment branch.
Based on this, the start point and end point of the wake-up word calculated and output can be compared with the start and end point information marked in the audio sample. On the one hand this can be used to evaluate the training effect; on the other hand it can be used to adjust parameters to optimize the start-stop point judgment branch, the branch after parameter optimization being taken as the updated start-stop point judgment branch. In this embodiment, the parameters may be adjusted so that the start point position k1 and end point position k2 output by the start-stop point judgment branch approach or equal the true start point position s1 and true end point position s2 of the wake-up word in the output vector, respectively.
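The corresponding sketch for the second loss function, again mirroring the formula above:

```python
import torch

def start_stop_loss(k, s):
    """L2 = (s1-k1)^2 + (s2-k2)^2, summed per sample, averaged over the batch.

    k: predicted (start, end) positions, shape (batch, 2)
    s: labelled true (start, end) positions, shape (batch, 2)
    """
    return ((s - k) ** 2).sum(dim=1).mean()
```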
Referring again to fig. 1, in some examples, the output result of the wake-up word classification branch and/or the output result of the start-stop point judgment branch may also be used to adjust the parameters of the neural network structure so as to obtain an updated neural network structure. Illustratively, the neural network structure may use an error back-propagation algorithm to revise its parameters during training, so that the reconstruction error loss of the neural network structure becomes smaller and smaller until the error loss converges.
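Putting the pieces together, one possible multitask update step is sketched below. It reuses the WakeWordModel and the two loss sketches above; masking the regression loss to wake-word samples reflects the statement that only samples containing the wake-up word train the start-stop branch, and loss.backward() propagates both losses into the shared backbone. The optimizer and learning rate are assumptions.

```python
import torch

model = WakeWordModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(x, y_hat, s, has_wake):
    """x: feature frames; y_hat: wake-word labels; s: true start/stop
    positions; has_wake: boolean mask of samples containing the wake word."""
    optimizer.zero_grad()
    y, k = model(x)                                  # both branch outputs
    loss = wake_word_loss(y.squeeze(1), y_hat)       # all samples
    if has_wake.any():                               # wake-word samples only
        loss = loss + start_stop_loss(k[has_wake], s[has_wake])
    loss.backward()                                  # updates both branches
    optimizer.step()                                 # and the shared backbone
    return loss.item()
```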
For a better understanding of the present application, a more detailed embodiment is listed below, and as shown in fig. 5, an embodiment of the present application provides a training method for a voice wake model, which includes the following steps:
step S501: and generating wake word samples and non-wake word samples.
In this step, the wake-up word of the voice wake-up model is set to "Hi Le Xin", and all other audio samples are non-wake-up words. The wake-up word samples may be collected by manually recording the phrase "Hi Le Xin" in a quiet, low-noise environment. To enhance the training effect, after the wake-up word samples are collected, their complexity is increased by adding noise, adjusting the position of the wake-up word in the sample audio, and the like. Non-wake-up word samples are obtained by intercepting other audio that contains human speech but not the wake-up word, such as television series and films.
Step S502: marking the starting and ending points of the wake-up words.
In this step, the start and end points of the wake-up word in the audio sample are required for training. To this end, the starting time point and ending time point of the wake-up word in the audio sample can be marked by machine labeling or manual labeling. Non-wake-up word samples need not be annotated.
Step S503: and intercepting the data frame.
In this step, a window of 1 s length is moved from the start point of the wake-up word in the audio sample at intervals of 32 ms, intercepting a plurality of data frames, which may optionally be uploaded to the cloud for secondary recognition of the wake-up word.
In one example, in order to avoid repeated computation, save computing resources and increase recognition speed, among the several intercepted data frames containing the wake-up word, only the first data frame may be uploaded to the cloud for secondary recognition of the wake-up word.
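A sketch of this sliding-window interception with numpy, using the 1 s window and 32 ms step of this embodiment (a 16 kHz sample rate is assumed):

```python
import numpy as np

def slice_windows(audio, sr=16000, win_s=1.0, hop_s=0.032):
    """Cut fixed-length windows every 32 ms starting at the wake-word start."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    frames = [audio[i:i + win] for i in range(0, len(audio) - win + 1, hop)]
    if not frames:                           # audio shorter than one window
        return np.empty((0, win), dtype=audio.dtype)
    return np.stack(frames)                  # shape: (n_windows, win)
```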
Step S504: feature extraction and feature enhancement of a data frame.
In this step, feature extraction and feature enhancement are performed on the one-dimensional arrays in the voice samples through MFCC, and 32-bit 50×32 two-dimensional feature arrays are output.
Step S505: outputting a one-dimensional vector by convolution calculation.
In this step, the 32-bit 50×32 two-dimensional feature arrays are input into the neural network structure, and a one-dimensional vector is output after convolution calculation.
Step S506: training wake-up word classifying branch and start-stop point judging branch
In the step, one-dimensional vectors output by the neural network structure are respectively input into a wake-up word classification branch and a start-stop point judgment branch, and classification tasks of wake-up words and non-wake-up words and regression tasks of calculating start points and end points of the wake-up words are operated at the same time.
Both the wake-up word classification branch and the start-stop point judgment branch adopt a fully-connected layer structure (also called a linear layer), which serves as the last layer of the neural network and produces the outputs used in subsequent operations. The loss function of the wake-up word classification branch is cross entropy; the loss function of the start-stop point judgment branch is the mean square error between the output start/end point positions and the true start/end point positions. For the specific data types and parameter adjustment of the two branches, refer to the descriptions of the first and second loss functions above, which are not repeated here. Considering that the two branches use different parameters in subsequent operations, the two tasks need to be trained separately.
The wake-up word classification branch is used to train the wake-up word classification task. Its input includes one-dimensional vector data samples of both wake-up words and non-wake-up words, and after training it outputs two values: the probability that the sample is a wake-up word and the probability that it is not, the two probabilities summing to 1. The output of the wake-up word classification branch is compared with the audio annotations to evaluate the training effect of the model and to adjust parameters to optimize the branch.
The start-stop point judgment branch is used to train the regression task of the start and end points of the wake-up word. Its input includes only one-dimensional vector data samples of wake-up words, and after training it outputs two values: the start point position and the end point position of the wake-up word. The output of the start-stop point judgment branch is compared with the audio annotations to evaluate the training effect of the model and to adjust parameters to optimize the branch.
A large number of tests have shown the embodiment of the application to be effective: for example, on the resource-constrained ESP32-S3, a recognition rate of more than 95% is achieved using only 5% of the CPU's computing resources, and Amazon's Alexa performance test is passed.
In summary, the training method of the voice wake-up model in the embodiment of the application adopts multitask training: a wake-up word classification branch and a start-stop point judgment branch are arranged in parallel at the end of the wake-up word detection network, and the wake-up word classification task and the wake-up word start-stop point regression task are trained simultaneously, so that the model outputs whether the audio sample contains the wake-up word together with the start point position and end point position of the wake-up word in the audio sample. The trained voice wake-up model can therefore obtain the start and end points of the wake-up word quickly and accurately, helping to intercept the wake-up word in the audio frame more precisely. Because the classification task and the start-stop point regression task are trained simultaneously, the effect of the neural network structure is enhanced, and compared with a single training task, the accuracy of wake-up word detection is further improved. In addition, since the embodiment of the application can intercept the wake-up word more accurately, it is particularly advantageous in scenarios requiring secondary verification in the cloud.
Based on the above embodiments, the trained voice wake-up model may be preset in the voice wake-up device, so that the voice wake-up device has a voice wake-up function. Correspondingly, the embodiment of the application provides a method for executing a voice wake-up model, as shown in fig. 6, which comprises the following steps:
step S601: and acquiring voice data to be detected.
Step S602: and inputting the voice data to be detected into the trained voice wake-up model.
Step S603: and obtaining the output result of the wake-up word classification branch and the output result of the start-stop point judgment branch.
Step S604: and outputting the wake-up result of the voice wake-up model based on the output result of the wake-up word classification branch and/or the output result of the start-stop point judgment branch.
In some embodiments, the method for executing the voice wake model further includes: and sending the output result of the wake-up word classification branch and/or the output result of the start and stop point judgment branch to a voice recognition cloud platform to perform secondary recognition of the wake-up word.
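A sketch of how a device might use the two branch outputs at inference time is shown below. The 0.9 trigger threshold is an assumption, as is the convention that the start-stop branch reports positions in seconds within the analysed window; upload_for_secondary_check stands in for whatever cloud interface the device uses and is hypothetical.

```python
import torch

def run_wakeup(model, features, audio, sr=16000, threshold=0.9):
    """features: (1, 1, n_frames, n_mfcc) tensor for one analysis window.
    audio: the raw samples of that same window."""
    model.eval()
    with torch.no_grad():
        y, k = model(features)
    if y.item() < threshold:
        return None                              # no wake word detected
    start_s, end_s = k.squeeze(0).tolist()       # predicted start/stop points
    clip = audio[int(start_s * sr):int(end_s * sr)]
    upload_for_secondary_check(clip)             # hypothetical cloud call
    return clip
```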
The execution method of voice wake-up provided by the embodiment of the application is based on the trained voice wake-up model and can output the start point position and end point position of the wake-up word in the audio sample while outputting whether the audio sample contains the wake-up word, which significantly improves the accuracy of wake-up word detection and greatly improves voice wake-up performance. Because the application can intercept the wake-up word in the audio frame more accurately, it is particularly advantageous in scenarios requiring secondary verification in the cloud.
Referring to fig. 7, the embodiment of the application also provides a training device of the voice wake-up model, which comprises an acquisition module, a generation module and a training module. The acquisition module is used for acquiring an audio sample and performing acoustic feature processing on the audio sample to obtain an acoustic feature frame. The generation module is used for inputting the acoustic feature frame into the neural network structure to generate an output vector, wherein the output vector has a first attribute and/or a second attribute, the first attribute is whether a wake-up word is contained or not, and the second attribute is a starting point position and an ending point position of the wake-up word in the output vector. The training module is used for training the wake-up word classification branch and the start-stop point judgment branch according to the output vector, and comprises the following steps: respectively inputting the output vector to a wake-up word classification branch and a start-stop point judgment branch, outputting the probability that the output vector contains the wake-up word by the wake-up word classification branch, and outputting the start point position and the end point position of the wake-up word in the output vector by the start-stop point judgment branch; and respectively adjusting the parameters of the wake-up word classification branch and the parameters of the start-stop point judgment branch by using the output result of the wake-up word classification branch and the output result of the start-stop point judgment branch so as to obtain an updated voice wake-up model.
In this embodiment, the specific limitation, implementation principle and beneficial effects of the training device of the voice wake-up model can be referred to the above description of the training method of the voice wake-up model, which is not repeated here. The various modules described above may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
Referring to fig. 8, an embodiment of the present application further provides a voice wake-up device, which includes a memory and a processor, where the memory stores a computer program, and the processor may implement the training step of the voice wake-up model and/or the executing step of the voice wake-up model in the foregoing embodiments when executing the computer program.
The embodiment of the application also provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor can implement the training step of the voice wake-up model and/or the executing step of the voice wake-up model in the above embodiments.
The embodiment of the application also provides a computer program which can realize the training step of the voice wake-up model and/or the executing step of the voice wake-up model in the above embodiments when being executed by a processor.
In the above embodiments, the implementation principles and beneficial effects of the voice wake-up device, the computer readable storage medium and the computer program of the voice wake-up model may be referred to the above description of the training method and the execution method of the voice wake-up model, which are not repeated here.
Those skilled in the art will appreciate that all or part of the above methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium; when executed, the program may include the steps of the method embodiments described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM is available in various forms, such as static random access memory (SRAM) and dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of relational and non-relational databases. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processor referred to in the embodiments provided herein may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
While various embodiments of the various aspects of the present application have been described for the purposes of this disclosure, it should not be construed that the teachings of this disclosure are limited to these embodiments. Features disclosed in one particular embodiment are not limited to that embodiment, but may be combined with features disclosed in a different embodiment. For example, one or more features and/or operations of the method according to the application described in one embodiment may also be applied in another embodiment, alone, in combination, or in whole. It will be understood by those skilled in the art that there are many more alternative embodiments and variations possible and that various changes and modifications may be made to the above system without departing from the scope of the application as defined in the following claims.
Claims (20)
1. A training method for a voice wake-up model, characterized in that the voice wake-up model comprises a neural network structure, a wake-up word classification branch and a start-stop point judgment branch, the method comprising:
acquiring an audio sample, and performing acoustic feature processing on the audio sample to obtain an acoustic feature frame;
inputting the acoustic feature frame into a neural network structure to generate an output vector, wherein the output vector has a first attribute and/or a second attribute, the first attribute is whether a wake-up word is contained or not, and the second attribute is a starting point position and an ending point position of the wake-up word in the output vector; and
training the wake-up word classification branch and the start-stop point judgment branch according to the output vector, wherein the training comprises:
inputting the output vector into the wake-up word classification branch and the start-stop point judgment branch respectively, the wake-up word classification branch outputting the probability that the audio sample contains the wake-up word, and the start-stop point judgment branch outputting the start point position and end point position of the wake-up word in the audio sample; and
adjusting parameters of the wake-up word classification branch and parameters of the start-stop point judgment branch using, respectively, the output result of the wake-up word classification branch and the output result of the start-stop point judgment branch, to obtain an updated wake-up word classification branch and an updated start-stop point judgment branch.
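By way of illustration and not limitation, the following is a minimal sketch of how the training step of claim 1 might look in Python with PyTorch. All names here (WakeWordModel, cls_branch, boundary_branch), the backbone layers, the layer sizes, and the Adam optimizer are assumptions of this sketch, not features taken from the claims:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WakeWordModel(nn.Module):
    """Two-branch wake-up model: a shared backbone feeding a wake-up word
    classification branch and a start-stop point judgment branch."""
    def __init__(self, feat_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        # Shared neural network structure producing the output vector.
        self.backbone = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Wake-up word classification branch: fully-connected layer -> probability.
        self.cls_branch = nn.Linear(hidden_dim, 1)
        # Start-stop point judgment branch: fully-connected layer -> (start, end).
        self.boundary_branch = nn.Linear(hidden_dim, 2)

    def forward(self, x):
        h = self.backbone(x)                      # the output vector
        prob = torch.sigmoid(self.cls_branch(h))  # P(wake word), in [0, 1]
        bounds = self.boundary_branch(h)          # predicted start/end positions
        return prob, bounds

model = WakeWordModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on a dummy batch: features, first attribute (label),
# second attribute (normalized true start/end positions).
feats = torch.randn(8, 64)
labels = torch.randint(0, 2, (8, 1)).float()
bounds_true = torch.rand(8, 2)

prob, bounds_pred = model(feats)
loss = (F.binary_cross_entropy(prob, labels)     # first loss (cf. claim 3)
        + F.mse_loss(bounds_pred, bounds_true))  # second loss (cf. claim 8)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Each loss term reaches only its own branch head (plus the shared backbone), so summing the two losses adjusts each branch's parameters from its respective output result, while the shared structure is updated as described in claim 12.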
2. The method of claim 1, wherein the wake-up word classification branch comprises at least a first fully-connected layer structure connected to the neural network structure, and the output vector of the neural network structure comprises first tag data for the wake-up word classification branch; the first tag data relates to the first attribute, characterizes whether the wake-up word is contained in the output vector, and is input to the first fully-connected layer structure as a first supervised learning objective.
3. The method of claim 1, wherein adjusting the parameters of the wake-up word classification branch using the output result of the wake-up word classification branch comprises:
determining a first loss function of the wake-up word classification branch according to the first attribute of the output vector and the output result of the wake-up word classification branch; and
adjusting the parameters of the wake-up word classification branch based on the first loss function.
4. The method of claim 3, wherein adjusting the parameters of the wake-up word classification branch based on the first loss function comprises:
iteratively updating the parameters of the wake-up word classification branch according to the calculated value of the first loss function until convergence, to obtain the updated wake-up word classification branch.
5. The method of claim 3, wherein the first loss function of the wake-up word classification branch is a cross-entropy loss.
6. The method of claim 5, wherein the first loss function is calculated as:
L1 = -(ŷ · log(y) + (1 - ŷ) · log(1 - y))
wherein ŷ is the label of the wake-up word: when the wake-up word is included in the output vector, ŷ = 1, and when the wake-up word is not included in the output vector, ŷ = 0;
and y is the wake-up word probability output by the wake-up word classification branch, y ranging over [0, 1].
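By way of illustration and not limitation, a worked example of this cross-entropy (the standard binary cross-entropy under our reading of the claim; the function name is an assumption):

```python
import math

def first_loss(y_hat: int, y: float) -> float:
    """Cross-entropy between wake-word label y_hat (1 or 0) and probability y."""
    eps = 1e-12  # guards against log(0) at the ends of y's [0, 1] range
    return -(y_hat * math.log(y + eps) + (1 - y_hat) * math.log(1 - y + eps))

print(first_loss(1, 0.9))  # wake word present, confident prediction -> ~0.105
print(first_loss(1, 0.1))  # wake word present, wrong prediction     -> ~2.303
print(first_loss(0, 0.1))  # wake word absent, confident prediction  -> ~0.105
```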
7. The method of claim 1, wherein the start-stop point judgment branch comprises at least a second fully-connected layer structure connected to the neural network structure, and the output vector of the neural network structure comprises second tag data for the start-stop point judgment branch; the second tag data relates to the second attribute, characterizes the start point position and end point position of the wake-up word in the output vector, and is input to the second fully-connected layer structure as a second supervised learning objective.
8. The method of claim 1, wherein adjusting the parameters of the start-stop point judgment branch using the output result of the start-stop point judgment branch comprises:
determining a second loss function of the start-stop point judgment branch according to the second attribute of the output vector and the output result of the start-stop point judgment branch; and
adjusting the parameters of the start-stop point judgment branch based on the second loss function.
9. The method of claim 8, wherein adjusting the parameters of the start-stop point judgment branch based on the second loss function comprises:
iteratively updating the parameters of the start-stop point judgment branch according to the calculated value of the second loss function until convergence, to obtain the updated start-stop point judgment branch.
10. The method of claim 8, wherein the second loss function of the start-stop point judgment branch is the mean square error between the output start point and end point positions and the true start point and end point positions.
11. The method of claim 10, wherein the second loss function is calculated as:
L2 = (s1 - k1)² + (s2 - k2)²
wherein L2 is the mean square error value;
s1 is the true start point position of the wake-up word in the output vector;
s2 is the true end point position of the wake-up word in the output vector;
k1 is the start point position output by the start-stop point judgment branch; and
k2 is the end point position output by the start-stop point judgment branch.
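By way of illustration and not limitation, the mean square error of claim 11 amounts to the following (the function name is an assumption):

```python
def second_loss(s1: float, s2: float, k1: float, k2: float) -> float:
    """Squared error between true boundaries (s1, s2) and predictions (k1, k2)."""
    return (s1 - k1) ** 2 + (s2 - k2) ** 2

# True wake word spans positions 10..25; the branch predicts 12..24.
print(second_loss(10, 25, 12, 24))  # (10 - 12)**2 + (25 - 24)**2 = 5
```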
12. The method of claim 1, further comprising: adjusting parameters of the neural network structure using the output result of the wake-up word classification branch and/or the output result of the start-stop point judgment branch, to obtain an updated neural network structure.
13. The method of claim 1, wherein performing acoustic feature processing on the audio sample comprises:
detecting the end point of the wake-up word in the audio sample, and, starting from the start point of the wake-up word, intercepting data frames according to a fixed-length time window until the end point of the wake-up word, to obtain a plurality of data frames of equal duration, the plurality of data frames being in the form of one-dimensional array data; and
extracting audio frequency spectrum features from the one-dimensional arrays and outputting a two-dimensional feature array.
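By way of illustration and not limitation, a minimal sketch of this framing-and-feature step in Python with NumPy. The window and hop lengths, and the use of a plain magnitude spectrum rather than, say, mel filterbank features, are assumptions of this sketch, not requirements of the claim:

```python
import numpy as np

def frame_and_featurize(audio: np.ndarray, start: int, end: int,
                        win: int = 400, hop: int = 160) -> np.ndarray:
    """Cut fixed-length windows from the wake word's start point to its end
    point, then return a 2-D feature array: one magnitude-spectrum row per
    data frame."""
    segment = audio[start:end]                      # one-dimensional array data
    n_frames = 1 + max(0, len(segment) - win) // hop
    frames = np.stack([segment[i * hop : i * hop + win]
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))      # 2-D feature array

audio = np.random.randn(16000)                      # dummy 1 s of 16 kHz audio
feats = frame_and_featurize(audio, start=2000, end=10000)
print(feats.shape)                                  # (48, 201): frames x bins
```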
14. The method of claim 13, wherein, in response to the intercepted data frames containing a plurality of the wake-up words, the data frame containing the first wake-up word is retained.
15. The method of claim 13, wherein the inputting the acoustic feature frame into a neural network structure to generate an output vector comprises:
inputting the two-dimensional feature array into the neural network structure, and generating a one-dimensional vector through a convolution algorithm.
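By way of illustration and not limitation, one way such a convolution step could map the two-dimensional feature array to a one-dimensional vector (a PyTorch sketch; the layer choices are assumptions of this sketch):

```python
import torch
import torch.nn as nn

# 2-D feature array (time x frequency) in, one-dimensional vector out.
conv_backbone = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d((4, 4)),  # collapse variable time/freq to a fixed grid
    nn.Flatten(),                  # -> one-dimensional vector per sample
)

x = torch.randn(1, 1, 48, 201)     # batch of one 2-D feature array
out = conv_backbone(x)
print(out.shape)                   # torch.Size([1, 256]) == 16 * 4 * 4
```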
16. A method of executing a voice wake-up model, comprising:
acquiring voice data to be detected;
inputting the voice data to be detected into a voice wake-up model trained by the method of any one of claims 1 to 15;
obtaining the output result of the wake-up word classification branch and the output result of the start-stop point judgment branch; and
outputting a wake-up result of the voice wake-up model based on the output result of the wake-up word classification branch and/or the output result of the start-stop point judgment branch.
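By way of illustration and not limitation, reusing the WakeWordModel sketch that follows claim 1, the execution of claim 16 could fuse the two branch outputs as follows; the decision threshold is an assumption of this sketch, not something the claim specifies:

```python
import torch

THRESHOLD = 0.5   # wake decision threshold; an assumption of this sketch

@torch.no_grad()
def detect_wake(model, feats: torch.Tensor):
    """Run both branches and fuse their outputs into a wake-up result."""
    prob, bounds = model(feats)          # classification and start-stop outputs
    if prob.item() >= THRESHOLD:
        start, end = bounds.squeeze().tolist()
        return True, (start, end)        # wake up, reporting the word's location
    return False, None                   # stay asleep

# `model` is the trained WakeWordModel from the sketch following claim 1.
woke, span = detect_wake(model, torch.randn(1, 64))
```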
17. The method of claim 16, further comprising:
sending the output result of the wake-up word classification branch and/or the output result of the start-stop point judgment branch to a speech recognition cloud platform for secondary recognition of the wake-up word.
18. A training device for a voice wake-up model, the voice wake-up model comprising a neural network structure, a wake-up word classification branch, and a start-stop point judgment branch, the device comprising:
an acquisition module configured to acquire an audio sample and perform acoustic feature processing on the audio sample to obtain an acoustic feature frame;
a generation module configured to input the acoustic feature frame into a neural network structure to generate an output vector, wherein the output vector has a first attribute and/or a second attribute, the first attribute being whether a wake-up word is contained, and the second attribute being the start point position and end point position of the wake-up word in the output vector; and
a training module configured to train the wake-up word classification branch and the start-stop point judgment branch according to the output vector, wherein the training comprises:
inputting the output vector into the wake-up word classification branch and the start-stop point judgment branch respectively, the wake-up word classification branch outputting the probability that the output vector contains the wake-up word, and the start-stop point judgment branch outputting the start point position and end point position of the wake-up word in the output vector; and
adjusting parameters of the wake-up word classification branch and parameters of the start-stop point judgment branch using, respectively, the output result of the wake-up word classification branch and the output result of the start-stop point judgment branch, to obtain an updated voice wake-up model.
19. A voice wake-up apparatus comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the training method of any one of claims 1 to 15 and/or the execution method of claim 16 or 17.
20. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the training method of any one of claims 1 to 15 and/or the execution method of claim 16 or 17.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311213749.XA CN117174082A (en) | 2023-09-19 | 2023-09-19 | Training and executing method, device, equipment and storage medium of voice wake-up model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117174082A true CN117174082A (en) | 2023-12-05 |
Family
ID=88931728
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311213749.XA Pending CN117174082A (en) | 2023-09-19 | 2023-09-19 | Training and executing method, device, equipment and storage medium of voice wake-up model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117174082A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |