WO2023193394A1 - Training of a voice wake-up model, wake-up method, apparatus, device, and storage medium
- Publication number
- WO2023193394A1 (PCT/CN2022/115175)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- model
- module
- audio data
- training
- wake
- Prior art date
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/063—Training (under G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
- G10L17/04—Training, enrolment or model building
- G10L17/18—Artificial neural networks; Connectionist approaches
- G10L17/22—Interactive procedures; Man-machine interfaces
- G10L17/24—Interactive procedures; Man-machine interfaces the user being prompted to utter a password or a predefined phrase
Definitions
- the present disclosure relates to the fields of deep learning and speech technology within artificial intelligence, and in particular to the training of a voice wake-up model, and to wake-up methods, apparatuses, devices and storage media.
- Voice wake-up, as the switch for voice interaction, plays an important role in it.
- the voice wake-up function has the problem of multiple devices being woken up at the same time. For example, devices of the same brand often support the same wake-up word, which can lead to the awkward situation of multiple devices responding to a single wake-up.
- the present disclosure provides a voice wake-up model training, wake-up method, device, equipment and storage medium.
- a training method for a voice wake-up model, including:
- the basic model includes an encoding module and a decoding module;
- obtaining voice wake-up training data; performing voice wake-up training on the first model according to the voice wake-up training data, and obtaining the first model when the model loss function converges;
- the first model obtained when the model loss function converges is used as the voice wake-up model.
- a voice wake-up method, including:
- encoding, based on the encoding module of the voice wake-up model, the FBank features corresponding to the audio data to obtain a feature encoding sequence corresponding to the audio data; using connectionist temporal classification (CTC) decoding to determine a target feature encoding sequence whose score in the feature encoding sequence is greater than or equal to a preset value;
- decoding and analyzing, based on the decoding module of the voice wake-up model, the target feature encoding sequence and the semantic tag sequence to determine whether to wake up the terminal device.
- a training device for a voice wake-up model, including:
- the first training module is used to perform speech recognition training on the basic model according to the speech recognition training data, and obtain the model parameters of the basic model when the model loss function converges;
- the basic model includes an encoding module and a decoding module;
- a model configuration module configured to respond to a user-initiated model configuration instruction and update the configuration parameters of the decoding module in the basic model based on the model parameters of the basic model to obtain the first model;
- the second acquisition module is used to acquire voice wake-up training data
- a second training module configured to perform voice wake-up training on the first model based on the voice wake-up training data, to obtain the first model when the model loss function converges
- a model generation module configured to use the first model obtained when the model loss function converges as the voice wake-up model.
- a voice wake-up device including:
- the receiving module is used to receive audio data input by the user
- a feature extraction module used to extract features from the audio data to obtain the FBank features corresponding to the audio data
- the first processing module is configured to encode, based on the encoding module of the voice wake-up model, the FBank features corresponding to the audio data to obtain the feature encoding sequence corresponding to the audio data;
- the second processing module is used to use CTC decoding to determine the target feature coding sequence with a score greater than or equal to the preset value in the feature coding sequence;
- the acquisition module is used to obtain the semantic tag sequence corresponding to the user-defined wake word
- the third processing module is configured to decode and analyze, based on the decoding module of the voice wake-up model, the target feature encoding sequence and the semantic tag sequence, and determine whether to wake up the terminal device.
- an electronic device including:
- the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method described in the first aspect or the second aspect.
- a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause the computer to execute the method described in the first or second aspect.
- a computer program product, comprising a computer program stored in a readable storage medium; at least one processor of an electronic device can read the computer program from the readable storage medium, and the at least one processor executes the computer program to cause the electronic device to perform the method described in the first aspect or the second aspect.
- Technology according to the present disclosure improves the accuracy of custom voice wake-up.
- Figure 1 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure
- Figure 2 is a schematic structural diagram of a voice wake-up model provided by an embodiment of the present disclosure
- Figure 3 is a schematic flowchart of a training method for a voice wake-up model provided by an embodiment of the present disclosure
- Figure 4 is a schematic flowchart of creating speech recognition training data provided by an embodiment of the present disclosure
- Figure 5 is a schematic flowchart of creating voice wake-up training data provided by an embodiment of the present disclosure
- Figure 6 is a schematic flowchart of a voice wake-up method provided by an embodiment of the present disclosure
- Figure 7 is a schematic structural diagram of a voice wake-up model provided by an embodiment of the present disclosure.
- Figure 8 is a schematic structural diagram of a training device for a voice wake-up model provided by an embodiment of the present disclosure
- Figure 9 is a schematic structural diagram of a voice wake-up device provided by an embodiment of the present disclosure.
- FIG. 10 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
- the problem of multiple devices being woken up at the same time can be solved by custom wake-up words. For example, users can set different custom wake-up words for different devices according to their needs. However, the accuracy of voice wake-up based on custom wake-up words is not high.
- the present disclosure provides a voice wake-up model training method and a voice wake-up method, which are applied in the field of deep learning and voice technology in the field of artificial intelligence to improve the accuracy of customized voice wake-up.
- FIG 1 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure. As shown in Figure 1, this application scenario involves the model training stage and the model application stage.
- the model training device trains the first model based on the first training data set.
- the first training data set includes training data for speech recognition, and the first model has a speech recognition function.
- the second model is obtained by adjusting the model configuration parameters of the first model.
- the model training device trains a second model based on a second training data set, where the second training data set includes training data for voice wake-up, and the second model has a voice wake-up function.
- the trained second model is used as the final voice wake-up model.
- model structures of the first model and the second model are consistent.
- the difference between the two is that the parameter configuration of the model output part is different, for example, the dimensional parameters are different.
- the first model has a speech recognition function, and the output of the first model is syllable information corresponding to the audio data, such as an atonal syllable sequence.
- the second model has a voice wake-up function, and the output of the second model is a binary wake-up result of whether to wake up or not. It can be seen that the dimensional parameters of the model output parts of the first model and the second model are different.
- the second trained model is preset in the voice wake-up device, that is, the voice wake-up model in Figure 1.
- the voice wake-up device receives the audio data input by the user; after the audio data is preprocessed and then processed and analyzed by the voice wake-up model, the result of whether to wake up is finally obtained.
- the voice wake-up model may adopt an encoder-decoder architecture, or another model structure that includes an encoder and a decoder.
- the encoder can also be called the encoding module, and the decoder can also be called the decoding module.
- Figure 2 is a schematic structural diagram of a voice wake-up model provided by an embodiment of the present disclosure.
- the voice wake-up model provided by this embodiment includes an encoding module and a decoding module.
- the encoding module includes the Convolutional Neural Networks (CNN) module and the Recurrent Neural Networks (RNN) module.
- Figure 2 shows 2 CNN modules and 2 RNN modules.
- in the data processing of the encoding module, the data first passes through the 2 CNN modules and then through the 2 RNN modules.
- the encoding module is used to encode audio features and obtain encoded feature data.
- the decoding module includes an attention mechanism module, RNN modules, a fully connected (FC) module and a normalization (softmax) module.
- Figure 2 shows two RNN modules.
- in the data processing of the decoding module, one channel of data is input to the attention mechanism module through an RNN module, and the other channel of data is input to the attention mechanism module directly; after being processed by the attention mechanism module, the data then passes through an RNN module, the fully connected module and the normalization module, and the result of whether to wake up is finally output.
- each of the above RNN modules can be replaced by a Long Short-Term Memory (LSTM) module.
- the voice wake-up model has two channels of input data.
- the first channel of input data is the feature sequence corresponding to the audio data, such as filter bank (FBank) features or Mel-frequency cepstral coefficient (MFCC) features.
- the second input data is a sequence of semantic tags corresponding to the audio data.
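- The structure described above can be sketched roughly as follows in PyTorch. This is a minimal, hedged illustration: the layer sizes, the use of GRU cells as the RNN modules, multi-head attention as the attention mechanism, and all hyper-parameter values are assumptions for illustration, not the exact configuration of the disclosed model.

```python
# Minimal PyTorch sketch of the encoder-decoder wake-up model of Figure 2.
# All sizes and the specific RNN/attention variants are illustrative assumptions.
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Two CNN blocks followed by two RNN layers, as in the encoding module."""

    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=(2, 2), padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=(2, 1), padding=1), nn.ReLU(),
        )
        self.rnn = nn.GRU(32 * (n_mels // 2), hidden, num_layers=2, batch_first=True)

    def forward(self, fbank: torch.Tensor) -> torch.Tensor:   # (B, T, n_mels)
        x = self.cnn(fbank.unsqueeze(1))                       # (B, 32, T', n_mels // 2)
        x = x.permute(0, 2, 1, 3).flatten(2)                   # (B, T', 32 * n_mels // 2)
        out, _ = self.rnn(x)                                   # feature encoding sequence
        return out


class Decoder(nn.Module):
    """Attention + RNN + fully connected + softmax, as in the decoding module."""

    def __init__(self, hidden: int = 256, tag_vocab: int = 5000, out_dim: int = 1400):
        super().__init__()
        self.embed = nn.Embedding(tag_vocab, hidden)
        self.tag_rnn = nn.GRU(hidden, hidden, batch_first=True)   # RNN before attention
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.out_rnn = nn.GRU(hidden, hidden, batch_first=True)   # RNN after attention
        # out_dim is the syllable inventory during recognition training; it is
        # re-configured to 2 (wake / no-wake) for the wake-up training stage.
        self.fc = nn.Linear(hidden, out_dim)

    def forward(self, enc_out: torch.Tensor, tag_seq: torch.Tensor) -> torch.Tensor:
        q, _ = self.tag_rnn(self.embed(tag_seq))          # semantic tag sequence channel
        ctx, _ = self.attn(q, enc_out, enc_out)           # attend over encoder features
        h, _ = self.out_rnn(ctx)
        return self.fc(h).log_softmax(dim=-1)             # (B, L, out_dim)


if __name__ == "__main__":
    enc, dec = Encoder(), Decoder()
    fbank = torch.randn(2, 120, 80)                        # 2 utterances, 120 frames
    tags = torch.randint(0, 5000, (2, 8))                  # semantic tag sequences
    print(dec(enc(fbank), tags).shape)                     # torch.Size([2, 8, 1400])
```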
- the model training device or voice wake-up device in the embodiments of the present disclosure may be a terminal device, a server, a virtual machine, etc., or a distributed computer system composed of one or more servers and/or computers.
- the terminal device includes but is not limited to: smart phones, laptop computers, desktop computers, tablet computers, vehicle-mounted devices, smart wearable devices, etc., which are not limited in the embodiments of the present disclosure.
- the server can be an ordinary server or a cloud server.
- the cloud server is also called a cloud computing server or a cloud host. It is a host product in the cloud computing service system.
- the server can also be a distributed system server or a server combined with a blockchain.
- training data sets for two rounds of model training are first constructed, that is, a training data set for speech recognition and a training data set for voice wake-up.
- pre-training for speech recognition is performed first.
- after the model configuration parameters are adjusted, secondary training for voice wake-up is performed based on the pre-trained model and the voice wake-up training data set, and a voice wake-up model that can recognize user-defined wake-up words is finally generated.
- the voice wake-up model can adopt the encoder-decoder structure and use an attention mechanism in the decoder part.
- compared with existing wake-up solutions, the voice wake-up model obtained by the above training process improves the recognition accuracy of custom voice wake-up; while reducing the false alarm rate, its recognition accuracy is close to the level of customized (factory preset) wake-up, which can meet the wake-up needs of scenarios such as vehicles and smart homes.
- FIG. 3 is a schematic flowchart of a training method for a voice wake-up model provided by an embodiment of the present disclosure. This method is explained using the model training device in Figure 1 as the execution subject. As shown in Figure 3, the training method of the voice wake-up model may include the following steps:
- Step 301 Obtain speech recognition training data.
- the speech recognition training data includes FBank features, semantic tag sequences, and syllable sequences corresponding to the first audio data.
- the first audio data is any piece of audio data input by the user that includes a custom wake-up word.
- the FBank feature corresponding to the first audio data is obtained by performing feature extraction on the first audio data.
- the semantic tag sequence corresponding to the first audio data is used to indicate the semantic information of the first audio data.
- the syllable sequence corresponding to the first audio data is a frame-level aligned atonal syllable sequence.
- Step 302 Perform speech recognition training on the basic model based on the speech recognition training data, and obtain the model parameters of the basic model when the model loss function converges.
- the FBank features corresponding to the first audio data in the speech recognition training data and the semantic tag sequence corresponding to the first audio data are used as inputs of the basic model, and the syllable sequence corresponding to the first audio data in the speech recognition training data is used as the output of the basic model for model training.
- the model parameters when the basic model loss function converges are obtained.
- the internal structure of the basic model can be referred to Figure 2 and will not be described again here.
- Step 303 In response to the model configuration instruction initiated by the user, based on the model parameters of the basic model, update the configuration parameters of the decoding module in the basic model to obtain the first model.
- the configuration parameters include the output dimensions of the model. It should be noted that other parameters of the updated model can be set according to actual needs, and this embodiment does not impose any restrictions on this.
- updating the configuration parameters of the decoding module in the basic model includes: modifying the output dimension of the decoding module in the basic model at the output end. Since the output of the voice wake-up task includes only wake-up and non-wake-up results, the output dimension of the decoding module in the basic model needs to be modified to two dimensions at the output end.
- the above modification logic can be written into the configuration information of the device. Based on the device configuration information and the trained basic model, the configuration parameters of the decoding module in the trained basic model are updated.
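- As a hedged illustration of this re-configuration step, the sketch below replaces the output projection of the decoder from the earlier sketch with a two-dimensional layer (wake / no-wake) while keeping all other pre-trained parameters; the function name and sizes are assumptions for illustration only.

```python
# Sketch of step 303: keep the pre-trained parameters and re-initialise only the
# decoder's output projection with an output dimension of two (wake / no-wake).
import torch.nn as nn


def configure_for_wakeup(decoder: nn.Module, hidden: int = 256) -> nn.Module:
    decoder.fc = nn.Linear(hidden, 2)   # new fully connected layer, 2 output dims
    return decoder
```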
- Step 304 Obtain voice wake-up training data.
- the voice wake-up training data includes positive example training data and negative example training data. The positive example training data includes the FBank features, semantic tag sequence and wake-up tag corresponding to the first audio data, and the negative example training data includes the FBank features corresponding to negative example audio data, a randomly generated semantic tag sequence and a non-wake-up tag.
- the positive example training data of the voice wake-up training data is constructed based on the speech recognition training data. Different from the speech recognition training data, the positive example training data replaces the syllable sequence corresponding to the first audio data with the wake-up tag.
- the first audio data is also called positive example audio data in the voice wake-up training stage.
- the audio data corresponding to the negative example training data of the voice wake-up training data is also audio data input by the user (that is, negative example audio data), except that it does not contain the user-defined wake-up word. It should be pointed out that, for the negative example audio data, a semantic tag sequence can be randomly generated, as long as the randomly generated semantic tag sequence is different from the semantic tag sequence corresponding to the negative example audio data.
- Step 305 Perform voice wake-up training on the first model according to the voice wake-up training data to obtain the first model when the model loss function converges.
- the FBank features corresponding to the positive example audio data in the voice wake-up training data and the corresponding semantic tag sequence are used as the input of the first model,
- and the wake-up tag is used as the output of the first model;
- the FBank features corresponding to the negative example audio data in the voice wake-up training data and the randomly generated semantic tag sequence are used as the input of the first model,
- and the non-wake-up tag is used as the output of the first model for model training.
- both the basic model and the first model include an encoding module and a decoding module.
- the encoding module corresponds to a loss function
- the decoding module corresponds to a loss function. Therefore, the basic model and the first model both include two loss functions.
- the loss function corresponding to the encoding module of the basic model is the same as the loss function corresponding to the encoding module of the first model; for example, both use the connectionist temporal classification (CTC) loss function. The loss function corresponding to the decoding module of the basic model is the same as the loss function corresponding to the decoding module of the first model; for example, both use the cross-entropy (CE) loss function.
- alternatively, the loss function corresponding to the encoding module of the basic model may be different from the loss function corresponding to the encoding module of the first model, and the loss function corresponding to the decoding module of the basic model may be different from the loss function corresponding to the decoding module of the first model.
- Step 306 Use the first model obtained when the model loss function converges as the voice wake-up model.
- in the training method of the voice wake-up model shown in this embodiment, the created speech recognition training data and voice wake-up training data are obtained. First, speech recognition training is performed on the basic model according to the speech recognition training data, and the model parameters of the basic model when the model loss function converges are obtained; then, based on the model configuration instruction initiated by the user, the configuration parameters of the decoding module in the basic model are updated to obtain the first model; finally, voice wake-up training is performed on the first model according to the voice wake-up training data, and the first model obtained when the model loss function converges is used as the final voice wake-up model.
- the above model training solution is based on the model parameters of speech recognition training. By adjusting the configuration parameters of the model decoding module and performing voice wake-up training, it can improve the convergence speed of voice wake-up model training, improve the recognition accuracy of the voice wake-up model, and reduce the false alarm rate.
- updating the configuration parameters of the decoding module in the basic model to obtain the first model includes: in response to the user-initiated model configuration instruction, updating, based on the model parameters of the basic model, the configuration parameters of the fully connected sub-module and the normalization sub-module of the decoding module in the basic model to obtain the first model.
- updating the configuration parameters of the fully connected sub-module and the normalization sub-module of the decoding module in the basic model includes: updating the output dimensions of the fully connected sub-module and the normalization sub-module of the decoding module in the basic model to two dimensions.
- This embodiment realizes the adjustment of the model function, from the speech recognition function to the voice wake-up function, by adjusting the model parameters of the basic model, and then achieves the purpose of voice wake-up through a second round of model training, that is, voice wake-up training.
- the loss function of the basic model and the loss function of the first model both include a CTC loss function and a CE loss function.
- the CTC loss function is used to train the encoding module of the basic model or of the first model.
- the CE loss function is used to train the decoding module of the basic model or of the first model.
- performing speech recognition training on the basic model according to the speech recognition training data and obtaining the model parameters of the basic model when the model loss function converges includes: jointly training the encoding module and the decoding module of the basic model according to the speech recognition training data; when the CTC loss function corresponding to the encoding module and the CE loss function corresponding to the decoding module both converge, obtaining the model parameters of the basic model.
- performing voice wake-up training on the first model according to the voice wake-up training data to obtain the first model when the model loss function converges includes: jointly training the encoding module and the decoding module of the first model according to the voice wake-up training data;
- when the CTC loss function corresponding to the encoding module and the CE loss function corresponding to the decoding module both converge, the first model is obtained.
- model training is performed on the basic model and the first model respectively, which can accelerate the convergence speed of model training.
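- A minimal sketch of such joint training is given below: a CTC loss applied to frame-level encoder outputs (projected to the syllable inventory, an extra projection not shown in the earlier sketch) plus a cross-entropy loss applied to decoder outputs, combined with an assumed weighting. Shapes and the weight alpha are illustrative assumptions rather than the disclosed configuration.

```python
# Sketch of the joint objective: CTC loss on the encoding module, CE loss on the
# decoding module; both must converge. The 0.5/0.5 weighting is an assumption.
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
ce_loss = nn.CrossEntropyLoss()


def joint_loss(enc_logits, enc_lens, syll_targets, syll_lens,
               dec_logits, dec_targets, alpha: float = 0.5):
    # enc_logits: (T, B, V) raw scores over the syllable inventory (+ blank).
    l_ctc = ctc_loss(enc_logits.log_softmax(-1), syll_targets, enc_lens, syll_lens)
    # dec_logits: (B, L, C) decoder scores; dec_targets: (B, L) class ids.
    l_ce = ce_loss(dec_logits.reshape(-1, dec_logits.size(-1)), dec_targets.reshape(-1))
    return alpha * l_ctc + (1 - alpha) * l_ce
```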
- Figure 4 is a schematic flowchart of creating speech recognition training data provided by an embodiment of the present disclosure. This process may use the model training device in Figure 1 as the execution subject, or may use other devices independent of the model training device as the execution subject. As shown in Figure 4, the creation process of speech recognition training data can include the following steps:
- Step 401 Receive first audio data input by the user, where the first audio data is audio data including a custom wake-up word.
- Step 402 Perform feature extraction on the first audio data to obtain FBank features corresponding to the first audio data.
- performing feature extraction on the first audio data refers to performing feature extraction on the first audio data frame by frame.
- after framing, a time-domain signal is obtained for each frame.
- the time-domain signal first needs to be converted into a frequency-domain signal.
- the signal can be converted from the time domain to the frequency domain through a Fourier transform; after the Fourier transform is completed, a frequency-domain signal is obtained.
- the energy in each frequency band is different, and the energy spectra of different phonemes are different.
- after the frequency-domain signal is filtered by a set of mel filter banks, the FBank features corresponding to the first audio data are obtained.
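- A short sketch of this extraction step, assuming the torchaudio library and typical 25 ms frames with a 10 ms shift (the concrete windowing parameters are assumptions, not the patent's configuration), is given below.

```python
# Sketch of step 402: frame the waveform, apply an FFT and mel filter banks to
# obtain FBank features.
import torchaudio


def extract_fbank(wav_path: str, n_mels: int = 80):
    waveform, sr = torchaudio.load(wav_path)             # time-domain signal
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform,
        sample_frequency=sr,
        num_mel_bins=n_mels,                              # mel filter bank size
        frame_length=25.0,                                # ms per frame
        frame_shift=10.0,                                 # ms between frames
    )
    return fbank                                          # (num_frames, n_mels)
```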
- Step 403 Obtain the semantic tag sequence and syllable sequence corresponding to the first audio data.
- semantic information is obtained by performing semantic recognition on the first audio data, and based on a preset tag library, a semantic tag sequence corresponding to the semantic information is obtained, that is, a semantic tag sequence corresponding to the first audio data.
- the tag library includes the number corresponding to each Chinese character, that is, the corresponding relationship between each Chinese character and number. For example, “you” and “good” correspond to 8 and 9 respectively.
- a digital sequence corresponding to the semantic information can be generated based on the tag library, and the digital sequence is a semantic tag sequence.
- for example, the semantic information obtained through semantic recognition is "Xiao Man, what is the weather today?" (小曼今天天气如何).
- in the tag library, "小" (Xiao) corresponds to 0, "曼" (Man) corresponds to 7, "今" corresponds to 2, "天" corresponds to 1, "气" corresponds to 3, "如" corresponds to 5, and "何" corresponds to 6.
- the corresponding semantic tag sequence that can be generated is {0,7,2,1,1,3,5,6}.
- the user-defined wake-up word is "Xiaoman”.
- the syllables corresponding to each frame in the first audio data are obtained, and the syllable sequence corresponding to the first audio data is obtained based on the preset tag library.
- the tag library includes the numbers corresponding to the syllables of each Chinese character, that is, the correspondence between the syllables of each Chinese character and numbers.
- for example, the syllables "ni" and "hao" correspond to 8 and 9 respectively.
- a syllable sequence corresponding to the audio data can be generated based on the tag library.
- for example, the frame-level syllables corresponding to "xiaoman" in the audio data are "xiao xiao xiao man man",
- and the syllables "xiao" and "man" in the tag library correspond to 0 and 7 respectively,
- so the corresponding syllable sequence {0,0,0,7,7} can be generated.
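- The toy snippet below reproduces the two look-ups described above (characters to semantic tags, frame-level syllables to syllable tags); the dictionaries only contain the ids used in the example and are otherwise arbitrary assumptions.

```python
# Toy illustration of the tag library: map a transcript and a frame-level
# syllable alignment to numeric sequences. Ids follow the example in the text.
char_tags = {"小": 0, "曼": 7, "今": 2, "天": 1, "气": 3, "如": 5, "何": 6}
syllable_tags = {"xiao": 0, "man": 7}


def to_tag_sequence(tokens, table):
    return [table[t] for t in tokens]


print(to_tag_sequence("小曼今天天气如何", char_tags))                          # [0, 7, 2, 1, 1, 3, 5, 6]
print(to_tag_sequence(["xiao", "xiao", "xiao", "man", "man"], syllable_tags))  # [0, 0, 0, 7, 7]
```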
- step 402 and step 403 can be executed sequentially or simultaneously, and this embodiment does not impose any limitation on this.
- Step 404 Use the FBank features, semantic label sequence and syllable sequence corresponding to the first audio data as a set of training data for speech recognition training.
- the FBank feature corresponding to the first audio data is used as the input of the encoding module of the basic model
- the semantic tag sequence corresponding to the first audio data is used as the input of the decoding module of the basic model
- the syllable sequence corresponding to the first audio data is used as the output of the decoding module of the basic model.
- the MFCC features corresponding to the first audio data can also be extracted, and the MFCC features, semantic label sequences and syllable sequences corresponding to the first audio data can be used as a set of training data for speech recognition training.
- the MFCC feature is obtained by applying a discrete cosine transform (DCT) on top of the FBank feature. Compared with the FBank feature, it has better discriminability, but its computational cost is higher.
- the speech recognition training data shown in this embodiment includes multiple sets of training data, and each set of training data includes user-defined audio data, semantic tag sequences and syllable sequences corresponding to the audio data.
- FIG. 5 is a schematic flowchart of creating voice wake-up training data provided by an embodiment of the present disclosure. This process may use the model training device in Figure 1 as the execution subject, or may use other devices independent of the model training device as the execution subject. As shown in Figure 5, the creation process of voice wake-up training data can include the following steps:
- Step 501 Use the FBank features, semantic tag sequence and wake-up tag corresponding to the first audio data input by the user as a set of positive example data for voice wake-up training.
- the first audio data is audio data including a customized wake word.
- when the positive example data of the voice wake-up training data is created, it is only necessary to replace the syllable sequence corresponding to the first audio data with the wake-up tag to obtain a set of positive example data.
- the FBank features corresponding to the first audio data are used as the input of the encoding module of the first model,
- the semantic tag sequence corresponding to the first audio data is used as the input of the decoding module of the first model,
- and the wake-up tag is used as the output of the first model.
- Step 502 Receive second audio data input by the user.
- the second audio data is audio data that does not include the user-defined wake-up word.
- Step 503 Perform feature extraction on the second audio data to obtain FBank features corresponding to the second audio data.
- for the specific process, reference may be made to step 402 of the above embodiment, which will not be described again here.
- Step 504 Use the FBank features corresponding to the second audio data, the randomly generated semantic tag sequence and the non-wake-up tag as a set of negative example data for voice wake-up training.
- the FBank features corresponding to the second audio data are used as the input of the encoding module of the first model,
- the randomly generated semantic tag sequence is used as the input of the decoding module of the first model,
- and the non-wake-up tag is used as the output of the first model.
- the randomly generated semantic tag sequence is different from the semantic tag sequence corresponding to the second audio data.
- negative example training data is constructed.
- the semantic information of the negative example audio data is "How is the weather today" and does not include the user-defined wake-up word "Xiaoman".
- the semantic tag sequence corresponding to the negative example audio data is {2,1,1,3,5,6}.
- the randomly generated semantic tag sequence only needs to be different from {2,1,1,3,5,6}.
- the semantic tag sequence corresponding to the semantic information of the positive example audio data, such as {0,7,2,1,1,3,5,6} in the example of step 403, can also be used when constructing negative example training data based on the negative example audio data.
- MFCC features corresponding to the first audio data/second audio data can also be extracted.
- the MFCC features, semantic tag sequence and wake-up tag corresponding to the first audio data are used as a set of positive example data for voice wake-up training, and the MFCC features corresponding to the second audio data, the randomly generated semantic tag sequence and the non-wake-up tag are used as a set of negative example data for voice wake-up training.
- the voice wake-up training data shown in this embodiment includes multiple sets of positive example data and multiple sets of negative example data.
- Each set of positive example data includes user-defined audio data, the semantic tag sequence corresponding to the audio data, and the wake-up tag.
- Each set of negative example data includes audio data without the user-defined wake-up word, a randomly generated semantic tag sequence, and a non-wake-up tag.
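- A hedged sketch of how one positive and one negative training example could be assembled from these ingredients follows; it reuses the `extract_fbank` helper sketched earlier, and the label encoding (1 = wake-up tag, 0 = non-wake-up tag) is an assumption for illustration.

```python
# Sketch of steps 501-504: build (features, tag sequence, label) triples.
import random

WAKE_TAG, NO_WAKE_TAG = 1, 0


def positive_example(wav_path: str, wake_word_tags: list):
    fbank = extract_fbank(wav_path)            # audio that contains the wake word
    return fbank, wake_word_tags, WAKE_TAG


def negative_example(wav_path: str, true_tags: list, tag_vocab: int = 10):
    fbank = extract_fbank(wav_path)            # audio without the wake word
    rand_tags = list(true_tags)
    while rand_tags == true_tags:              # must differ from the true sequence
        rand_tags = [random.randrange(tag_vocab) for _ in range(len(true_tags))]
    return fbank, rand_tags, NO_WAKE_TAG
```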
- based on the above training data, the first model is trained so that the model can accurately determine whether to wake up the device, which improves the wake-up recognition effect for custom wake-up words.
- the trained voice wake-up model can be preset in a voice wake-up device, such as smart speakers, TVs, mobile phones and other devices, so that the device has a customized voice wake-up function.
- the data processing process of the voice wake-up device will be described below with reference to Figure 6.
- FIG. 6 is a schematic flowchart of a voice wake-up method provided by an embodiment of the present disclosure. This process can use the voice wake-up device in Figure 1 as the execution subject. As shown in Figure 6, the voice wake-up method may include the following steps:
- Step 601 Receive audio data input by the user.
- Step 602 Perform feature extraction on the audio data to obtain FBank features corresponding to the audio data.
- Step 603 Encode the FBank features corresponding to the audio data based on the encoding module of the voice wake-up model to obtain the feature encoding sequence corresponding to the audio data.
- the FBank features corresponding to the audio data are used as the input of the encoding module of the voice wake-up model.
- the FBank features corresponding to the audio data are input into the CNN module at the bottom of the encoding module; after being processed by the two CNN modules and the two RNN modules, the feature encoding sequence corresponding to the audio data is output.
- Step 604 Use CTC decoding to determine the target feature coding sequence whose score is greater than or equal to the preset value in the feature coding sequence.
- CTC decoding is based on a sliding window of preset length, for example a 2 s sliding window. Decoding starts from the feature encoding sequence corresponding to the starting position of the sliding window, audio segments whose decoding scores are greater than or equal to the preset value are obtained, and the feature encoding sequence corresponding to such an audio segment, that is, the target feature encoding sequence, is then used as the input of the decoding module of the voice wake-up model.
- the preset value can be set reasonably according to the actual application, and this embodiment does not specifically limit this.
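- The snippet below sketches this filtering step under simplifying assumptions: the per-frame scores are combined with a plain greedy approximation rather than a full CTC prefix search, and the window length, hop and threshold values are placeholders rather than the disclosed settings.

```python
# Simplified sketch of step 604: slide a fixed-length window over the frame-level
# encoder outputs and keep the best window whose (greedy) score clears the threshold.
import torch


def select_target_window(enc_logits: torch.Tensor,
                         win_frames: int = 200, hop: int = 20,
                         threshold: float = -50.0):
    # enc_logits: (T, V) per-frame log-probabilities from the encoding module.
    best_score, best_window = None, None
    for start in range(0, max(1, enc_logits.size(0) - win_frames + 1), hop):
        window = enc_logits[start:start + win_frames]
        score = window.max(dim=-1).values.sum().item()     # greedy path score
        if score >= threshold and (best_score is None or score > best_score):
            best_score, best_window = score, window
    return best_window        # target feature encoding sequence, or None if no hit
```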
- FIG. 7 is a schematic structural diagram of a voice wake-up model provided by an embodiment of the present disclosure. As shown in Figure 7, the processing module for CTC decoding is placed between the encoding module and the decoding module of the voice wake-up model. After filtering by this processing module, the target feature encoding sequence with a score greater than or equal to the preset value is output to the attention mechanism module of the decoding module of the voice wake-up model.
- Step 605 Obtain the semantic tag sequence corresponding to the user-defined wake word.
- the semantic tag sequence corresponding to the user-defined wake-up word is pre-stored in the voice wake-up device, and the acquisition method may refer to step 403 of the above embodiment.
- Step 606 The decoding module based on the voice wake-up model decodes and analyzes the target feature encoding sequence and the semantic label sequence to determine whether to wake up the terminal device.
- the target feature coding sequence and the semantic tag sequence corresponding to the user-defined wake-up word are used as input to the decoding module of the speech wake-up model.
- the semantic tag sequence corresponding to the user-defined wake-up word is input into the RNN module of the decoding module (the RNN module on the right side of the attention mechanism module), and the target feature encoding sequence is input into the attention mechanism module of the decoding module; after being processed by the attention mechanism module, the RNN module, the fully connected module and the normalization module, the result of whether to wake up is output.
- the voice wake-up method shown in this embodiment is based on a trained voice wake-up model.
- the voice wake-up model adopts an encoder-decoder architecture, in which the decoding part of the model includes an attention mechanism module, which greatly improves the performance of custom voice wake-up.
- the voice wake-up solution shown in the embodiment of the present disclosure is based on a voice wake-up model including an attention mechanism module to determine whether to wake up the device.
- Table 1 and Table 2 are obtained.
- Table 1 shows the test statistics of a convolutional neural network-deep neural network (CNN-DNN) custom wake-up solution and the custom wake-up solution containing the attention mechanism.
- Table 2 shows the test statistics of customized wake-up and the custom wake-up solution containing the attention mechanism. It should be pointed out that customized wake-up means that the manufacturer presets the device wake-up word before the device leaves the factory and the wake-up word cannot be changed. Generally, the performance of customized wake-up is better than that of custom wake-up.
- the custom wake-up solution in this case improves accuracy under internal noise by 13.6% and reduces false alarms by more than 70%.
- internal noise refers to the noise generated by the equipment itself
- external noise refers to the total noise generated by the environment in which the equipment is located.
- External noise includes background noise and point noise.
- Background noise includes, for example, air conditioning noise and car noise, which are stationary noises; point noise is noise with a clear direction and is non-stationary noise.
- FIG. 8 is a schematic structural diagram of a voice wake-up model training device provided by an embodiment of the present disclosure.
- the voice wake-up model training device provided in this embodiment may be an electronic device or a device in an electronic device.
- the voice wake-up model training device 800 provided by the embodiment of the present disclosure may include:
- the first acquisition module 801 is used to acquire speech recognition training data
- the first training module 802 is used to perform speech recognition training on the basic model according to the speech recognition training data, and obtain the model parameters of the basic model when the model loss function converges;
- the basic model includes an encoding module and a decoding module;
- the model configuration module 803 is configured to respond to the model configuration instruction initiated by the user and update the configuration parameters of the decoding module in the basic model based on the model parameters of the basic model to obtain the first model;
- the second acquisition module 804 is used to acquire voice wake-up training data
- the second training module 805 is used to perform voice wake-up training on the first model according to the voice wake-up training data, and obtain the first model when the model loss function converges;
- the model generation module 806 is used to use the first model obtained when the model loss function converges as the voice wake-up model.
- the model configuration module 803 includes: a model parameter update submodule, configured to respond to a user-initiated model configuration instruction and, based on the model parameters of the basic model, update the configuration parameters of the fully connected sub-module and the normalization sub-module of the decoding module in the basic model to obtain the first model.
- the model parameter update submodule is further configured to update the output dimensions of the fully connected sub-module and the normalization sub-module of the decoding module in the basic model to two dimensions.
- the model loss function includes a CTC loss function and a CE loss function; the CTC loss function is used to train the encoding module of the basic model or of the first model; the CE loss function is used to train the decoding module of the basic model or of the first model.
- the first training module 802 includes:
- the first joint training sub-module is used to jointly train the encoding module and the decoding module of the basic model according to the speech recognition training data; when the CTC loss function corresponding to the encoding module and the CE loss function corresponding to the decoding module both converge, the model parameters of the basic model are obtained.
- the second training module 805 includes:
- the second joint training sub-module is used to jointly train the encoding module and the decoding module of the first model according to the voice wake-up training data; when the CTC loss function corresponding to the encoding module and the CE loss function corresponding to the decoding module both converge, the first model is obtained.
- the first acquisition module 801 includes:
- the first receiving sub-module is used to receive the first audio data input by the user, where the first audio data is audio data containing a custom wake-up word;
- the first feature extraction submodule is used to perform feature extraction on the first audio data to obtain the FBank features corresponding to the first audio data;
- the first acquisition sub-module is used to acquire the semantic tag sequence and syllable sequence corresponding to the first audio data
- the first creation sub-module is used to use the FBank features, semantic tag sequences and syllable sequences corresponding to the first audio data as a set of training data for the speech recognition training.
- the second acquisition module 804 includes:
- the second creation sub-module is used to use the FBank features, semantic tag sequence and wake-up tag corresponding to the first audio data input by the user as a set of positive example data for the voice wake-up training; the first audio data is audio data containing the custom wake-up word;
- the second receiving sub-module is used to receive the second audio data input by the user, perform feature extraction on the second audio data, and obtain the FBank features corresponding to the second audio data; the second audio data is audio data that does not include the custom wake-up word;
- the third creation sub-module is used to use the FBank features corresponding to the second audio data, the randomly generated semantic tag sequence and the non-wake-up tag as a set of negative example data for the voice wake-up training; the randomly generated semantic tag sequence is different from the semantic tag sequence corresponding to the second audio data.
- the speech wake-up model training device provided in this embodiment can be used to execute the model training method in the foregoing method embodiment. Its implementation principles and technical effects are similar and will not be described in detail here.
- FIG 9 is a schematic structural diagram of a voice wake-up device provided by an embodiment of the present disclosure.
- the voice wake-up device provided in this embodiment may be an electronic device or a device in an electronic device.
- the voice wake-up device 900 provided by the embodiment of the present disclosure may include:
- Feature extraction module 902 is used to extract features from the audio data to obtain the FBank features corresponding to the audio data;
- the first processing module 903 is configured to encode, based on the encoding module of the voice wake-up model, the FBank features corresponding to the audio data to obtain the feature encoding sequence corresponding to the audio data;
- the second processing module 904 is used to use CTC decoding to determine the target feature coding sequence with a score greater than or equal to the preset value in the feature coding sequence;
- the acquisition module 905 is used to obtain the semantic tag sequence corresponding to the user-defined wake-up word
- the third processing module 906 is configured to decode and analyze the target feature encoding sequence and the semantic label sequence based on the decoding module of the voice wake-up model, and determine whether to wake up the terminal device.
- the voice wake-up device provided in this embodiment can be used to perform the voice wake-up method in the foregoing method embodiment. Its implementation principles and technical effects are similar and will not be described in detail here.
- the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
- the present disclosure also provides a computer program product.
- the computer program product includes: a computer program.
- the computer program is stored in a readable storage medium.
- At least one processor of the electronic device can read the computer program from the readable storage medium, and the at least one processor executes the computer program so that the electronic device executes the solution provided by any of the above embodiments.
- FIG. 10 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
- Electronic devices are intended to refer to various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
- Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices.
- the components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit implementations of the disclosure described and/or claimed herein.
- the device 1000 includes a computing unit 1001, which can perform various appropriate actions and processing according to a computer program stored in a read-only memory (ROM) 1002 or loaded from a storage unit 1008 into a random access memory (RAM) 1003. In the RAM 1003, various programs and data required for the operation of the device 1000 can also be stored.
- Computing unit 1001, ROM 1002 and RAM 1003 are connected to each other via bus 1004.
- An input/output (I/O) interface 1005 is also connected to bus 1004.
- multiple components in the device 1000 are connected to the I/O interface 1005, including: an input unit 1006, such as a keyboard, mouse, etc.; an output unit 1007, such as various types of displays, speakers, etc.; a storage unit 1008, such as a magnetic disk, optical disk, etc.; and a communication unit 1009, such as a network card, modem, wireless communication transceiver, etc.
- the communication unit 1009 allows the device 1000 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.
- Computing unit 1001 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1001 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, digital signal processing processor (DSP), and any appropriate processor, controller, microcontroller, etc.
- the computing unit 1001 performs various methods and processes described above, such as the training method of the voice wake-up model or the voice wake-up method.
- the training method of the voice wake-up model or the voice wake-up method may be implemented as a computer software program, which is tangibly included in a machine-readable medium, such as the storage unit 1008.
- part or all of the computer program may be loaded and/or installed onto the device 1000 via the ROM 1002 and/or the communication unit 1009.
- when the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of the training method of the voice wake-up model or the voice wake-up method described above may be performed.
- the computing unit 1001 may be configured to perform the training method of the voice wake-up model or the voice wake-up method in any other suitable manner (eg, by means of firmware).
- Various implementations of the systems and techniques described above may be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOC), complex programmable logic devices (CPLD), computer hardware, firmware, software, and/or combinations thereof.
- These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special purpose or general purpose programmable processor, and which may receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
- Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing device, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
- the program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
- a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
- the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
- Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, devices or devices, or any suitable combination of the foregoing.
- machine-readable storage media would include one or more wire-based electrical connections, laptop disks, hard drives, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.
- the systems and techniques described herein may be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and pointing device (e.g., a mouse or a trackball) through which a user can provide input to the computer.
- Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, voice input, or tactile input.
- the systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user's computer having a graphical user interface or web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
- the components of the system may be interconnected by any form or medium of digital data communication (eg, a communications network). Examples of communication networks include: local area network (LAN), wide area network (WAN), and the Internet.
- Computer systems may include clients and servers. Clients and servers are generally remote from each other and typically interact over a communications network. The relationship of client and server is created by computer programs running on corresponding computers and having a client-server relationship with each other.
- the server can be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the drawbacks of difficult management and weak business scalability found in traditional physical hosts and VPS ("Virtual Private Server") services.
- the server can also be a distributed system server or a server combined with a blockchain.
Landscapes
- Engineering & Computer Science (AREA)
- Acoustics & Sound (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Electrically Operated Instructional Devices (AREA)
- Telephone Function (AREA)
Abstract
The present disclosure provides a method and apparatus for training a voice wake-up model, a voice wake-up method and apparatus, a device, and a storage medium, relating to the field of artificial intelligence, and in particular to deep learning and voice technology. A specific implementation scheme is as follows: created speech recognition training data and voice wake-up training data are obtained; a base model is first trained on the speech recognition training data to obtain the model parameters of the base model when the model loss function converges; configuration parameters of a decoding module in the base model are then updated based on a model configuration instruction to obtain a first model; the first model is then trained on the voice wake-up training data, and the trained voice wake-up model is obtained when the model loss function converges. The above scheme can accelerate the convergence of voice wake-up model training; processing and analyzing audio data based on the above voice wake-up model can improve recognition accuracy and reduce the false alarm rate.
Description
The present disclosure claims priority to the Chinese patent application No. 202210356735.2, filed with the China National Intellectual Property Administration on April 6, 2022 and entitled "语音唤醒模型的训练、唤醒方法、装置、设备及存储介质", the entire contents of which are incorporated herein by reference.
The present disclosure relates to the fields of deep learning and voice technology within the field of artificial intelligence, and in particular to a method and apparatus for training a voice wake-up model, a voice wake-up method and apparatus, a device, and a storage medium.
With the development of artificial intelligence, more and more electronic devices support voice interaction. Voice wake-up, as the switch of voice interaction, is an important part of it. At present, the voice wake-up function suffers from the problem that multiple devices are woken up at the same time; for example, devices of the same brand often share the same wake-up word, which leads to the awkward situation where one wake-up utterance causes multiple devices to respond.
Summary
The present disclosure provides a method and apparatus for training a voice wake-up model, a voice wake-up method and apparatus, a device, and a storage medium.
According to a first aspect of the present disclosure, there is provided a method for training a voice wake-up model, including:
obtaining speech recognition training data, and performing speech recognition training on a base model according to the speech recognition training data to obtain model parameters of the base model when a model loss function converges, where the base model includes an encoding module and a decoding module;
in response to a model configuration instruction initiated by a user, updating configuration parameters of the decoding module in the base model based on the model parameters of the base model to obtain a first model;
obtaining voice wake-up training data, and performing voice wake-up training on the first model according to the voice wake-up training data to obtain the first model when the model loss function converges;
using the first model obtained when the model loss function converges as the voice wake-up model.
According to a second aspect of the present disclosure, there is provided a voice wake-up method, including:
receiving audio data input by a user;
performing feature extraction on the audio data to obtain filter bank (FBank) features corresponding to the audio data;
encoding the FBank features corresponding to the audio data based on an encoding module of a voice wake-up model to obtain a feature coding sequence corresponding to the audio data; determining, by connectionist temporal classification (CTC) decoding, a target feature coding sequence whose score is greater than or equal to a preset value in the feature coding sequence;
obtaining a semantic label sequence corresponding to a user-defined wake-up word;
decoding and analyzing the target feature coding sequence and the semantic label sequence based on a decoding module of the voice wake-up model to determine whether to wake up the terminal device.
According to a third aspect of the present disclosure, there is provided an apparatus for training a voice wake-up model, including:
a first training module, configured to perform speech recognition training on a base model according to the speech recognition training data to obtain model parameters of the base model when a model loss function converges, where the base model includes an encoding module and a decoding module;
a model configuration module, configured to, in response to a model configuration instruction initiated by a user, update configuration parameters of the decoding module in the base model based on the model parameters of the base model to obtain a first model;
a second obtaining module, configured to obtain voice wake-up training data;
a second training module, configured to perform voice wake-up training on the first model according to the voice wake-up training data to obtain the first model when the model loss function converges;
a model generation module, configured to use the first model obtained when the model loss function converges as the voice wake-up model.
According to a fourth aspect of the present disclosure, there is provided a voice wake-up apparatus, including:
a receiving module, configured to receive audio data input by a user;
a feature extraction module, configured to perform feature extraction on the audio data to obtain FBank features corresponding to the audio data;
a first processing module, configured to encode the FBank features corresponding to the audio data based on an encoding module of a voice wake-up model to obtain a feature coding sequence corresponding to the audio data;
a second processing module, configured to determine, by CTC decoding, a target feature coding sequence whose score is greater than or equal to a preset value in the feature coding sequence;
an obtaining module, configured to obtain a semantic label sequence corresponding to a user-defined wake-up word;
a third processing module, configured to decode and analyze the target feature coding sequence and the semantic label sequence based on a decoding module of the voice wake-up model to determine whether to wake up a terminal device.
According to a fifth aspect of the present disclosure, there is provided an electronic device, including:
at least one processor; and
a memory communicatively connected to the at least one processor, where
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method described in the first aspect or the second aspect.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions, where the computer instructions are used to cause a computer to perform the method described in the first aspect or the second aspect.
According to a seventh aspect of the present disclosure, there is provided a computer program product, including a computer program stored in a readable storage medium; at least one processor of an electronic device can read the computer program from the readable storage medium, and the at least one processor executes the computer program so that the electronic device performs the method described in the first aspect or the second aspect.
The technology according to the present disclosure improves the accuracy of user-defined voice wake-up.
It should be understood that the content described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become easy to understand through the following description.
The accompanying drawings are used for a better understanding of the solution and do not constitute a limitation of the present disclosure, in which:
FIG. 1 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure;
FIG. 2 is a schematic structural diagram of a voice wake-up model provided by an embodiment of the present disclosure;
FIG. 3 is a schematic flowchart of a method for training a voice wake-up model provided by an embodiment of the present disclosure;
FIG. 4 is a schematic flowchart of creating speech recognition training data provided by an embodiment of the present disclosure;
FIG. 5 is a schematic flowchart of creating voice wake-up training data provided by an embodiment of the present disclosure;
FIG. 6 is a schematic flowchart of a voice wake-up method provided by an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of a voice wake-up model provided by an embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of an apparatus for training a voice wake-up model provided by an embodiment of the present disclosure;
FIG. 9 is a schematic structural diagram of a voice wake-up apparatus provided by an embodiment of the present disclosure;
FIG. 10 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, including various details of the embodiments of the present disclosure to facilitate understanding, which should be regarded as merely exemplary. Therefore, those of ordinary skill in the art should recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, descriptions of well-known functions and structures are omitted from the following description for clarity and conciseness.
The problem of multiple devices being woken up at the same time can be solved by user-defined wake-up words; for example, a user sets different user-defined wake-up words for different devices as needed. However, the accuracy of voice wake-up based on user-defined wake-up words is not high.
The present disclosure provides a method for training a voice wake-up model and a voice wake-up method, applied to the fields of deep learning and voice technology within the field of artificial intelligence, so as to improve the accuracy of user-defined voice wake-up.
To facilitate understanding of the technical solution provided by the present disclosure, an application scenario of an embodiment of the present disclosure is first described with reference to FIG. 1.
FIG. 1 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure. As shown in FIG. 1, the application scenario involves a model training phase and a model application phase.
In the model training phase, a model training device trains a first model based on a first training data set, where the first training data set includes training data for speech recognition and the first model has a speech recognition function. A second model is obtained by adjusting model configuration parameters of the first model. The model training device trains the second model based on a second training data set, where the second training data set includes training data for voice wake-up and the second model has a voice wake-up function. The trained second model is used as the final voice wake-up model.
It should be noted that the first model and the second model have the same model structure; the difference between them is that the parameter configuration of the model output part is different, for example, the dimension parameters are different.
It can be understood that the first model has a speech recognition function, and the output of the first model is syllable information corresponding to the audio data, for example, a toneless syllable sequence. The second model has a voice wake-up function, and the output of the second model is a binary result of whether to wake up. It follows that the dimension parameters of the model output parts of the first model and the second model are different.
In the model application phase, the trained second model, i.e., the voice wake-up model in FIG. 1, is preset in a voice wake-up device. The voice wake-up device receives audio data input by a user, and after preprocessing of the audio data and processing and analysis by the voice wake-up model, the result of whether to wake up is finally obtained.
In embodiments of the present disclosure, the voice wake-up model may adopt an encoder-decoder architecture, or another model structure including an encoder and a decoder, where the encoder may also be referred to as an encoding module and the decoder may also be referred to as a decoding module.
The internal structure of the voice wake-up model is described in detail below with reference to FIG. 2.
FIG. 2 is a schematic structural diagram of a voice wake-up model provided by an embodiment of the present disclosure. As shown in FIG. 2, the voice wake-up model provided by this embodiment includes an encoding module and a decoding module.
The encoding module includes convolutional neural network (CNN) modules and recurrent neural network (RNN) modules. FIG. 2 shows two CNN modules and two RNN modules; in the data processing flow of the encoding module, the data passes first through the two CNN modules and then through the two RNN modules. The encoding module is used to encode audio features to obtain encoded feature data.
The decoding module includes an attention mechanism module, RNN modules, a fully connected (full) module, and a normalization (softmax) module. FIG. 2 shows two RNN modules; in the data processing flow of the decoding module, one data path is fed into the attention mechanism module through an RNN module, and the other data path is fed into the attention mechanism module directly; after being processed by the attention mechanism module, the data passes through an RNN module, the fully connected module, and the softmax module, and finally the result of whether to wake up is output.
Optionally, in some embodiments, the above RNN modules may be replaced with long short-term memory (LSTM) modules.
It should be noted that the voice wake-up model takes two input data paths. The first input data is a feature sequence corresponding to the audio data, for example filter bank (FBank) features or Mel-frequency cepstral coefficient (MFCC) features. The second input data is a semantic label sequence corresponding to the audio data.
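The following is a minimal PyTorch sketch of the encoder-decoder structure described for FIG. 2, assuming the two-path input just mentioned (FBank features and a one-hot semantic label sequence). Layer widths, kernel sizes, the use of GRU cells, and the label vocabulary size are illustrative assumptions, not values taken from the disclosure.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        # Two CNN modules followed by two RNN modules, as in FIG. 2.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=(2, 2), padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=(2, 1), padding=1), nn.ReLU(),
        )
        self.rnn = nn.GRU(32 * (n_mels // 2), hidden, num_layers=2, batch_first=True)

    def forward(self, fbank):               # fbank: (batch, frames, n_mels)
        x = self.cnn(fbank.unsqueeze(1))    # (batch, 32, frames//4, n_mels//2)
        x = x.permute(0, 2, 1, 3).flatten(2)
        out, _ = self.rnn(x)                # feature coding sequence
        return out

class Decoder(nn.Module):
    def __init__(self, n_labels=10, hidden=256, n_out=2):
        super().__init__()
        self.label_rnn = nn.GRU(n_labels, hidden, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.out_rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.full = nn.Linear(hidden, n_out)          # fully connected module

    def forward(self, enc_seq, label_seq):
        # label_seq: (batch, len, n_labels) one-hot semantic label sequence.
        q, _ = self.label_rnn(label_seq)              # label path through an RNN
        ctx, _ = self.attn(q, enc_seq, enc_seq)       # attention over the encoded audio
        h, _ = self.out_rnn(ctx)
        return torch.softmax(self.full(h[:, -1]), dim=-1)   # wake / no-wake scores
```

For the speech-recognition pre-training stage, the same structure would expose a syllable-sized output instead of the two-class output; that swap is sketched under step 303 below.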
It should be noted that, in embodiments of the present disclosure, the model training device or the voice wake-up device may be a terminal device, a server, a virtual machine, or the like, or may be a distributed computer system composed of one or more servers and/or computers.
The terminal device includes but is not limited to a smartphone, a notebook computer, a desktop computer, a tablet computer, a vehicle-mounted device, a smart wearable device, and the like, which is not limited in the embodiments of the present disclosure. The server may be an ordinary server or a cloud server; a cloud server, also known as a cloud computing server or cloud host, is a host product in the cloud computing service system. The server may also be a server of a distributed system or a server combined with a blockchain.
In embodiments of the present disclosure, training data sets for two rounds of model training are first constructed, namely a training data set for speech recognition and a training data set for voice wake-up. Pre-training for speech recognition is first performed based on the base model and the speech recognition training data set; after the model configuration parameters are adjusted, a second round of training for voice wake-up is performed based on the pre-trained model and the voice wake-up training data set, finally producing a voice wake-up model that can recognize a user-defined wake-up word. The voice wake-up model may adopt an encoder-decoder structure and use an attention mechanism in the decoder part. Compared with existing wake-up solutions, the voice wake-up model obtained by the above training process improves the recognition accuracy of the user-defined voice wake-up model; while reducing the false alarm rate, the recognition accuracy approaches the level of customized (factory-preset) wake-up, and can meet the wake-up requirements of scenarios such as vehicles and smart homes.
The technical solution of the present disclosure is described in detail below through specific embodiments in combination with the application scenario shown in FIG. 1. It should be noted that the following specific embodiments may be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments.
FIG. 3 is a schematic flowchart of a method for training a voice wake-up model provided by an embodiment of the present disclosure. The method is explained with the model training device in FIG. 1 as the execution subject. As shown in FIG. 3, the method for training a voice wake-up model may include the following steps:
Step 301: Obtain speech recognition training data.
In this embodiment, the speech recognition training data includes FBank features, a semantic label sequence, and a syllable sequence corresponding to first audio data, where the first audio data is any piece of audio data input by a user that contains a user-defined wake-up word.
The FBank features corresponding to the first audio data are obtained by performing feature extraction on the first audio data. The semantic label sequence corresponding to the first audio data is used to indicate semantic information of the first audio data. The syllable sequence corresponding to the first audio data is a frame-level-aligned toneless syllable sequence.
Step 302: Perform speech recognition training on the base model according to the speech recognition training data to obtain model parameters of the base model when the model loss function converges.
In this embodiment, the FBank features and the semantic label sequence corresponding to the first audio data in the speech recognition training data are used as the input of the base model, and the syllable sequence corresponding to the first audio data is used as the output of the base model for model training; when the loss function of the base model converges, the model parameters at convergence are obtained. For the internal structure of the base model, reference may be made to FIG. 2, which is not repeated here.
Step 303: In response to a model configuration instruction initiated by the user, update the configuration parameters of the decoding module in the base model based on the model parameters of the base model to obtain a first model.
The configuration parameters include the output dimension of the model. It should be noted that other parameters of the model may be set and updated according to actual requirements, which is not limited in this embodiment.
As an example, updating the configuration parameters of the decoding module in the base model includes modifying the output dimension of the decoding module in the base model at the output end. Since the output of the voice wake-up task includes two results, wake-up and non-wake-up, the output dimension of the decoding module in the base model at the output end needs to be modified to two dimensions.
The above modification logic may be written into configuration information of the device; based on the device configuration information, the configuration parameters of the decoding module in the trained base model are updated on the basis of the trained base model.
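A hypothetical sketch of this step, building on the model sketch above: keep the pretrained parameters of the base model and only re-configure the output end of the decoding module for the two-class wake-up task. The attribute names (`decoder.full`) come from the earlier sketch and are assumptions, not identifiers from the disclosure.

```python
import torch.nn as nn

def configure_for_wakeup(base_model, n_out=2):
    """Turn the pretrained speech-recognition base model into the first model."""
    hidden = base_model.decoder.full.in_features
    # Replace the fully connected output (and hence the softmax output) with a
    # two-dimensional layer; all other pretrained parameters are kept as-is.
    base_model.decoder.full = nn.Linear(hidden, n_out)
    return base_model
```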
Step 304: Obtain voice wake-up training data.
In this embodiment, the voice wake-up training data includes positive example training data and negative example training data, where the positive example training data includes the FBank features, the semantic label sequence, and a wake-up label corresponding to the first audio data, and the negative example training data includes the FBank features corresponding to the first audio data, a randomly generated semantic label sequence, and a non-wake-up label.
The positive example training data of the voice wake-up training data is constructed based on the speech recognition training data; the difference from the speech recognition training data is that, in the positive example training data, the syllable sequence corresponding to the first audio data is replaced with a wake-up label. In the voice wake-up training phase, the first audio data is also referred to as positive example audio data.
The audio data corresponding to the negative example training data of the voice wake-up training data is likewise audio data input by the user (i.e., negative example audio data), except that this audio data does not contain the user-defined wake-up word. It should be pointed out that, for the negative example audio data, a semantic label sequence may be generated randomly, as long as it differs from the semantic label sequence corresponding to the negative example audio data.
Step 305: Perform voice wake-up training on the first model according to the voice wake-up training data to obtain the first model when the model loss function converges.
In this embodiment, the FBank features and the semantic label sequence corresponding to the first audio data in the voice wake-up training data are used as the input of the first model with the wake-up label as its output, and the FBank features corresponding to the first audio data together with the randomly generated semantic label sequence are used as the input of the first model with the non-wake-up label as its output. Model training is performed, and when the loss function of the first model converges, the model parameters at convergence are obtained. For the internal structure of the first model, reference may be made to FIG. 2, which is not repeated here.
It should be noted that the same loss functions may be used when training the base model and the first model. In terms of model structure, both the base model and the first model include an encoding module and a decoding module; the encoding module corresponds to one loss function and the decoding module corresponds to another, so both the base model and the first model involve two loss functions.
As an example, the loss function corresponding to the encoding module of the base model is the same as that of the encoding module of the first model, for example both use the connectionist temporal classification (CTC) loss function; the loss function corresponding to the decoding module of the base model is the same as that of the decoding module of the first model, for example both use the cross-entropy (CE) loss function.
Optionally, in some embodiments, the loss function corresponding to the encoding module of the base model may differ from that of the encoding module of the first model, and the loss function corresponding to the decoding module of the base model may differ from that of the decoding module of the first model.
Step 306: Use the first model obtained when the model loss function converges as the voice wake-up model.
According to the method for training a voice wake-up model shown in this embodiment, the created speech recognition training data and voice wake-up training data are obtained; speech recognition training is first performed on the base model according to the speech recognition training data to obtain the model parameters of the base model when the model loss function converges; then, based on a model configuration instruction initiated by the user, the configuration parameters of the decoding module in the base model are updated to obtain the first model; finally, voice wake-up training is performed on the first model according to the voice wake-up training data, and the first model obtained when the model loss function converges is used as the final voice wake-up model. Starting from the model parameters of speech recognition training and adjusting the configuration parameters of the decoding module before voice wake-up training, the above model training scheme can accelerate the convergence of voice wake-up model training, improve the recognition accuracy of the voice wake-up model, and reduce the false alarm rate.
Optionally, in some embodiments, in response to a model configuration instruction initiated by the user, updating the configuration parameters of the decoding module in the base model based on the model parameters of the base model to obtain the first model includes: in response to the model configuration instruction initiated by the user, updating, based on the model parameters of the base model, the configuration parameters of the fully connected sub-module and the softmax sub-module of the decoding module in the base model to obtain the first model. As an example, updating the configuration parameters of the fully connected sub-module and the softmax sub-module of the decoding module in the base model includes updating the output dimension of the fully connected sub-module and the softmax sub-module of the decoding module in the base model to two dimensions.
In this embodiment, by adjusting the model parameters of the base model, the model function is switched from speech recognition to voice wake-up, and the purpose of voice wake-up is achieved through the second round of model training, i.e., voice wake-up training.
Optionally, in some embodiments, the loss functions of the base model and of the first model both include the CTC loss function and the CE loss function, where
the CTC loss function is used to train the encoding module of the base model or of the first model; and
the CE loss function is used to train the decoding module of the base model or of the first model.
In an optional embodiment of this embodiment, performing speech recognition training on the base model according to the speech recognition training data to obtain the model parameters of the base model when the model loss function converges includes: jointly training the encoding module and the decoding module of the base model according to the speech recognition training data; and obtaining the model parameters of the base model when both the CTC loss function corresponding to the encoding module and the CE loss function corresponding to the decoding module converge.
In an optional embodiment of this embodiment, performing voice wake-up training on the first model according to the voice wake-up training data to obtain the first model when the model loss function converges includes: jointly training the encoding module and the decoding module of the first model according to the voice wake-up training data; and obtaining the first model when both the CTC loss function corresponding to the encoding module and the CE loss function corresponding to the decoding module converge.
In this embodiment, model training is performed on the base model and the first model respectively based on the above two loss functions, which can accelerate the convergence of model training.
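An illustrative joint-training step combining a CTC loss on the encoding module's output with a CE loss on the decoding module's output, in line with the joint training just described. The syllable-posterior head `ctc_proj`, the equal loss weighting, and the batch layout are assumptions made for the sketch, not values prescribed by the disclosure; the `encoder`/`decoder` objects are the ones from the earlier model sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

def train_step(encoder, decoder, ctc_proj, optimizer, batch, ctc_weight=0.5):
    # fbank: (B, T, n_mels); syllables: padded targets; target: class index per utterance.
    fbank, enc_lens, label_seq, syllables, syl_lens, target = batch
    enc = encoder(fbank)                                        # (B, T', H)
    # CTC branch: frame-wise syllable posteriors from the encoder output.
    log_probs = ctc_proj(enc).log_softmax(-1).transpose(0, 1)   # (T', B, vocab)
    loss_ctc = ctc_loss(log_probs, syllables, enc_lens, syl_lens)
    # CE branch: decision from the attention decoder (probabilities -> NLL).
    probs = decoder(enc, label_seq)
    loss_ce = F.nll_loss(probs.clamp_min(1e-8).log(), target)
    loss = ctc_weight * loss_ctc + (1.0 - ctc_weight) * loss_ce
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The same step serves both rounds of training: in the first round `target` holds syllable-level outputs of the speech-recognition head, and in the second round it holds the wake-up/non-wake-up label.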
FIG. 4 is a schematic flowchart of creating speech recognition training data provided by an embodiment of the present disclosure. This process may be executed by the model training device in FIG. 1 or by another device independent of the model training device. As shown in FIG. 4, the process of creating speech recognition training data may include the following steps:
Step 401: Receive first audio data input by a user, where the first audio data is audio data containing a user-defined wake-up word.
Step 402: Perform feature extraction on the first audio data to obtain FBank features corresponding to the first audio data.
In this embodiment, performing feature extraction on the first audio data refers to performing feature extraction on the first audio data frame by frame. After framing, the first audio data is a time-domain signal; to extract FBank features, the time-domain signal first needs to be converted into a frequency-domain signal, which can be done through a Fourier transform. After the Fourier transform, a frequency-domain signal is obtained; the energy in each frequency band differs, and different phonemes have different energy spectra. The FBank features corresponding to the first audio data are then obtained through Mel filtering and a logarithm operation.
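A minimal FBank extraction sketch following step 402: framing, a Fourier transform into the frequency domain, Mel filtering, and a logarithm. The frame length, hop size, and number of Mel bands are assumed values chosen for illustration.

```python
import numpy as np
import librosa

def extract_fbank(wav, sr=16000, n_mels=80, frame_ms=25, hop_ms=10):
    # Power spectrogram per frame, projected onto a Mel filter bank.
    mel_spec = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=int(sr * frame_ms / 1000),
        hop_length=int(sr * hop_ms / 1000), n_mels=n_mels, power=2.0)
    # Log compression yields the (frames, n_mels) FBank feature matrix.
    return np.log(mel_spec + 1e-6).T
```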
Step 403: Obtain a semantic label sequence and a syllable sequence corresponding to the first audio data.
In this embodiment, semantic information is obtained by performing semantic recognition on the first audio data, and the semantic label sequence corresponding to the semantic information, i.e., the semantic label sequence corresponding to the first audio data, is obtained based on a preset label library.
The label library includes the number corresponding to each Chinese character, i.e., a correspondence between each Chinese character and a number; for example, "你" and "好" correspond to 8 and 9, respectively.
For example, after the semantic information corresponding to the audio data is determined, a number sequence corresponding to the semantic information, i.e., the semantic label sequence, can be generated based on the label library. For example, the semantic information obtained through semantic recognition is "小满，今天天气如何"; in the label library, "小" corresponds to 0, "满" to 7, "今" to 2, "天" to 1, "气" to 3, "如" to 5, and "何" to 6, so the corresponding semantic label sequence {0, 7, 2, 1, 1, 3, 5, 6} can be generated. In this example, the user-defined wake-up word is "小满".
In this embodiment, frame-level analysis is performed on the first audio data to obtain the syllable corresponding to each frame of the first audio data, and the syllable sequence corresponding to the first audio data is obtained based on the preset label library.
The label library includes the number corresponding to the syllable of each Chinese character, i.e., a correspondence between the syllable of each Chinese character and a number; for example, the syllables "ni" and "hao" correspond to 8 and 9, respectively.
For example, after the syllable corresponding to each frame of the audio data is determined, the syllable sequence corresponding to the audio data can be generated based on the label library. For example, the syllables corresponding to "小满" in the audio data are xiao xiao xiao man man; in the label library, the syllables "xiao" and "man" correspond to 0 and 7, respectively, so the corresponding syllable sequence {0, 0, 0, 7, 7} can be generated.
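An illustrative label library lookup for step 403: a character table turns the recognized text into a semantic label sequence, and a syllable table turns the frame-level syllables into a syllable sequence. The toy tables reuse only the mappings from the "小满" example above; a real label library would cover the full character and syllable vocabulary.

```python
CHAR_TABLE = {"小": 0, "满": 7, "今": 2, "天": 1, "气": 3, "如": 5, "何": 6}
SYLLABLE_TABLE = {"xiao": 0, "man": 7}

def semantic_label_sequence(text):
    # Characters not in the toy table (e.g., punctuation) are skipped.
    return [CHAR_TABLE[ch] for ch in text if ch in CHAR_TABLE]

def syllable_sequence(frame_syllables):
    return [SYLLABLE_TABLE[s] for s in frame_syllables]

# semantic_label_sequence("小满，今天天气如何")            -> [0, 7, 2, 1, 1, 3, 5, 6]
# syllable_sequence(["xiao", "xiao", "xiao", "man", "man"]) -> [0, 0, 0, 7, 7]
```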
It should be pointed out that step 402 and step 403 may be executed sequentially or simultaneously, which is not limited in this embodiment.
Step 404: Use the FBank features, the semantic label sequence, and the syllable sequence corresponding to the first audio data as a set of training data for speech recognition training.
In this embodiment, the FBank features corresponding to the first audio data are used as the input of the encoding module of the base model, the semantic label sequence corresponding to the first audio data is used as the input of the decoding module of the base model, and the syllable sequence corresponding to the first audio data is used as the output of the decoding module of the base model.
Optionally, in some embodiments, MFCC features corresponding to the first audio data may also be extracted, and the MFCC features, the semantic label sequence, and the syllable sequence corresponding to the first audio data may be used as a set of training data for speech recognition training.
MFCC features are obtained by further applying a discrete cosine transform (DCT) on top of FBank features; compared with FBank features, they are more discriminative but require more computation.
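A small sketch of this optional MFCC variant: apply a DCT on top of the FBank features from the earlier sketch and keep the first few coefficients. The number of retained coefficients (13) is an assumed, commonly used value rather than one specified by the disclosure.

```python
from scipy.fftpack import dct

def fbank_to_mfcc(fbank, n_mfcc=13):
    # fbank: (frames, n_mels) log-Mel features; DCT along the Mel axis.
    return dct(fbank, type=2, axis=-1, norm="ortho")[:, :n_mfcc]
```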
The speech recognition training data shown in this embodiment includes multiple sets of training data, and each set of training data includes user-defined audio data, the semantic label sequence corresponding to the audio data, and the syllable sequence. By training the base model on this data, the model can accurately recognize the user-defined audio data, improving the recognition effect on user-defined audio.
FIG. 5 is a schematic flowchart of creating voice wake-up training data provided by an embodiment of the present disclosure. This process may be executed by the model training device in FIG. 1 or by another device independent of the model training device. As shown in FIG. 5, the process of creating voice wake-up training data may include the following steps:
Step 501: Use the FBank features, the semantic label sequence, and a wake-up label corresponding to the first audio data input by the user as a set of positive example data for voice wake-up training. The first audio data is audio data containing a user-defined wake-up word.
In this embodiment, the positive example data of the voice wake-up training data is created based on the existing data in the speech recognition training data; a set of positive example data can be obtained simply by replacing the syllable sequence corresponding to the first audio data with a wake-up label.
For any set of positive example data, the FBank features corresponding to the first audio data are used as the input of the encoding module of the first model, the semantic label sequence corresponding to the first audio data is used as the input of the decoding module of the first model, and the wake-up label is used as the output of the decoding module of the first model.
Step 502: Receive second audio data input by the user.
In this embodiment, the second audio data is audio data that does not include the user-defined wake-up word.
Step 503: Perform feature extraction on the second audio data to obtain FBank features corresponding to the second audio data. For the feature extraction method for the second audio data, reference may be made to step 402 of the above embodiment, which is not repeated here.
Step 504: Use the FBank features corresponding to the second audio data, a randomly generated semantic label sequence, and a non-wake-up label as a set of negative example data for voice wake-up training.
For any set of negative example data, the FBank features corresponding to the second audio data are used as the input of the encoding module of the first model, the randomly generated semantic label sequence is used as the input of the decoding module of the first model, and the non-wake-up label is used as the output of the decoding module of the first model.
The randomly generated semantic label sequence differs from the semantic label sequence corresponding to the second audio data. For example, negative example training data is constructed based on the example in step 403: the semantic information of the negative example audio data is "今天天气如何", which does not include the user-defined wake-up word "小满", and the semantic label sequence corresponding to this negative example audio data is {2, 1, 1, 3, 5, 6}. The randomly generated semantic label sequence only needs to differ from {2, 1, 1, 3, 5, 6}.
It should be understood that the semantic label sequence corresponding to the semantic information of the positive example audio data, for example {0, 7, 2, 1, 1, 3, 5, 6} in the example of step 403, may also be used as the semantic label sequence in a set of negative example training data constructed based on the negative example audio data.
Optionally, in some embodiments, MFCC features corresponding to the first audio data/second audio data may also be extracted. The MFCC features, the semantic label sequence, and the wake-up label corresponding to the first audio data are used as a set of positive example data for voice wake-up training, and the MFCC features corresponding to the second audio data, a randomly generated semantic label sequence, and a non-wake-up label are used as a set of negative example data for voice wake-up training.
The voice wake-up training data shown in this embodiment includes multiple sets of positive example data and multiple sets of negative example data; each set of positive example data includes user-defined audio data, the semantic label sequence corresponding to that audio data, and a wake-up label, and each set of negative example data includes audio data that does not contain the user-defined wake-up word, a randomly generated semantic label sequence, and a non-wake-up label. The first model is trained based on the voice wake-up training data so that the model can accurately determine whether to wake up the device, improving the wake-up recognition effect on user-defined audio.
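A hypothetical construction of one positive and one negative training example, mirroring steps 501-504. The `WAKE`/`NO_WAKE` values, the dictionary layout, and the re-draw strategy for the random label sequence are assumptions made for the sketch.

```python
import random

WAKE, NO_WAKE = 1, 0

def positive_example(first_audio_fbank, wakeword_labels):
    # First audio data contains the user-defined wake-up word; keep its true labels.
    return {"fbank": first_audio_fbank, "labels": wakeword_labels, "target": WAKE}

def negative_example(second_audio_fbank, true_labels, label_vocab_size=10, length=8):
    # Draw a random semantic label sequence and re-draw until it differs from the
    # sequence that actually matches the negative audio.
    rand = list(true_labels)
    while rand == list(true_labels):
        rand = [random.randrange(label_vocab_size) for _ in range(length)]
    return {"fbank": second_audio_fbank, "labels": rand, "target": NO_WAKE}
```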
Based on the above embodiments, the trained voice wake-up model can be preset in a voice wake-up device, such as a smart speaker, a television, or a mobile phone, so that the device has the function of user-defined voice wake-up. The data processing flow of the voice wake-up device is described below with reference to FIG. 6.
FIG. 6 is a schematic flowchart of a voice wake-up method provided by an embodiment of the present disclosure. This process may be executed by the voice wake-up device in FIG. 1. As shown in FIG. 6, the voice wake-up method may include the following steps:
Step 601: Receive audio data input by a user.
Step 602: Perform feature extraction on the audio data to obtain FBank features corresponding to the audio data. For the feature extraction method for the audio data, reference may be made to step 402 of the above embodiment, which is not repeated here.
Step 603: Encode the FBank features corresponding to the audio data based on the encoding module of the voice wake-up model to obtain a feature coding sequence corresponding to the audio data.
In this embodiment, the FBank features corresponding to the audio data are used as the input of the encoding module of the voice wake-up model. Referring to FIG. 2, the FBank features corresponding to the audio data are fed into the bottom CNN module of the encoding module, and after processing by the two CNN modules and the two RNN modules, the feature coding sequence corresponding to the audio data is output.
Step 604: Determine, by CTC decoding, a target feature coding sequence whose score is greater than or equal to a preset value in the feature coding sequence.
In this embodiment, CTC decoding is based on a sliding window of preset length, for example a 2 s sliding window; decoding starts from the feature coding sequence corresponding to the start position of the sliding window, an audio segment whose decoding score is greater than or equal to a preset value is obtained, and the feature coding sequence corresponding to that audio segment, i.e., the target feature coding sequence, is used as the input of the decoding module of the voice wake-up model. The preset value may be set reasonably according to the actual application and is not specifically limited in this embodiment.
It should be pointed out that this step is not included in the training process of the voice wake-up model; the CTC decoding may be treated as a separate processing module. FIG. 7 is a schematic structural diagram of a voice wake-up model provided by an embodiment of the present disclosure. As shown in FIG. 7, the processing module for CTC decoding is placed between the encoding module and the decoding module of the voice wake-up model; after filtering by this processing module, the target feature coding sequence whose score is greater than or equal to the preset value is output to the attention mechanism module of the decoding module of the voice wake-up model.
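An illustrative sliding-window filter for step 604: score each window of the frame-wise CTC posteriors against the wake-up word's syllable sequence and return the first window whose score reaches the preset threshold. The scoring rule (mean log-probability of a greedy, in-order path over the wake-up syllables) is a deliberate simplification of full CTC prefix scoring; the window length, hop, and threshold are assumptions.

```python
import numpy as np

def find_target_window(log_probs, wake_syllables, win_frames=200, hop=20, threshold=-1.5):
    """log_probs: (frames, vocab) frame-wise CTC log-posteriors from the encoder."""
    for start in range(0, max(1, log_probs.shape[0] - win_frames + 1), hop):
        window = log_probs[start:start + win_frames]
        score, frame, ok = 0.0, 0, True
        for syl in wake_syllables:           # greedy, in-order pass over the wake-up syllables
            if frame >= window.shape[0]:
                ok = False
                break
            best = int(np.argmax(window[frame:, syl])) + frame
            score += window[best, syl]
            frame = best + 1
        if ok and score / len(wake_syllables) >= threshold:
            return start, start + win_frames  # span of the target feature coding sequence
    return None
```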
Step 605: Obtain the semantic label sequence corresponding to the user-defined wake-up word.
The semantic label sequence corresponding to the user-defined wake-up word is stored in the voice wake-up device in advance; for its obtaining method, reference may be made to step 403 of the above embodiment.
Step 606: Decode and analyze the target feature coding sequence and the semantic label sequence based on the decoding module of the voice wake-up model to determine whether to wake up the terminal device.
In this embodiment, the target feature coding sequence and the semantic label sequence corresponding to the user-defined wake-up word are used as the input of the decoding module of the voice wake-up model. Referring to FIG. 2, the semantic label sequence corresponding to the user-defined wake-up word is fed into the RNN module of the decoding module (the RNN module to the right of the attention mechanism module), and the target feature coding sequence is fed into the attention mechanism module of the decoding module; after processing by the attention mechanism module, the RNN module, the fully connected module, and the softmax module, the result of whether to wake up is output.
The voice wake-up method shown in this embodiment is based on the trained voice wake-up model, which adopts an encoder-decoder architecture in which the decoding part of the model includes an attention mechanism module, greatly improving the performance of user-defined voice wake-up.
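A hedged end-to-end sketch tying the previous fragments together for FIG. 6: feature extraction, encoding, CTC windowing, and the attention decoder's wake/no-wake decision. All helper names (`extract_fbank`, `find_target_window`, `ctc_proj`, the label vocabulary size of 10) come from the earlier sketches and are assumptions, as is the 0.5 decision threshold.

```python
import torch
import torch.nn.functional as F

def wake_up(wav, encoder, decoder, ctc_proj, wake_syllables, wake_labels):
    fbank = torch.from_numpy(extract_fbank(wav)).unsqueeze(0).float()     # step 602
    with torch.no_grad():
        enc = encoder(fbank)                                              # step 603
        log_probs = ctc_proj(enc)[0].log_softmax(-1).numpy()
        span = find_target_window(log_probs, wake_syllables)              # step 604
        if span is None:
            return False                                                  # nothing scored above threshold
        target_enc = enc[:, span[0]:span[1]]
        labels = F.one_hot(torch.tensor(wake_labels),
                           num_classes=10).float().unsqueeze(0)           # step 605
        probs = decoder(target_enc, labels)                               # step 606
    return bool(probs[0, 1] > 0.5)                                        # wake / no-wake
```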
The voice wake-up solution shown in the embodiments of the present disclosure determines whether to wake up a device based on a voice wake-up model containing an attention mechanism module. Tables 1 and 2 are obtained through experimental testing, where Table 1 shows test statistics of CNN-DNN (convolutional neural network-deep neural network) user-defined wake-up versus user-defined wake-up containing an attention mechanism, and Table 2 shows test statistics of customized (factory-preset) wake-up versus user-defined wake-up containing an attention mechanism. It should be pointed out that customized wake-up means that the manufacturer presets the device wake-up word before the device leaves the factory and the wake-up word cannot be changed; customized wake-up performance is usually better than user-defined wake-up.
Table 1
Table 2
As can be seen from Table 1, compared with the CNN-DNN solution, the user-defined wake-up solution of the present application improves internal-noise accuracy by 13.6% while reducing false alarms by more than 70%.
As can be seen from Table 2, compared with the customized wake-up solution, the user-defined wake-up solution of the present application has internal-noise accuracy lower by 0.1% and external-noise accuracy lower by 2.9% with essentially the same false alarm rate. The accuracy of the user-defined wake-up solution of the present application approaches the level of customized wake-up.
It should be noted that internal noise refers to noise generated by the device itself, and external noise refers to the sum of noise generated by the environment in which the device is located. External noise includes floor noise and point noise; floor noise, such as air-conditioning noise and vehicle noise, is stationary noise, while point noise is noise with a definite direction and is non-stationary noise.
FIG. 8 is a schematic structural diagram of an apparatus for training a voice wake-up model provided by an embodiment of the present disclosure. The apparatus provided by this embodiment may be an electronic device or a device in an electronic device. As shown in FIG. 8, the apparatus 800 for training a voice wake-up model provided by an embodiment of the present disclosure may include:
a first obtaining module 801, configured to obtain speech recognition training data;
a first training module 802, configured to perform speech recognition training on a base model according to the speech recognition training data to obtain model parameters of the base model when a model loss function converges, where the base model includes an encoding module and a decoding module;
a model configuration module 803, configured to, in response to a model configuration instruction initiated by a user, update configuration parameters of the decoding module in the base model based on the model parameters of the base model to obtain a first model;
a second obtaining module 804, configured to obtain voice wake-up training data;
a second training module 805, configured to perform voice wake-up training on the first model according to the voice wake-up training data to obtain the first model when the model loss function converges;
a model generation module 806, configured to use the first model obtained when the model loss function converges as the voice wake-up model.
In an optional embodiment of this embodiment, the model configuration module 803 includes: a model parameter update sub-module, configured to, in response to a model configuration instruction initiated by the user, update, based on the model parameters of the base model, the configuration parameters of the fully connected sub-module and the softmax sub-module of the decoding module in the base model to obtain the first model.
In an optional embodiment of this embodiment, the model configuration module 803 includes: a model parameter update sub-module, configured to, in response to a model configuration instruction initiated by the user, update, based on the model parameters of the base model, the output dimension of the fully connected sub-module and the softmax sub-module of the decoding module in the base model to two dimensions.
In an optional embodiment of this embodiment, the model loss function includes a CTC loss function and a CE loss function; the CTC loss function is used to train the encoding module of the base model or of the first model; and the CE loss function is used to train the decoding module of the base model or of the first model.
In an optional embodiment of this embodiment, the first training module 802 includes:
a first joint training sub-module, configured to jointly train the encoding module and the decoding module of the base model according to the speech recognition training data, and obtain the model parameters of the base model when both the CTC loss function corresponding to the encoding module and the CE loss function corresponding to the decoding module converge.
In an optional embodiment of this embodiment, the second training module 805 includes:
a second joint training sub-module, configured to jointly train the encoding module and the decoding module of the first model according to the voice wake-up training data, and obtain the first model when both the CTC loss function corresponding to the encoding module and the CE loss function corresponding to the decoding module converge.
In an optional embodiment of this embodiment, the first obtaining module 801 includes:
a first receiving sub-module, configured to receive first audio data input by a user, where the first audio data is audio data containing a user-defined wake-up word;
a first feature extraction sub-module, configured to perform feature extraction on the first audio data to obtain FBank features corresponding to the first audio data;
a first obtaining sub-module, configured to obtain a semantic label sequence and a syllable sequence corresponding to the first audio data;
a first creation sub-module, configured to use the FBank features, the semantic label sequence, and the syllable sequence corresponding to the first audio data as a set of training data for the speech recognition training.
In an optional embodiment of this embodiment, the second obtaining module 804 includes:
a second creation sub-module, configured to use the FBank features, the semantic label sequence, and a wake-up label corresponding to first audio data input by the user as a set of positive example data for the voice wake-up training, where the first audio data is audio data containing a user-defined wake-up word;
a second receiving sub-module, configured to receive second audio data input by the user and perform feature extraction on the second audio data to obtain FBank features corresponding to the second audio data, where the second audio data is audio data that does not contain the user-defined wake-up word;
a third creation sub-module, configured to use the FBank features corresponding to the second audio data, a randomly generated semantic label sequence, and a non-wake-up label as a set of negative example data for the voice wake-up training, where the randomly generated semantic label sequence differs from the semantic label sequence corresponding to the second audio data.
The apparatus for training a voice wake-up model provided by this embodiment can be used to perform the model training method in the foregoing method embodiments; its implementation principle and technical effect are similar and are not repeated here.
FIG. 9 is a schematic structural diagram of a voice wake-up apparatus provided by an embodiment of the present disclosure. The voice wake-up apparatus provided by this embodiment may be an electronic device or a device in an electronic device. As shown in FIG. 9, the voice wake-up apparatus 900 provided by an embodiment of the present disclosure may include:
a receiving module 901, configured to receive audio data input by a user;
a feature extraction module 902, configured to perform feature extraction on the audio data to obtain FBank features corresponding to the audio data;
a first processing module 903, configured to encode the FBank features corresponding to the audio data based on an encoding module of a voice wake-up model to obtain a feature coding sequence corresponding to the audio data;
a second processing module 904, configured to determine, by CTC decoding, a target feature coding sequence whose score is greater than or equal to a preset value in the feature coding sequence;
an obtaining module 905, configured to obtain a semantic label sequence corresponding to a user-defined wake-up word;
a third processing module 906, configured to decode and analyze the target feature coding sequence and the semantic label sequence based on a decoding module of the voice wake-up model to determine whether to wake up a terminal device.
The voice wake-up apparatus provided by this embodiment can be used to perform the voice wake-up method in the foregoing method embodiments; its implementation principle and technical effect are similar and are not repeated here.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
According to embodiments of the present disclosure, the present disclosure also provides a computer program product, including a computer program stored in a readable storage medium; at least one processor of an electronic device can read the computer program from the readable storage medium, and the at least one processor executes the computer program so that the electronic device performs the solution provided by any of the above embodiments.
FIG. 10 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely examples and are not intended to limit the implementations of the present disclosure described and/or claimed herein.
As shown in FIG. 10, the device 1000 includes a computing unit 1001, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a random access memory (RAM) 1003. Various programs and data required for the operation of the device 1000 can also be stored in the RAM 1003. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other through a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
Multiple components in the device 1000 are connected to the I/O interface 1005, including: an input unit 1006, such as a keyboard and a mouse; an output unit 1007, such as various types of displays and speakers; a storage unit 1008, such as a magnetic disk and an optical disc; and a communication unit 1009, such as a network card, a modem, and a wireless communication transceiver. The communication unit 1009 allows the device 1000 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
The computing unit 1001 may be various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 1001 include but are not limited to a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, etc. The computing unit 1001 performs the various methods and processes described above, such as the method for training a voice wake-up model or the voice wake-up method. For example, in some embodiments, the method for training a voice wake-up model or the voice wake-up method may be implemented as a computer software program tangibly contained in a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1000 via the ROM 1002 and/or the communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of the method for training a voice wake-up model or the voice wake-up method described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform the method for training a voice wake-up model or the voice wake-up method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above herein may be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOC), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing device, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include but is not limited to electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media would include electrical connections based on one or more wires, portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, voice input, or tactile input.
The systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user's computer having a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.
A computer system may include a client and a server. The client and the server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on corresponding computers and having a client-server relationship to each other. The server may be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the drawbacks of difficult management and weak business scalability found in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system or a server combined with a blockchain.
It should be understood that steps may be reordered, added, or deleted using the various forms of processes shown above. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solution disclosed in the present disclosure can be achieved, which is not limited herein.
The above specific implementations do not constitute a limitation on the protection scope of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modification, equivalent substitution, improvement, etc. made within the spirit and principles of the present disclosure shall be included within the protection scope of the present disclosure.
Claims (21)
- A method for training a voice wake-up model, comprising: obtaining speech recognition training data, and performing speech recognition training on a base model according to the speech recognition training data to obtain model parameters of the base model when a model loss function converges, wherein the base model comprises an encoding module and a decoding module; in response to a model configuration instruction initiated by a user, updating configuration parameters of the decoding module in the base model based on the model parameters of the base model to obtain a first model; obtaining voice wake-up training data, and performing voice wake-up training on the first model according to the voice wake-up training data to obtain the first model when the model loss function converges; and using the first model obtained when the model loss function converges as the voice wake-up model.
- The method according to claim 1, wherein the in response to a model configuration instruction initiated by a user, updating configuration parameters of the decoding module in the base model based on the model parameters of the base model to obtain a first model comprises: in response to the model configuration instruction initiated by the user, updating, based on the model parameters of the base model, configuration parameters of a fully connected sub-module and a softmax sub-module of the decoding module in the base model to obtain the first model.
- The method according to claim 1 or 2, wherein the in response to a model configuration instruction initiated by a user, updating configuration parameters of the decoding module in the base model based on the model parameters of the base model to obtain a first model comprises: in response to the model configuration instruction initiated by the user, updating, based on the model parameters of the base model, the output dimension of the fully connected sub-module and the softmax sub-module of the decoding module in the base model to two dimensions.
- The method according to any one of claims 1 to 3, wherein the model loss function comprises a connectionist temporal classification (CTC) loss function and a cross-entropy (CE) loss function; the CTC loss function is used to train the encoding module of the base model or of the first model; and the CE loss function is used to train the decoding module of the base model or of the first model.
- The method according to any one of claims 1 to 4, wherein the performing speech recognition training on a base model according to the speech recognition training data to obtain model parameters of the base model when a model loss function converges comprises: jointly training the encoding module and the decoding module of the base model according to the speech recognition training data; and obtaining the model parameters of the base model when both the CTC loss function corresponding to the encoding module and the CE loss function corresponding to the decoding module converge.
- The method according to any one of claims 1 to 4, wherein the performing voice wake-up training on the first model according to the voice wake-up training data to obtain the first model when the model loss function converges comprises: jointly training the encoding module and the decoding module of the first model according to the voice wake-up training data; and obtaining the first model when both the CTC loss function corresponding to the encoding module and the CE loss function corresponding to the decoding module converge.
- The method according to any one of claims 1 to 6, wherein the obtaining speech recognition training data comprises: receiving first audio data input by a user, wherein the first audio data is audio data containing a user-defined wake-up word; performing feature extraction on the first audio data to obtain filter bank (FBank) features corresponding to the first audio data; obtaining a semantic label sequence and a syllable sequence corresponding to the first audio data; and using the FBank features, the semantic label sequence, and the syllable sequence corresponding to the first audio data as a set of training data for the speech recognition training.
- The method according to any one of claims 1 to 6, wherein the obtaining voice wake-up training data comprises: using FBank features, a semantic label sequence, and a wake-up label corresponding to first audio data input by a user as a set of positive example data for the voice wake-up training, wherein the first audio data is audio data containing a user-defined wake-up word; receiving second audio data input by the user, and performing feature extraction on the second audio data to obtain FBank features corresponding to the second audio data, wherein the second audio data is audio data that does not contain the user-defined wake-up word; and using the FBank features corresponding to the second audio data, a randomly generated semantic label sequence, and a non-wake-up label as a set of negative example data for the voice wake-up training, wherein the randomly generated semantic label sequence differs from the semantic label sequence corresponding to the second audio data.
- A voice wake-up method, applied to a terminal device, the method comprising: receiving audio data input by a user; performing feature extraction on the audio data to obtain filter bank (FBank) features corresponding to the audio data; encoding the FBank features corresponding to the audio data based on an encoding module of a voice wake-up model to obtain a feature coding sequence corresponding to the audio data; determining, by connectionist temporal classification (CTC) decoding, a target feature coding sequence whose score is greater than or equal to a preset value in the feature coding sequence; obtaining a semantic label sequence corresponding to a user-defined wake-up word; and decoding and analyzing the target feature coding sequence and the semantic label sequence based on a decoding module of the voice wake-up model to determine whether to wake up the terminal device.
- An apparatus for training a voice wake-up model, comprising: a first obtaining module, configured to obtain speech recognition training data; a first training module, configured to perform speech recognition training on a base model according to the speech recognition training data to obtain model parameters of the base model when a model loss function converges, wherein the base model comprises an encoding module and a decoding module; a model configuration module, configured to, in response to a model configuration instruction initiated by a user, update configuration parameters of the decoding module in the base model based on the model parameters of the base model to obtain a first model; a second obtaining module, configured to obtain voice wake-up training data; a second training module, configured to perform voice wake-up training on the first model according to the voice wake-up training data to obtain the first model when the model loss function converges; and a model generation module, configured to use the first model obtained when the model loss function converges as the voice wake-up model.
- The apparatus according to claim 10, wherein the model configuration module comprises: a model parameter update sub-module, configured to, in response to a model configuration instruction initiated by the user, update, based on the model parameters of the base model, configuration parameters of a fully connected sub-module and a softmax sub-module of the decoding module in the base model to obtain the first model.
- The apparatus according to claim 10 or 11, wherein the model configuration module comprises: a model parameter update sub-module, configured to, in response to a model configuration instruction initiated by the user, update, based on the model parameters of the base model, the output dimension of the fully connected sub-module and the softmax sub-module of the decoding module in the base model to two dimensions.
- The apparatus according to any one of claims 10 to 12, wherein the model loss function comprises a connectionist temporal classification (CTC) loss function and a cross-entropy (CE) loss function; the CTC loss function is used to train the encoding module of the base model or of the first model; and the CE loss function is used to train the decoding module of the base model or of the first model.
- The apparatus according to any one of claims 10 to 13, wherein the first training module comprises: a first joint training sub-module, configured to jointly train the encoding module and the decoding module of the base model according to the speech recognition training data, and obtain the model parameters of the base model when both the CTC loss function corresponding to the encoding module and the CE loss function corresponding to the decoding module converge.
- The apparatus according to any one of claims 10 to 13, wherein the second training module comprises: a second joint training sub-module, configured to jointly train the encoding module and the decoding module of the first model according to the voice wake-up training data, and obtain the first model when both the CTC loss function corresponding to the encoding module and the CE loss function corresponding to the decoding module converge.
- The apparatus according to any one of claims 10 to 15, wherein the first obtaining module comprises: a first receiving sub-module, configured to receive first audio data input by a user, wherein the first audio data is audio data containing a user-defined wake-up word; a first feature extraction sub-module, configured to perform feature extraction on the first audio data to obtain filter bank (FBank) features corresponding to the first audio data; a first obtaining sub-module, configured to obtain a semantic label sequence and a syllable sequence corresponding to the first audio data; and a first creation sub-module, configured to use the FBank features, the semantic label sequence, and the syllable sequence corresponding to the first audio data as a set of training data for the speech recognition training.
- The apparatus according to any one of claims 10 to 15, wherein the second obtaining module comprises: a second creation sub-module, configured to use FBank features, a semantic label sequence, and a wake-up label corresponding to first audio data input by a user as a set of positive example data for the voice wake-up training, wherein the first audio data is audio data containing a user-defined wake-up word; a second receiving sub-module, configured to receive second audio data input by the user and perform feature extraction on the second audio data to obtain FBank features corresponding to the second audio data, wherein the second audio data is audio data that does not contain the user-defined wake-up word; and a third creation sub-module, configured to use the FBank features corresponding to the second audio data, a randomly generated semantic label sequence, and a non-wake-up label as a set of negative example data for the voice wake-up training, wherein the randomly generated semantic label sequence differs from the semantic label sequence corresponding to the second audio data.
- A voice wake-up apparatus, comprising: a receiving module, configured to receive audio data input by a user; a feature extraction module, configured to perform feature extraction on the audio data to obtain filter bank (FBank) features corresponding to the audio data; a first processing module, configured to encode the FBank features corresponding to the audio data based on an encoding module of a voice wake-up model to obtain a feature coding sequence corresponding to the audio data; a second processing module, configured to determine, by connectionist temporal classification (CTC) decoding, a target feature coding sequence whose score is greater than or equal to a preset value in the feature coding sequence; an obtaining module, configured to obtain a semantic label sequence corresponding to a user-defined wake-up word; and a third processing module, configured to decode and analyze the target feature coding sequence and the semantic label sequence based on a decoding module of the voice wake-up model to determine whether to wake up a terminal device.
- An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1 to 8, or the method according to claim 9.
- A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1 to 8, or the method according to claim 9.
- A computer program product, comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 8, or the method according to claim 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/328,135 US20230317060A1 (en) | 2022-04-06 | 2023-06-02 | Method and apparatus for training voice wake-up model, method and apparatus for voice wake-up, device, and storage medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210356735.2A CN114842855A (zh) | 2022-04-06 | 2022-04-06 | 语音唤醒模型的训练、唤醒方法、装置、设备及存储介质 |
CN202210356735.2 | 2022-04-06 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US18/328,135 Continuation US20230317060A1 (en) | 2022-04-06 | 2023-06-02 | Method and apparatus for training voice wake-up model, method and apparatus for voice wake-up, device, and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023193394A1 true WO2023193394A1 (zh) | 2023-10-12 |
Family
ID=82564796
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/115175 WO2023193394A1 (zh) | 2022-04-06 | 2022-08-26 | 语音唤醒模型的训练、唤醒方法、装置、设备及存储介质 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN114842855A (zh) |
WO (1) | WO2023193394A1 (zh) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114842855A (zh) * | 2022-04-06 | 2022-08-02 | 北京百度网讯科技有限公司 | 语音唤醒模型的训练、唤醒方法、装置、设备及存储介质 |
CN115132210B (zh) * | 2022-09-02 | 2022-11-18 | 北京百度网讯科技有限公司 | 音频识别方法、音频识别模型的训练方法、装置和设备 |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107134279A (zh) * | 2017-06-30 | 2017-09-05 | 百度在线网络技术(北京)有限公司 | 一种语音唤醒方法、装置、终端和存储介质 |
CN112185382A (zh) * | 2020-09-30 | 2021-01-05 | 北京猎户星空科技有限公司 | 一种唤醒模型的生成和更新方法、装置、设备及介质 |
CN113096647A (zh) * | 2021-04-08 | 2021-07-09 | 北京声智科技有限公司 | 语音模型训练方法、装置和电子设备 |
CN113963688A (zh) * | 2021-12-23 | 2022-01-21 | 深圳市友杰智新科技有限公司 | 语音唤醒模型的训练方法、唤醒词的检测方法和相关设备 |
CN114242065A (zh) * | 2021-12-31 | 2022-03-25 | 科大讯飞股份有限公司 | 语音唤醒方法及装置、语音唤醒模块的训练方法及装置 |
CN114842855A (zh) * | 2022-04-06 | 2022-08-02 | 北京百度网讯科技有限公司 | 语音唤醒模型的训练、唤醒方法、装置、设备及存储介质 |
- 2022-04-06 CN CN202210356735.2A patent/CN114842855A/zh active Pending
- 2022-08-26 WO PCT/CN2022/115175 patent/WO2023193394A1/zh unknown
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107134279A (zh) * | 2017-06-30 | 2017-09-05 | 百度在线网络技术(北京)有限公司 | 一种语音唤醒方法、装置、终端和存储介质 |
CN112185382A (zh) * | 2020-09-30 | 2021-01-05 | 北京猎户星空科技有限公司 | 一种唤醒模型的生成和更新方法、装置、设备及介质 |
CN113096647A (zh) * | 2021-04-08 | 2021-07-09 | 北京声智科技有限公司 | 语音模型训练方法、装置和电子设备 |
CN113963688A (zh) * | 2021-12-23 | 2022-01-21 | 深圳市友杰智新科技有限公司 | 语音唤醒模型的训练方法、唤醒词的检测方法和相关设备 |
CN114242065A (zh) * | 2021-12-31 | 2022-03-25 | 科大讯飞股份有限公司 | 语音唤醒方法及装置、语音唤醒模块的训练方法及装置 |
CN114842855A (zh) * | 2022-04-06 | 2022-08-02 | 北京百度网讯科技有限公司 | 语音唤醒模型的训练、唤醒方法、装置、设备及存储介质 |
Also Published As
Publication number | Publication date |
---|---|
CN114842855A (zh) | 2022-08-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11514891B2 (en) | Named entity recognition method, named entity recognition equipment and medium | |
US11915699B2 (en) | Account association with device | |
WO2021093449A1 (zh) | 基于人工智能的唤醒词检测方法、装置、设备及介质 | |
US20200402500A1 (en) | Method and device for generating speech recognition model and storage medium | |
US10699699B2 (en) | Constructing speech decoding network for numeric speech recognition | |
WO2018227781A1 (zh) | 语音识别方法、装置、计算机设备及存储介质 | |
WO2023193394A1 (zh) | 语音唤醒模型的训练、唤醒方法、装置、设备及存储介质 | |
US20180012593A1 (en) | Keyword detection modeling using contextual information | |
US10204619B2 (en) | Speech recognition using associative mapping | |
TW201935464A (zh) | 基於記憶性瓶頸特徵的聲紋識別的方法及裝置 | |
CN112259089B (zh) | 语音识别方法及装置 | |
CN112509555B (zh) | 方言语音识别方法、装置、介质及电子设备 | |
US20230127787A1 (en) | Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium | |
CN110992940B (zh) | 语音交互的方法、装置、设备和计算机可读存储介质 | |
CN113674746B (zh) | 人机交互方法、装置、设备以及存储介质 | |
WO2017166625A1 (zh) | 用于语音识别的声学模型训练方法、装置和电子设备 | |
CN114127849A (zh) | 语音情感识别方法和装置 | |
CN113611316A (zh) | 人机交互方法、装置、设备以及存储介质 | |
CN114242093A (zh) | 语音音色转换方法、装置、计算机设备和存储介质 | |
CN114678032B (zh) | 一种训练方法、语音转换方法及装置和电子设备 | |
CN114121022A (zh) | 语音唤醒方法、装置、电子设备以及存储介质 | |
US20230317060A1 (en) | Method and apparatus for training voice wake-up model, method and apparatus for voice wake-up, device, and storage medium | |
CN112382296A (zh) | 一种声纹遥控无线音频设备的方法和装置 | |
Kusumah et al. | Hybrid automatic speech recognition model for speech-to-text application in smartphones | |
CN118824236A (zh) | 发音准确度确定方法、装置、计算机设备及存储介质 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22871156 Country of ref document: EP Kind code of ref document: A1 |