CN117174082A - Training and executing method, device, equipment and storage medium of voice wake-up model - Google Patents
Training and executing method, device, equipment and storage medium of voice wake-up model
- Publication number: CN117174082A
- Application number: CN202311213749.XA
- Authority: CN (China)
- Prior art keywords: wake, word, branch, output, stop point
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The application provides a training method, an execution method, an apparatus, a device and a storage medium for a voice wake-up model. The voice wake-up model comprises a neural network structure, a wake-up word classification branch and a start-stop point judgment branch. The training method comprises the following steps: acquiring an audio sample and performing acoustic feature processing on it to obtain an acoustic feature frame; inputting the acoustic feature frame into the neural network structure to generate an output vector, wherein the output vector has a first attribute and/or a second attribute, the first attribute being whether a wake-up word is contained and the second attribute being the start point position and end point position of the wake-up word in the output vector; and training the wake-up word classification branch and the start-stop point judgment branch according to the output vector. The application can improve the detection accuracy of wake-up words.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular, to a training method, an execution method, an apparatus, a device, and a storage medium for a voice wake-up model.
Background
With the development of computer technology, more and more intelligent devices support voice interaction. For example, smart speakers, smart televisions and automobiles are widely equipped with voice interaction functions, allowing users to control and operate the device directly by voice, which is highly convenient. During a voice interaction, a wake-up word is typically detected as the beginning of a round of interaction; once the wake-up word is triggered, the device actively collects the user's voice instructions. To reduce accidental triggering caused by misrecognized wake-up words, many schemes add a second-level wake-up word check in the cloud on top of the primary check on the local device, which requires the wake-up word to be intercepted accurately.
In the conventional technology, common practices for intercepting wake words include:
First, a fixed-length segment of audio preceding the wake-up word trigger position is intercepted. However, because users speak at different speeds, the fixed length is generally set according to slow speech, so a quickly spoken wake-up word may cause a large amount of irrelevant audio to be intercepted, affecting the accuracy of the secondary wake-up verification.
Second, voice activity detection (Voice Activity Detection, VAD) is used to locate the starting point of the wake-up word. However, this method only works in relatively quiet environments; in noisy environments, VAD performance drops rapidly and the starting position of the wake-up word cannot be obtained reliably.
Accordingly, with the rapid development of voice interaction functions, there is a need for a voice wake-up technology that is accurate and capable of efficient operation.
Disclosure of Invention
The application aims to solve at least one of the technical problems in the prior art or related technologies. To this end, the application provides a training method, an execution method, an apparatus, a device and a storage medium for a voice wake-up model.
According to a first aspect of the present application, there is provided a training method of a voice wake model, wherein the voice wake model includes a neural network structure, a wake word classification branch and a start-stop point judgment branch, the training method including:
Acquiring an audio sample, and performing acoustic feature processing on the audio sample to obtain an acoustic feature frame;
inputting the acoustic feature frame into a neural network structure to generate an output vector, wherein the output vector has a first attribute and/or a second attribute, the first attribute is whether a wake-up word is contained or not, and the second attribute is a starting point position and an ending point position of the wake-up word in the output vector; and
training the wake-up word classification branch and the start-stop point judgment branch according to the output vector respectively, wherein the training comprises the following steps:
respectively inputting the output vector to a wake-up word classification branch and a start-stop point judgment branch, outputting the probability that the audio sample contains the wake-up word by the wake-up word classification branch, and outputting the start point position and the end point position of the wake-up word in the audio sample by the start-stop point judgment branch; and
and respectively adjusting the parameters of the wake-up word classification branch and the parameters of the start-stop point judgment branch by using the output result of the wake-up word classification branch and the output result of the start-stop point judgment branch, so as to obtain an updated wake-up word classification branch and an updated start-stop point judgment branch.
As one embodiment of the present application, the wake-up word classification branch includes at least a first fully-connected layer structure connected to the neural network structure. The output vector of the neural network structure carries first tag data for the wake-up word classification branch; the first tag data relates to the first attribute, characterizes whether the output vector contains the wake-up word, and is input to the first fully-connected layer structure as a first supervised learning target.
As one embodiment of the present application, adjusting parameters of the wake word classification branch using output results of the wake word classification branch includes: determining a first loss function of the wake-up word classification branch according to the first attribute of the output vector and the output result of the wake-up word classification branch; and adjusting parameters of the wake word classification branch based on the first loss function.
As one embodiment of the present application, adjusting parameters of the wake word classification branch based on the first loss function includes: and iteratively updating parameters of the wake-up word classification branches according to the calculated value of the first loss function until convergence to obtain updated wake-up word classification branches.
As one embodiment of the application, the first loss function of the wake word classification branch is cross entropy.
As an embodiment of the present application, the calculation formula of the first loss function is:

L₁ = −[ŷ·log(y) + (1 − ŷ)·log(1 − y)]

wherein ŷ is the wake-up word label: ŷ = 1 when the output vector contains the wake-up word, and ŷ = 0 when the output vector does not contain the wake-up word; y is the wake-up word probability output by the wake-up word classification branch, with y in the range [0, 1].
As one embodiment of the present application, the start-stop point judgment branch at least includes a second fully-connected layer structure connected to the neural network structure, and the output vector of the neural network structure includes second tag data for the start-stop point judgment branch, the second tag data relates to a second attribute, and the second tag data is used for characterizing a start point position and an end point position of the wake-up word in the output vector and is input to the second fully-connected layer structure as a second supervised learning target.
As one embodiment of the present application, adjusting the parameters of the start-stop point judgment branch using the output result of the start-stop point judgment branch includes: determining a second loss function of the start-stop point judgment branch according to the second attribute of the output vector and the output result of the start-stop point judgment branch; and adjusting parameters of the start and stop point judgment branch based on the second loss function.
As one embodiment of the present application, adjusting the parameters of the start-stop point judgment branch based on the second loss function includes: and iteratively updating parameters of the start and stop point judgment branch according to the calculated value of the second loss function until convergence to obtain an updated start and stop point judgment branch.
As an embodiment of the present application, the second loss function of the start-stop point judgment branch is a mean square error between the output start point position and end point position and the actual start point position and end point position.
As an embodiment of the present application, the calculation formula of the second loss function is:

L₂ = (s1 − k1)² + (s2 − k2)²

wherein L₂ is the mean square error value; s1 is the true start point position of the wake-up word in the output vector; s2 is the true end point position of the wake-up word in the output vector; k1 is the start point position output by the start-stop point judgment branch; and k2 is the end point position output by the start-stop point judgment branch.
As an embodiment of the present application, the training method further includes: and adjusting parameters of the neural network structure by utilizing the output result of the wake-up word classification branch and/or the output result of the start-stop point judgment branch so as to obtain an updated neural network structure.
As one embodiment of the present application, performing feature processing on the audio sample includes: performing endpoint detection on the wake-up word in the audio sample, and intercepting data frames according to a fixed-length time window from the start point of the wake-up word to the end point of the wake-up word, so as to obtain a plurality of data frames of the same duration, the plurality of data frames being in the form of one-dimensional data arrays; and extracting audio spectrum features from the one-dimensional arrays and outputting a two-dimensional feature array.
As one embodiment of the application, in response to the intercepted data frames including a plurality of the wake-up words, the data frame including the first wake-up word is retained.
As one embodiment of the present application, inputting the acoustic feature frame into the neural network structure to generate the output vector includes: inputting the two-dimensional feature array into a neural network structure, and generating a one-dimensional vector through a convolution algorithm.
According to a second aspect of the present application, there is also provided a method for executing a voice wake model, including:
Acquiring voice data to be detected;
inputting the voice data to be detected into a voice wake-up model trained by any of the training methods described above;
obtaining the output result of the wake-up word classification branch and the output result of the start-stop point judgment branch; and
and outputting the wake-up result of the voice wake-up model based on the output result of the wake-up word classification branch and/or the output result of the start-stop point judgment branch.
As an embodiment of the present application, the performing method further includes: and sending the output result of the wake-up word classification branch and/or the output result of the start and stop point judgment branch to a voice recognition cloud platform to perform secondary recognition of the wake-up word.
According to a third aspect of the present application, there is also provided a training device for a voice wake model, the voice wake model including a neural network structure, a wake word classification branch and a start-stop point judgment branch, the training device comprising:
the acquisition module is used for acquiring an audio sample and carrying out acoustic feature processing on the audio sample to obtain an acoustic feature frame;
the generating module is used for inputting the acoustic feature frame into the neural network structure to generate an output vector, wherein the output vector has a first attribute and/or a second attribute, the first attribute is whether a wake-up word is contained or not, and the second attribute is a starting point position and an ending point position of the wake-up word in the output vector; and
The training module trains the wake-up word classification branch and the start-stop point judgment branch respectively according to the output vector, including:
respectively inputting the output vector to a wake-up word classification branch and a start-stop point judgment branch, outputting the probability that the output vector contains the wake-up word by the wake-up word classification branch, and outputting the start point position and the end point position of the wake-up word in the output vector by the start-stop point judgment branch; and
and respectively adjusting the parameters of the wake-up word classification branch and the parameters of the start-stop point judgment branch by using the output result of the wake-up word classification branch and the output result of the start-stop point judgment branch so as to obtain an updated voice wake-up model.
According to a fourth aspect of the present application, there is also provided a voice wake-up device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing any of the training methods described above and/or any of the execution methods described above when executing the computer program.
According to a fifth aspect of the present application, there is also provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements any of the training methods described above and/or any of the execution methods described above.
According to a sixth aspect of the present application, there is also provided a computer program which, when executed by a processor, implements any of the training methods described above and/or any of the performing methods described above.
The training method of the voice wake-up model of the application adopts multitask training: a wake-up word classification branch and a start-stop point judgment branch are arranged in parallel at the end of the wake-up word detection network, and the wake-up word classification task and the wake-up word start-stop point regression task are trained simultaneously, so that the model outputs whether the audio sample contains the wake-up word and, at the same time, the start point position and end point position of the wake-up word in the audio sample. The voice wake-up model obtained by this training can therefore obtain the start point and end point of the wake-up word quickly and accurately, which helps to intercept the wake-up word in the audio frame more precisely. In addition, because the classification task and the start-stop point regression task are trained simultaneously, the effect of the neural network structure is enhanced; compared with a single training task, the accuracy of wake-up word detection can be further improved.
In addition, the voice wake-up execution method provided by the embodiment of the application is based on the trained voice wake-up model and can output the start point position and end point position of the wake-up word in the audio sample while outputting whether the audio sample contains the wake-up word, which significantly improves the accuracy of wake-up word detection and greatly improves voice wake-up performance. Because the application can intercept the wake-up word in the audio frame more accurately, it is particularly advantageous in scenarios requiring secondary verification in the cloud.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application; other drawings can be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a schematic structural diagram of a voice wake-up model according to an embodiment of the present application.
Fig. 2 is a flowchart of a training method of a voice wake-up model according to an embodiment of the present application.
Fig. 3 is a flowchart illustrating a training method of wake word classification branches in a voice wake model according to an embodiment of the present application.
Fig. 4 is a flowchart of a training method of a start-stop point judgment branch of a voice wake-up model according to an embodiment of the present application.
Fig. 5 is another flow chart of a training method of a voice wake-up model according to an embodiment of the present application.
Fig. 6 is a flowchart illustrating a method for executing a voice wake-up model according to an embodiment of the present application.
Fig. 7 is a schematic structural diagram of a training device for a voice wake-up model according to an embodiment of the present application.
Fig. 8 is a schematic structural diagram of a voice wake-up device according to an embodiment of the present application.
Detailed Description
The following describes the embodiments of the present application clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
The embodiment of the application provides a training method and an executing method of a voice awakening model, which are applied to the technical fields of machine learning and voice in the artificial intelligence field, and particularly can be applied to scenes needing voice interaction, such as intelligent sound box voice interaction, intelligent vehicle-mounted voice interaction, intelligent television voice interaction and the like, so as to improve the detection accuracy of awakening words.
In order to facilitate understanding of the technical scheme provided by the application, an exemplary application scenario of the embodiment of the application is described below.
The training method of the voice wake-up model provided by the embodiment of the application can be executed by the training device of the voice wake-up model provided by the embodiment of the application, and the training device of the voice wake-up model provided by the embodiment of the application can be a terminal device or a server. The training method of the voice wake-up model provided by the embodiment of the application can be applied to the terminal equipment, for example, the training method can be realized through a processor, an application program or a webpage in the terminal equipment, and the terminal equipment and the server have data communication, so that the embodiment of the application is not limited. The embodiment of the application does not limit the specific type of the terminal equipment, and for example, the terminal equipment can be a smart phone, a personal computer, a tablet personal computer, a wearable device, a vehicle-mounted terminal and the like. The embodiment of the application does not limit the types and the number of the servers, for example, the servers can be single independent servers or can be server clusters, and the embodiment of the application is only taken as an example and is not limited to the example.
After training the voice wake model, the voice wake model may be applied to a terminal device having a voice recognition function requirement to perform the voice wake function. For example, the voice wake-up model may be applied to terminal devices such as a smart speaker, a smart home device, a smart phone, a vehicle-mounted terminal, and a wearable device, which is not limited in the embodiment of the present application.
Fig. 1 is a schematic structural diagram of a voice wake-up model according to an embodiment of the present application. As shown in fig. 1, the voice wake-up model provided by the embodiment of the application includes an acoustic feature processing unit, a neural network structure, a wake-up word classification branch and a start-stop point judgment branch. The acoustic feature processing unit extracts acoustic features from the input audio signal and feeds them into the neural network structure. The output of the neural network structure serves as the input of both the wake-up word classification branch and the start-stop point judgment branch, and is used to simultaneously train the wake-up word / non-wake-up word classification task of the classification branch and the regression task of the judgment branch that calculates the start point and end point of the wake-up word. Advantageously, the two branches share the output of one neural network structure, which helps to save computation cost.
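The two-branch structure can be illustrated with a short sketch. The following is a minimal PyTorch-style model, assuming for illustration a 50×32 MFCC input, a small convolutional backbone and an embedding size of 64; the patent does not fix these layer sizes, so they are placeholders only.

```python
import torch
import torch.nn as nn

class WakeWordModel(nn.Module):
    """Shared backbone with a classification head and a start-stop head."""
    def __init__(self, n_frames=50, n_mfcc=32, embed_dim=64):
        super().__init__()
        # Backbone: conv/pool layers compress the 2-D feature array,
        # then a linear layer produces one shared 1-D embedding.
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(32 * (n_frames // 4) * (n_mfcc // 4), embed_dim),
        )
        # Wake-up word classification branch: probability of the wake word.
        self.cls_head = nn.Sequential(nn.Linear(embed_dim, 1), nn.Sigmoid())
        # Start-stop point judgment branch: (start, end) positions.
        self.reg_head = nn.Linear(embed_dim, 2)

    def forward(self, x):          # x: (batch, 1, n_frames, n_mfcc)
        z = self.backbone(x)       # shared output vector
        return self.cls_head(z), self.reg_head(z)
```

Because both heads read the same embedding z, the backbone is computed once per input, which is the cost saving noted above.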
Fig. 2 is a flowchart of a training method of a voice wake-up model according to an embodiment of the present application.
As shown in fig. 2, the training method of the voice wake model provided by the embodiment of the application includes the following steps:
step S201: and acquiring an audio sample, and performing acoustic feature processing on the audio sample to obtain an acoustic feature frame.
In this step, the audio samples include wake-up word samples and non-wake-up word samples. In some examples, the wake-up word samples may be manually recorded audio containing the wake-up word. The diversity and complexity of the audio samples can be further increased in post-processing, for example by adding noise or changing the position of the wake-up word in the audio, so as to enhance the training effect. The non-wake-up word samples may be audio files that do not contain the wake-up word, such as human speech in everyday scenes (for example audio from films or television series), or may be collected automatically by a computer.
Before training the voice wake model, the type of each audio sample can be marked by a machine recognition or manual recognition mode, namely whether the audio sample contains wake words or not is marked.
Meanwhile, because the training method of this embodiment includes a start-stop point judgment branch, before training the voice wake-up model, the start and stop points of the wake-up word in each wake-up word sample can be accurately marked by machine recognition or manual annotation. The start and stop points of the wake-up word are the starting time point (also called the start point) and the ending time point (also called the end point) of the wake-up word in the audio sample. It will be appreciated that, in some examples, non-wake-up word samples need not be labeled with a start point and an end point.
In this step, performing acoustic feature processing on the audio sample includes: for an input audio sample, intercepting data frames according to a fixed-length time window from the start point of the wake-up word until the end point of the wake-up word is reached, so as to obtain a plurality of data frames of the same duration from the audio sample. The plurality of data frames may take the form of one-dimensional data arrays. In some examples, the start point and end point of the wake-up word may be identified from the audio samples based on voice activity detection (VAD). In some examples, in response to the intercepted data frames including a plurality of the wake-up words, only the data frame including the first wake-up word is retained, thereby avoiding repeated computation, saving computing resources and increasing recognition speed.
Further, acoustic feature extraction is performed on the data frames of the one-dimensional array, and a two-dimensional feature array is output. In some examples, Mel-frequency cepstral coefficients (Mel-Frequency Cepstral Coefficients, MFCC) may be used for acoustic feature extraction and feature enhancement on the one-dimensional data frames, outputting a two-dimensional feature array. Specifically, in MFCC feature extraction, the audio data is first subjected to a fast Fourier transform (FFT), and the spectra of different frequency bands are then filtered and compressed according to the characteristics of human hearing to obtain the MFCC features. In other examples, filter bank features (Filter Bank, FBANK), power-normalized cepstral coefficients (Power-Normalized Cepstral Coefficients, PNCC) and the like may also be used for audio spectrum feature extraction, which is not particularly limited by the present application.
For example, for an input audio sample, starting from the beginning of the audio identified as containing the wake-up word, a data frame is intercepted through a time window every a seconds until the end of the audio containing the wake-up word. Based on this, m data frames of the same length can be intercepted from the audio sample and used, in the form of one-dimensional data arrays, as training samples input to the neural network. Further, MFCC may be used to perform feature extraction and feature enhancement on the input training samples, outputting n×m 32-bit two-dimensional feature arrays as acoustic feature frames for subsequent use.
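As a concrete illustration of this feature-processing step, the sketch below extracts MFCC features with the librosa library; the 16 kHz sample rate and 32 coefficients are assumptions chosen to match the 50×32 arrays of the detailed embodiment below, not values mandated by the application.

```python
import librosa
import numpy as np

def extract_mfcc(wav_path, sr=16000, n_mfcc=32):
    """Turn a one-dimensional audio frame into a 2-D MFCC feature array."""
    y, sr = librosa.load(wav_path, sr=sr)                   # 1-D sample array
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, t)
    return mfcc.T.astype(np.float32)                        # (t, n_mfcc)
```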
Step S202: and inputting the acoustic feature frame into a neural network structure to generate an output vector, wherein the output vector has a first attribute and/or a second attribute, the first attribute is whether a wake-up word is contained or not, and the second attribute is a starting point position and an ending point position of the wake-up word in the output vector.
In this step, the neural network structure may be the backbone structure of the wake-up word recognition network. The backbone structure may be constructed using a convolutional neural network and, in general, may consist of connected convolutional layers and pooling layers. The backbone structure may be a convolutional neural network (convolutional neural network, CNN), a recurrent neural network (recurrent neural network, RNN), an attention network or another type of recognition network structure, to which the present application is not particularly limited.
After the acoustic feature frame is input into the backbone structure, the backbone structure compresses it through a convolution algorithm to generate and output a one-dimensional vector. This one-dimensional vector serves as the common input of the subsequent wake-up word classification branch and start-stop point judgment branch.
It will be appreciated that, as mentioned in step S201, before training the voice wake-up model the audio samples include wake-up word samples and non-wake-up word samples, i.e., each sample is labeled as to whether it contains the wake-up word, and for the wake-up word samples the start point and end point of the wake-up word have been marked. Accordingly, in this step, the output vector of the neural network structure has a first attribute and/or a second attribute, where the first attribute is whether the wake-up word is contained, and the second attribute is the start point position and end point position of the wake-up word in the output vector.
Step S203: and training the wake-up word classification branch and the start-stop point judgment branch according to the output vector.
In this step, the wake-up word classification branch is used to train the classification task of judging whether a sample contains the wake-up word. During training, the wake-up word classification branch uses all audio samples. Specifically, its input is the one-dimensional vector data obtained after processing by the neural network structure, covering both wake-up word and non-wake-up word samples, and it outputs the probability that the sample is a wake-up word and the probability that it is not. It will be appreciated that the two probabilities sum to 1.
The start-stop point judgment branch is used to run the regression task of calculating the start point and end point of the wake-up word and to output the calculated start and stop points. During training, only samples containing the wake-up word are input to the start-stop point judgment branch. Specifically, its input is the one-dimensional vector data obtained after processing by the neural network structure, and it outputs the start point position and end point position of the wake-up word in the sample.
With continued reference to fig. 2, step S203 includes the steps of:
step S2031: respectively inputting output vectors of the neural network structure into a wake-up word classification branch and a start-stop point judgment branch, outputting the probability that an audio sample contains the wake-up word by the wake-up word classification branch, and outputting the start point position and the end point position of the wake-up word in the audio sample by the start-stop point judgment branch; and
step S2032: and respectively adjusting the parameters of the wake-up word classification branch and the parameters of the start-stop point judgment branch by using the output result of the wake-up word classification branch and the output result of the start-stop point judgment branch, so as to obtain an updated wake-up word classification branch and an updated start-stop point judgment branch.
In steps S2031 and S2032, taking the wake-up word classification branch as an example: in some examples it includes at least a first fully-connected layer structure connected to the neural network structure, and the output vector of the neural network structure includes first tag data for the wake-up word classification branch. The first tag data relates to the first attribute, characterizes whether the output vector contains the wake-up word, and is input to the first fully-connected layer structure as a first supervised learning target.
In some examples, referring to fig. 3, which is a flowchart of a training method of a wake word classification branch of a voice wake model according to an embodiment of the present application, the adjusting parameters of the wake word classification branch by using the output result of the wake word classification branch in step S2032 includes:
step S301: and determining a first loss function of the wake-up word classification branch according to the first attribute of the output vector and the output result of the wake-up word classification branch.
In this step, the first loss function of the wake word classification branch is cross entropy.
Step S302: parameters of the wake word classification branch are adjusted based on the first penalty function.
In this step, the parameters of the wake-up word classification branch are iteratively updated according to the calculated value of the first loss function until convergence is reached, so as to obtain an updated wake-up word classification branch.
In one example, the first loss function is calculated as:

L₁ = −[ŷ·log(y) + (1 − ŷ)·log(1 − y)]

wherein ŷ is the wake-up word label: ŷ = 1 when the output vector contains the wake-up word, and ŷ = 0 when it does not; y is the wake-up word probability output by the wake-up word classification branch, in the range [0, 1].
Based on this, the wake-up word probability calculated and output can be compared with the label indicating whether the audio sample contains the wake-up word. On the one hand this can be used to evaluate the training effect; on the other hand it can be used to adjust parameters to optimize the wake-up word classification branch, the branch after parameter optimization being taken as the updated wake-up word classification branch. In this embodiment, the parameters may be adjusted so that the wake-up word probability output by the classification branch approaches or equals the labeled wake-up word tag.
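A minimal sketch of this cross-entropy computation, written to mirror the formula above (a hand-rolled equivalent of torch.nn.functional.binary_cross_entropy):

```python
import torch

def wake_word_loss(y, y_hat):
    """L1 = -(ŷ·log y + (1-ŷ)·log(1-y)), averaged over the batch.

    y:     predicted wake-word probabilities in [0, 1]
    y_hat: labels, 1.0 if the sample contains the wake word else 0.0
    """
    eps = 1e-7
    y = y.clamp(eps, 1 - eps)        # keep log() finite at 0 and 1
    return -(y_hat * torch.log(y) + (1 - y_hat) * torch.log(1 - y)).mean()
```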
In steps S2031 and S2032, taking the start-stop point judgment branch as an example: it includes at least a second fully-connected layer structure connected to the neural network structure, and the output vector of the neural network structure includes second tag data for the start-stop point judgment branch. The second tag data relates to the second attribute, characterizes the start point position and end point position of the wake-up word in the output vector, and is input to the second fully-connected layer structure as a second supervised learning target.
In some examples, referring to fig. 4, which is a flowchart of a training method of a start-stop point judgment branch of a voice wake-up model according to an embodiment of the present application, the adjusting parameters of the start-stop point judgment branch by using an output result of the start-stop point judgment branch in step S2032 includes:
step S401: and determining a second loss function of the start-stop point judgment branch according to the second attribute of the output vector and the output result of the start-stop point judgment branch.
In this step, the second loss function of the start-stop point judgment branch is the mean square error between the output start point position and end point position and the actual start point position and end point position.
Step S402: and adjusting parameters of the start and stop point judgment branch based on the second loss function.
In this step, the parameters of the start-stop point judgment branch are iteratively updated according to the calculated value of the second loss function until convergence is reached, so as to obtain an updated start-stop point judgment branch.
In one example, the second loss function is calculated as:

L₂ = (s1 − k1)² + (s2 − k2)²

wherein L₂ is the mean square error value; s1 is the true start point position of the wake-up word in the output vector; s2 is the true end point position of the wake-up word in the output vector; k1 is the start point position output by the start-stop point judgment branch; and k2 is the end point position output by the start-stop point judgment branch.
Based on this, the start point and end point of the wake-up word calculated and output can be compared with the start and end point information marked in the audio sample. On the one hand this can be used to evaluate the training effect; on the other hand it can be used to adjust parameters to optimize the start-stop point judgment branch, the branch after parameter optimization being taken as the updated start-stop point judgment branch. In this embodiment, the parameters may be adjusted so that the start point position k1 and end point position k2 output by the start-stop point judgment branch approach or equal the true start point position s1 and true end point position s2 of the wake-up word in the output vector, respectively.
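The corresponding sketch for the second loss function, again mirroring the formula above:

```python
import torch

def start_stop_loss(k, s):
    """L2 = (s1-k1)^2 + (s2-k2)^2, summed per sample, averaged over the batch.

    k: predicted (start, end) positions, shape (batch, 2)
    s: labelled true (start, end) positions, shape (batch, 2)
    """
    return ((s - k) ** 2).sum(dim=1).mean()
```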
Referring again to fig. 1, in some examples, the output result of the wake-up word classification branch and/or the output result of the start-stop point judgment branch may also be used to adjust the parameters of the neural network structure so as to obtain an updated neural network structure. Illustratively, the neural network structure may use an error back-propagation algorithm to revise its parameters during training, so that the reconstruction error loss of the neural network structure becomes smaller and smaller until the error loss converges.
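Putting the pieces together, one possible multitask update step is sketched below. It reuses the WakeWordModel and the two loss sketches above; masking the regression loss to wake-word samples reflects the statement that only samples containing the wake-up word train the start-stop branch, and loss.backward() propagates both losses into the shared backbone. The optimizer and learning rate are assumptions.

```python
import torch

model = WakeWordModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(x, y_hat, s, has_wake):
    """x: feature frames; y_hat: wake-word labels; s: true start/stop
    positions; has_wake: boolean mask of samples containing the wake word."""
    optimizer.zero_grad()
    y, k = model(x)                                  # both branch outputs
    loss = wake_word_loss(y.squeeze(1), y_hat)       # all samples
    if has_wake.any():                               # wake-word samples only
        loss = loss + start_stop_loss(k[has_wake], s[has_wake])
    loss.backward()                                  # updates both branches
    optimizer.step()                                 # and the shared backbone
    return loss.item()
```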
For a better understanding of the present application, a more detailed embodiment is listed below, and as shown in fig. 5, an embodiment of the present application provides a training method for a voice wake model, which includes the following steps:
step S501: and generating wake word samples and non-wake word samples.
In this step, the wake-up word of the voice wake-up model is set to "Hi Le Xin", and all other audio samples are non-wake-up words. The wake-up word samples may be collected by manually recording the phrase "Hi Le Xin" in a quiet, low-noise environment. To enhance the training effect, after the wake-up word samples are collected, their complexity is increased by adding noise, adjusting the position of the wake-up word in the sample audio, and the like. Non-wake-up word samples are obtained by intercepting other audio that contains human speech but not the wake-up word, such as television series and films.
Step S502: marking the starting and ending points of the wake-up words.
In this step, the start and end points of the wake-up word in the audio sample are required for training. To this end, the starting time point and ending time point of the wake-up word in the audio sample can be marked by machine labeling or manual labeling. Non-wake-up word samples need not be annotated.
Step S503: and intercepting the data frame.
In this step, a window of 1 s length is moved from the start point of the wake-up word in the audio sample at intervals of 32 ms, intercepting a plurality of data frames, which may optionally be uploaded to the cloud for secondary recognition of the wake-up word.
In one example, in order to avoid repeated computation, save computing resources and increase recognition speed, among the several intercepted data frames containing the wake-up word, only the first data frame may be uploaded to the cloud for secondary recognition of the wake-up word.
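A sketch of this sliding-window interception with numpy, using the 1 s window and 32 ms step of this embodiment (a 16 kHz sample rate is assumed):

```python
import numpy as np

def slice_windows(audio, sr=16000, win_s=1.0, hop_s=0.032):
    """Cut fixed-length windows every 32 ms starting at the wake-word start."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    frames = [audio[i:i + win] for i in range(0, len(audio) - win + 1, hop)]
    if not frames:                           # audio shorter than one window
        return np.empty((0, win), dtype=audio.dtype)
    return np.stack(frames)                  # shape: (n_windows, win)
```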
Step S504: feature extraction and feature enhancement of a data frame.
In this step, feature extraction and feature enhancement are performed on the one-dimensional arrays in the voice samples through MFCC, and 32-bit 50×32 two-dimensional feature arrays are output.
Step S505: outputting a one-dimensional vector by convolution calculation.
In this step, the 32-bit 50×32 two-dimensional feature arrays are input into the neural network structure, and a one-dimensional vector is output after convolution calculation.
Step S506: training wake-up word classifying branch and start-stop point judging branch
In the step, one-dimensional vectors output by the neural network structure are respectively input into a wake-up word classification branch and a start-stop point judgment branch, and classification tasks of wake-up words and non-wake-up words and regression tasks of calculating start points and end points of the wake-up words are operated at the same time.
Both the wake-up word classification branch and the start-stop point judgment branch adopt a fully-connected layer structure (also called a linear layer), which serves as the last layer of the neural network and produces the outputs used in subsequent operations. The loss function of the wake-up word classification branch is cross entropy; the loss function of the start-stop point judgment branch is the mean square error between the output start/end point positions and the true start/end point positions. For the specific data types and parameter adjustment of the two branches, refer to the descriptions of the first and second loss functions above, which are not repeated here. Considering that the two branches use different parameters in subsequent operations, the two tasks need to be trained separately.
The wake-up word classification branch is used to train the wake-up word classification task. Its input includes one-dimensional vector data samples of both wake-up words and non-wake-up words, and after training it outputs two values: the probability that the sample is a wake-up word and the probability that it is not, the two probabilities summing to 1. The output of the wake-up word classification branch is compared with the audio annotations to evaluate the training effect of the model and to adjust parameters to optimize the branch.
The start-stop point judgment branch is used to train the regression task of the start and end points of the wake-up word. Its input includes only one-dimensional vector data samples of wake-up words, and after training it outputs two values: the start point position and the end point position of the wake-up word. The output of the start-stop point judgment branch is compared with the audio annotations to evaluate the training effect of the model and to adjust parameters to optimize the branch.
A large number of tests have shown the embodiment of the application to be effective: for example, on the resource-constrained ESP32-S3, a recognition rate of more than 95% is achieved using only 5% of the CPU's computing resources, and Amazon's Alexa performance test is passed.
In summary, the training method of the voice wake-up model in the embodiment of the application adopts multitask training: a wake-up word classification branch and a start-stop point judgment branch are arranged in parallel at the end of the wake-up word detection network, and the wake-up word classification task and the wake-up word start-stop point regression task are trained simultaneously, so that the model outputs whether the audio sample contains the wake-up word together with the start point position and end point position of the wake-up word in the audio sample. The trained voice wake-up model can therefore obtain the start and end points of the wake-up word quickly and accurately, helping to intercept the wake-up word in the audio frame more precisely. Because the classification task and the start-stop point regression task are trained simultaneously, the effect of the neural network structure is enhanced, and compared with a single training task, the accuracy of wake-up word detection is further improved. In addition, since the embodiment of the application can intercept the wake-up word more accurately, it is particularly advantageous in scenarios requiring secondary verification in the cloud.
Based on the above embodiments, the trained voice wake-up model may be preset in the voice wake-up device, so that the voice wake-up device has a voice wake-up function. Correspondingly, the embodiment of the application provides a method for executing a voice wake-up model, as shown in fig. 6, which comprises the following steps:
step S601: and acquiring voice data to be detected.
Step S602: and inputting the voice data to be detected into the trained voice wake-up model.
Step S603: and obtaining the output result of the wake-up word classification branch and the output result of the start-stop point judgment branch.
Step S604: and outputting the wake-up result of the voice wake-up model based on the output result of the wake-up word classification branch and/or the output result of the start-stop point judgment branch.
In some embodiments, the method for executing the voice wake model further includes: and sending the output result of the wake-up word classification branch and/or the output result of the start and stop point judgment branch to a voice recognition cloud platform to perform secondary recognition of the wake-up word.
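A sketch of how a device might use the two branch outputs at inference time is shown below. The 0.9 trigger threshold is an assumption, as is the convention that the start-stop branch reports positions in seconds within the analysed window; upload_for_secondary_check stands in for whatever cloud interface the device uses and is hypothetical.

```python
import torch

def run_wakeup(model, features, audio, sr=16000, threshold=0.9):
    """features: (1, 1, n_frames, n_mfcc) tensor for one analysis window.
    audio: the raw samples of that same window."""
    model.eval()
    with torch.no_grad():
        y, k = model(features)
    if y.item() < threshold:
        return None                              # no wake word detected
    start_s, end_s = k.squeeze(0).tolist()       # predicted start/stop points
    clip = audio[int(start_s * sr):int(end_s * sr)]
    upload_for_secondary_check(clip)             # hypothetical cloud call
    return clip
```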
The execution method of voice wake-up provided by the embodiment of the application is based on the trained voice wake-up model and can output the start point position and end point position of the wake-up word in the audio sample while outputting whether the audio sample contains the wake-up word, which significantly improves the accuracy of wake-up word detection and greatly improves voice wake-up performance. Because the application can intercept the wake-up word in the audio frame more accurately, it is particularly advantageous in scenarios requiring secondary verification in the cloud.
Referring to fig. 7, the embodiment of the application also provides a training device of the voice wake-up model, which comprises an acquisition module, a generation module and a training module. The acquisition module is used for acquiring an audio sample and performing acoustic feature processing on the audio sample to obtain an acoustic feature frame. The generation module is used for inputting the acoustic feature frame into the neural network structure to generate an output vector, wherein the output vector has a first attribute and/or a second attribute, the first attribute is whether a wake-up word is contained or not, and the second attribute is a starting point position and an ending point position of the wake-up word in the output vector. The training module is used for training the wake-up word classification branch and the start-stop point judgment branch according to the output vector, and comprises the following steps: respectively inputting the output vector to a wake-up word classification branch and a start-stop point judgment branch, outputting the probability that the output vector contains the wake-up word by the wake-up word classification branch, and outputting the start point position and the end point position of the wake-up word in the output vector by the start-stop point judgment branch; and respectively adjusting the parameters of the wake-up word classification branch and the parameters of the start-stop point judgment branch by using the output result of the wake-up word classification branch and the output result of the start-stop point judgment branch so as to obtain an updated voice wake-up model.
In this embodiment, the specific limitation, implementation principle and beneficial effects of the training device of the voice wake-up model can be referred to the above description of the training method of the voice wake-up model, which is not repeated here. The various modules described above may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
Referring to fig. 8, an embodiment of the present application further provides a voice wake-up device, which includes a memory and a processor, where the memory stores a computer program, and the processor may implement the training step of the voice wake-up model and/or the executing step of the voice wake-up model in the foregoing embodiments when executing the computer program.
The embodiment of the application also provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor can implement the training step of the voice wake-up model and/or the executing step of the voice wake-up model in the above embodiments.
The embodiment of the application also provides a computer program which can realize the training step of the voice wake-up model and/or the executing step of the voice wake-up model in the above embodiments when being executed by a processor.
In the above embodiments, the implementation principles and beneficial effects of the voice wake-up device, the computer readable storage medium and the computer program of the voice wake-up model may be referred to the above description of the training method and the execution method of the voice wake-up model, which are not repeated here.
Those skilled in the art will appreciate that all or part of the above methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium; when executed, the program may include the steps of the method embodiments described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM is available in various forms, such as static random access memory (SRAM) and dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of relational and non-relational databases. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processor referred to in the embodiments provided herein may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
While various embodiments of the various aspects of the present application have been described for the purposes of this disclosure, it should not be construed that the teachings of this disclosure are limited to these embodiments. Features disclosed in one particular embodiment are not limited to that embodiment, but may be combined with features disclosed in a different embodiment. For example, one or more features and/or operations of the method according to the application described in one embodiment may also be applied in another embodiment, alone, in combination, or in whole. It will be understood by those skilled in the art that there are many more alternative embodiments and variations possible and that various changes and modifications may be made to the above system without departing from the scope of the application as defined in the following claims.
Claims (20)
1. A training method for a voice wake-up model, characterized in that the voice wake-up model comprises a neural network structure, a wake-up word classification branch and a start-stop point judgment branch, the method comprising:
acquiring an audio sample, and performing acoustic feature processing on the audio sample to obtain an acoustic feature frame;
inputting the acoustic feature frame into a neural network structure to generate an output vector, wherein the output vector has a first attribute and/or a second attribute, the first attribute is whether a wake-up word is contained or not, and the second attribute is a starting point position and an ending point position of the wake-up word in the output vector; and
training the wake-up word classification branch and the start-stop point judgment branch according to the output vector, wherein the training comprises:
inputting the output vector into the wake-up word classification branch and the start-stop point judgment branch respectively, the wake-up word classification branch outputting the probability that the audio sample contains the wake-up word, and the start-stop point judgment branch outputting the start point position and end point position of the wake-up word in the audio sample; and
adjusting parameters of the wake-up word classification branch and parameters of the start-stop point judgment branch using, respectively, the output result of the wake-up word classification branch and the output result of the start-stop point judgment branch, to obtain an updated wake-up word classification branch and an updated start-stop point judgment branch.
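By way of illustration and not limitation, the following is a minimal sketch of how the training step of claim 1 might look in Python with PyTorch. All names here (WakeWordModel, cls_branch, boundary_branch), the backbone layers, the layer sizes, and the Adam optimizer are assumptions of this sketch, not features taken from the claims:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WakeWordModel(nn.Module):
    """Two-branch wake-up model: a shared backbone feeding a wake-up word
    classification branch and a start-stop point judgment branch."""
    def __init__(self, feat_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        # Shared neural network structure producing the output vector.
        self.backbone = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Wake-up word classification branch: fully-connected layer -> probability.
        self.cls_branch = nn.Linear(hidden_dim, 1)
        # Start-stop point judgment branch: fully-connected layer -> (start, end).
        self.boundary_branch = nn.Linear(hidden_dim, 2)

    def forward(self, x):
        h = self.backbone(x)                      # the output vector
        prob = torch.sigmoid(self.cls_branch(h))  # P(wake word), in [0, 1]
        bounds = self.boundary_branch(h)          # predicted start/end positions
        return prob, bounds

model = WakeWordModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on a dummy batch: features, first attribute (label),
# second attribute (normalized true start/end positions).
feats = torch.randn(8, 64)
labels = torch.randint(0, 2, (8, 1)).float()
bounds_true = torch.rand(8, 2)

prob, bounds_pred = model(feats)
loss = (F.binary_cross_entropy(prob, labels)     # first loss (cf. claim 3)
        + F.mse_loss(bounds_pred, bounds_true))  # second loss (cf. claim 8)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Each loss term reaches only its own branch head (plus the shared backbone), so summing the two losses adjusts each branch's parameters from its respective output result, while the shared structure is updated as described in claim 12.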
2. The method of claim 1, wherein the wake-up word classification branch comprises at least a first fully-connected layer structure connected to the neural network structure, and the output vector of the neural network structure comprises first tag data for the wake-up word classification branch; the first tag data relates to the first attribute, characterizes whether the wake-up word is contained in the output vector, and is input to the first fully-connected layer structure as a first supervised learning objective.
3. The method of claim 1, wherein adjusting the parameters of the wake-up word classification branch using the output result of the wake-up word classification branch comprises:
determining a first loss function of the wake-up word classification branch according to the first attribute of the output vector and the output result of the wake-up word classification branch; and
adjusting the parameters of the wake-up word classification branch based on the first loss function.
4. The method of claim 3, wherein adjusting the parameters of the wake-up word classification branch based on the first loss function comprises:
iteratively updating the parameters of the wake-up word classification branch according to the calculated value of the first loss function until convergence, to obtain the updated wake-up word classification branch.
5. The method of claim 3, wherein the first loss function of the wake-up word classification branch is a cross-entropy loss.
6. The method of claim 5, wherein the first loss function is calculated as:
L1 = -(ŷ · log(y) + (1 - ŷ) · log(1 - y))
wherein ŷ is the label of the wake-up word: when the wake-up word is included in the output vector, ŷ = 1, and when the wake-up word is not included in the output vector, ŷ = 0;
and y is the wake-up word probability output by the wake-up word classification branch, y ranging over [0, 1].
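By way of illustration and not limitation, a worked example of this cross-entropy (the standard binary cross-entropy under our reading of the claim; the function name is an assumption):

```python
import math

def first_loss(y_hat: int, y: float) -> float:
    """Cross-entropy between wake-word label y_hat (1 or 0) and probability y."""
    eps = 1e-12  # guards against log(0) at the ends of y's [0, 1] range
    return -(y_hat * math.log(y + eps) + (1 - y_hat) * math.log(1 - y + eps))

print(first_loss(1, 0.9))  # wake word present, confident prediction -> ~0.105
print(first_loss(1, 0.1))  # wake word present, wrong prediction     -> ~2.303
print(first_loss(0, 0.1))  # wake word absent, confident prediction  -> ~0.105
```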
7. The method of claim 1, wherein the start-stop point judgment branch comprises at least a second fully-connected layer structure connected to the neural network structure, and the output vector of the neural network structure comprises second tag data for the start-stop point judgment branch; the second tag data relates to the second attribute, characterizes the start point position and end point position of the wake-up word in the output vector, and is input to the second fully-connected layer structure as a second supervised learning objective.
8. The method of claim 1, wherein adjusting the parameters of the start-stop point judgment branch using the output result of the start-stop point judgment branch comprises:
determining a second loss function of the start-stop point judgment branch according to the second attribute of the output vector and the output result of the start-stop point judgment branch; and
adjusting the parameters of the start-stop point judgment branch based on the second loss function.
9. The method of claim 8, wherein adjusting the parameters of the start-stop point judgment branch based on the second loss function comprises:
iteratively updating the parameters of the start-stop point judgment branch according to the calculated value of the second loss function until convergence, to obtain the updated start-stop point judgment branch.
10. The method of claim 8, wherein the second loss function of the start-stop point judgment branch is the mean square error between the output start point and end point positions and the true start point and end point positions.
11. The method of claim 10, wherein the second loss function is calculated as:
L2 = (s1 - k1)² + (s2 - k2)²
wherein L2 is the mean square error value;
s1 is the true start point position of the wake-up word in the output vector;
s2 is the true end point position of the wake-up word in the output vector;
k1 is the start point position output by the start-stop point judgment branch; and
k2 is the end point position output by the start-stop point judgment branch.
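By way of illustration and not limitation, the mean square error of claim 11 amounts to the following (the function name is an assumption):

```python
def second_loss(s1: float, s2: float, k1: float, k2: float) -> float:
    """Squared error between true boundaries (s1, s2) and predictions (k1, k2)."""
    return (s1 - k1) ** 2 + (s2 - k2) ** 2

# True wake word spans positions 10..25; the branch predicts 12..24.
print(second_loss(10, 25, 12, 24))  # (10 - 12)**2 + (25 - 24)**2 = 5
```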
12. The method of claim 1, further comprising: adjusting parameters of the neural network structure using the output result of the wake-up word classification branch and/or the output result of the start-stop point judgment branch, to obtain an updated neural network structure.
13. The method of claim 1, wherein performing acoustic feature processing on the audio sample comprises:
detecting the end point of the wake-up word in the audio sample, and, starting from the start point of the wake-up word, intercepting data frames according to a fixed-length time window until the end point of the wake-up word, to obtain a plurality of data frames of equal duration, the plurality of data frames being in the form of one-dimensional array data; and
extracting audio frequency spectrum features from the one-dimensional arrays and outputting a two-dimensional feature array.
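By way of illustration and not limitation, a minimal sketch of this framing-and-feature step in Python with NumPy. The window and hop lengths, and the use of a plain magnitude spectrum rather than, say, mel filterbank features, are assumptions of this sketch, not requirements of the claim:

```python
import numpy as np

def frame_and_featurize(audio: np.ndarray, start: int, end: int,
                        win: int = 400, hop: int = 160) -> np.ndarray:
    """Cut fixed-length windows from the wake word's start point to its end
    point, then return a 2-D feature array: one magnitude-spectrum row per
    data frame."""
    segment = audio[start:end]                      # one-dimensional array data
    n_frames = 1 + max(0, len(segment) - win) // hop
    frames = np.stack([segment[i * hop : i * hop + win]
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))      # 2-D feature array

audio = np.random.randn(16000)                      # dummy 1 s of 16 kHz audio
feats = frame_and_featurize(audio, start=2000, end=10000)
print(feats.shape)                                  # (48, 201): frames x bins
```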
14. The method of claim 13, wherein, in response to the intercepted data frames containing a plurality of the wake-up words, the data frame containing the first wake-up word is retained.
15. The method of claim 13, wherein the inputting the acoustic feature frame into a neural network structure to generate an output vector comprises:
inputting the two-dimensional feature array into the neural network structure, and generating a one-dimensional vector through a convolution algorithm.
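By way of illustration and not limitation, one way such a convolution step could map the two-dimensional feature array to a one-dimensional vector (a PyTorch sketch; the layer choices are assumptions of this sketch):

```python
import torch
import torch.nn as nn

# 2-D feature array (time x frequency) in, one-dimensional vector out.
conv_backbone = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d((4, 4)),  # collapse variable time/freq to a fixed grid
    nn.Flatten(),                  # -> one-dimensional vector per sample
)

x = torch.randn(1, 1, 48, 201)     # batch of one 2-D feature array
out = conv_backbone(x)
print(out.shape)                   # torch.Size([1, 256]) == 16 * 4 * 4
```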
16. A method of executing a voice wake-up model, comprising:
acquiring voice data to be detected;
inputting the voice data to be detected into a voice wake-up model trained by the method of any one of claims 1 to 15;
obtaining the output result of the wake-up word classification branch and the output result of the start-stop point judgment branch; and
outputting a wake-up result of the voice wake-up model based on the output result of the wake-up word classification branch and/or the output result of the start-stop point judgment branch.
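By way of illustration and not limitation, reusing the WakeWordModel sketch that follows claim 1, the execution of claim 16 could fuse the two branch outputs as follows; the decision threshold is an assumption of this sketch, not something the claim specifies:

```python
import torch

THRESHOLD = 0.5   # wake decision threshold; an assumption of this sketch

@torch.no_grad()
def detect_wake(model, feats: torch.Tensor):
    """Run both branches and fuse their outputs into a wake-up result."""
    prob, bounds = model(feats)          # classification and start-stop outputs
    if prob.item() >= THRESHOLD:
        start, end = bounds.squeeze().tolist()
        return True, (start, end)        # wake up, reporting the word's location
    return False, None                   # stay asleep

# `model` is the trained WakeWordModel from the sketch following claim 1.
woke, span = detect_wake(model, torch.randn(1, 64))
```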
17. The method of claim 16, further comprising:
sending the output result of the wake-up word classification branch and/or the output result of the start-stop point judgment branch to a speech recognition cloud platform for secondary recognition of the wake-up word.
18. A training device for a voice wake-up model, the voice wake-up model comprising a neural network structure, a wake-up word classification branch, and a start-stop point judgment branch, the device comprising:
an acquisition module configured to acquire an audio sample and perform acoustic feature processing on the audio sample to obtain an acoustic feature frame;
a generation module configured to input the acoustic feature frame into a neural network structure to generate an output vector, wherein the output vector has a first attribute and/or a second attribute, the first attribute being whether a wake-up word is contained, and the second attribute being the start point position and end point position of the wake-up word in the output vector; and
a training module configured to train the wake-up word classification branch and the start-stop point judgment branch according to the output vector, wherein the training comprises:
inputting the output vector into the wake-up word classification branch and the start-stop point judgment branch respectively, the wake-up word classification branch outputting the probability that the output vector contains the wake-up word, and the start-stop point judgment branch outputting the start point position and end point position of the wake-up word in the output vector; and
adjusting parameters of the wake-up word classification branch and parameters of the start-stop point judgment branch using, respectively, the output result of the wake-up word classification branch and the output result of the start-stop point judgment branch, to obtain an updated voice wake-up model.
19. A voice wake-up apparatus comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the training method of any one of claims 1 to 15 and/or the execution method of claim 16 or 17.
20. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the training method of any one of claims 1 to 15 and/or the execution method of claim 16 or 17.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311213749.XA CN117174082A (en) | 2023-09-19 | 2023-09-19 | Training and executing method, device, equipment and storage medium of voice wake-up model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117174082A true CN117174082A (en) | 2023-12-05 |
Family
ID=88931728
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311213749.XA Pending CN117174082A (en) | 2023-09-19 | 2023-09-19 | Training and executing method, device, equipment and storage medium of voice wake-up model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117174082A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |