CN118553234A - Speech recognition model training, testing and speech recognition method and device
- Publication number: CN118553234A
- Application number: CN202410726281.2A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Abstract
The disclosure provides a speech recognition model training method and apparatus, relating to the technical field of artificial intelligence, in particular to the technical fields of speech recognition, deep learning, large models, and the like, and applicable to scenarios such as artificial intelligence content generation. The specific implementation scheme is as follows: obtaining a speech sample set, the speech sample set comprising at least one speech sample, the speech sample comprising an audio feature sequence and an initial word unit sequence; obtaining an initial speech recognition model, wherein the speech recognition model is used for characterizing the correspondence between audio feature sequences and predicted word unit sequences; replacing the language word unit in each initial word unit sequence in the speech sample set with a predicted word unit characterizing the language, to obtain a training sample set, wherein the predicted word unit is taken from the predicted word unit sequence obtained by inputting a speech sample selected from the speech sample set into the speech recognition model; and training the speech recognition model based on the training sample set to obtain a trained speech recognition model.
Description
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of speech recognition, deep learning, large models, and the like, and can be applied to scenarios such as artificial intelligence content generation. It particularly relates to a speech recognition model training method and apparatus, a speech recognition model testing method and apparatus, a speech recognition method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
Background
At present, speech recognition technology is mature; for widely used languages such as English and Chinese, it is comparable to professional transcribers. However, some languages are constrained by objective factors such as a limited number of training samples, so existing models have weaker speech recognition capability for them.
Disclosure of Invention
The present disclosure provides a speech recognition model training method and apparatus, a speech recognition model testing method and apparatus, a speech recognition method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
According to a first aspect, there is provided a speech recognition model training method, the method comprising: obtaining a speech sample set, the speech sample set comprising at least one speech sample, the speech sample comprising an audio feature sequence and an initial word unit sequence; obtaining an initial speech recognition model, wherein the speech recognition model is used for characterizing the correspondence between audio feature sequences and predicted word unit sequences; replacing the language word unit in each initial word unit sequence in the speech sample set with a predicted word unit characterizing the language, to obtain a training sample set, wherein the predicted word unit is taken from the predicted word unit sequence obtained by inputting a speech sample selected from the speech sample set into the speech recognition model; and training the speech recognition model based on the training sample set to obtain a trained speech recognition model.
According to a second aspect, there is provided a speech recognition model testing method, the method comprising: obtaining a test set and a trained speech recognition model, wherein the test set comprises at least one test sample, and the test sample comprises an audio feature sequence and a test word unit sequence; the trained speech recognition model is obtained by the speech recognition model training method described in any implementation manner of the first aspect, and the speech recognition model comprises an encoder, a decoder, and a keyword module; selecting a test sample from the test set; inputting the audio feature sequence in the test sample into the encoder to obtain an audio intermediate feature; inputting the audio intermediate feature and the current word unit sequence into the decoder to obtain a predicted word unit output by the speech recognition model; in response to the predicted word unit being a non-end symbol, updating the current word unit sequence based on the predicted word unit and continuing to input the audio intermediate feature and the current word unit sequence into the decoder, until the predicted word unit output by the speech recognition model is the end symbol, so as to obtain all the predicted word units corresponding to the test sample; and detecting whether the speech recognition model is qualified based on all the predicted word units corresponding to the test sample and the test word unit sequence in the test sample.
According to a third aspect, there is provided a speech recognition method, the method comprising: acquiring voice to be recognized; processing the voice to be recognized to obtain audio characteristic data; inputting the audio feature data into a voice recognition model to obtain a predicted word unit sequence of the voice to be recognized, wherein the voice recognition model is a trained voice recognition model obtained by adopting the voice recognition model training method described in any implementation manner of the first aspect; based on the predicted word unit sequence, text data of the voice to be recognized is obtained.
According to a fourth aspect, there is provided a speech recognition model training apparatus, the apparatus comprising: a sample acquisition unit configured to acquire a speech sample set, the speech sample set including at least one speech sample, the speech sample including an audio feature sequence and an initial word unit sequence; a model acquisition unit configured to acquire an initial speech recognition model, wherein the speech recognition model is used for characterizing the correspondence between audio feature sequences and predicted word unit sequences; a replacing unit configured to replace the language word unit in each initial word unit sequence in the speech sample set with a predicted word unit characterizing the language, to obtain a training sample set, wherein the predicted word unit is taken from the predicted word unit sequence obtained by inputting a speech sample selected from the speech sample set into the speech recognition model; and a training unit configured to train the speech recognition model based on the training sample set to obtain a trained speech recognition model.
According to a fifth aspect, there is provided a speech recognition model testing apparatus comprising: an information acquisition unit configured to acquire a test set and a trained speech recognition model, the test set including at least one test sample, the test sample including an audio feature sequence and a test word unit sequence; the trained speech recognition model is obtained by the speech recognition model training apparatus described in any implementation manner of the fourth aspect, where the speech recognition model includes an encoder, a decoder, and a keyword module; a selecting unit configured to select a test sample from the test set; an input unit configured to input the audio feature sequence in the test sample into the encoder to obtain an audio intermediate feature; an obtaining unit configured to input the audio intermediate feature and the current word unit sequence into the decoder to obtain a predicted word unit output by the speech recognition model; an updating unit configured to, in response to the predicted word unit being a non-end symbol, update the current word unit sequence based on the predicted word unit and continue to input the audio intermediate feature and the current word unit sequence into the decoder, until the predicted word unit output by the speech recognition model is the end symbol, to obtain all the predicted word units corresponding to the test sample; and a test unit configured to detect whether the speech recognition model is qualified based on all the predicted word units corresponding to the test sample and the test word unit sequence in the test sample.
According to a sixth aspect there is provided a speech recognition apparatus comprising: a voice acquisition unit configured to acquire a voice to be recognized; the processing unit is configured to process the voice to be recognized to obtain audio characteristic data; the recognition unit is configured to input the audio feature data into a voice recognition model to obtain a predicted word unit sequence of the voice to be recognized, wherein the voice recognition model is a trained voice recognition model obtained by a voice recognition model training device described by any implementation mode of the fourth aspect; and the conversion unit is configured to obtain text data of the voice to be recognized based on the predicted word unit sequence.
According to a seventh aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in any one of the implementations of the first, second, or third aspects.
According to an eighth aspect, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform a method as described in any implementation of the first, second or third aspects.
According to a ninth aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as described in any of the implementations of the first, second or third aspects.
The embodiment of the disclosure provides a speech recognition model training method and apparatus. First, a speech sample set is obtained, the speech sample set including at least one speech sample, the speech sample including an audio feature sequence and an initial word unit sequence. Second, an initial speech recognition model is obtained, the speech recognition model being used for characterizing the correspondence between audio feature sequences and predicted word unit sequences. Third, the language word units in the initial word unit sequences in the speech sample set are replaced with the predicted word unit characterizing the language, so as to obtain a training sample set, wherein the predicted word unit is taken from the predicted word unit sequence obtained by inputting a speech sample selected from the speech sample set into the speech recognition model. Finally, the speech recognition model is trained based on the training sample set to obtain a trained speech recognition model. Therefore, before the speech recognition model is trained, the language word units in the initial word unit sequences in the speech sample set are replaced with the predicted word unit characterizing the language, so that the model can predict the audio feature sequences in a familiar word unit environment, which improves the convergence rate of the model and the training effect of the model.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of one embodiment of a speech recognition model training method according to the present disclosure;
FIG. 2 is a schematic diagram of one architecture of the keyword module training process of the present disclosure;
FIG. 3 is a flow chart of one embodiment of a speech recognition model testing method according to the present disclosure;
FIG. 4 is a schematic diagram of a structure of the speech recognition model testing process of the present disclosure;
FIG. 5 is a flow chart of one embodiment of a speech recognition method according to the present disclosure;
FIG. 6 is a schematic diagram of a structure of one embodiment of a speech recognition model training apparatus according to the present disclosure;
FIG. 7 is a schematic diagram of a structure of one embodiment of a speech recognition model testing apparatus according to the present disclosure;
FIG. 8 is a schematic diagram of a structure of one embodiment of a speech recognition apparatus according to the present disclosure;
FIG. 9 is a block diagram of an electronic device used to implement the speech recognition model training method, the speech recognition model testing method, or the speech recognition method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Traditional speech recognition techniques mainly implement speech-to-text conversion based on acoustic models and language models. The acoustic model constructs the relation between speech and text; common acoustic models are hidden Markov models and Gaussian mixture models. The language model constructs the relations among words in the text; a common language model is the N-gram model. Because traditional speech recognition technology is limited by manually designed speech features and non-end-to-end recognition schemes, its recognition capability cannot match that of human beings.
The deep learning-based speech recognition technique is capable of automatically obtaining useful feature representations from raw speech data and learning complex features and structures in speech signals using complex network model structures. This allows for higher accuracy and robustness of speech recognition techniques in processing non-standard speech, accents, dialects, and speech at different speech rates.
With the wide use of model structures such as the Transformer in the field of speech recognition and the rapid development of computer hardware, speech recognition technology has made a qualitative leap. For widely used languages such as English and Chinese, speech recognition techniques are comparable to professionals; but for less widely used languages, the recognition capability of speech recognition models is still weak.
Based on this, the present disclosure proposes a speech recognition model training method, fig. 1 shows a flow 100 according to one embodiment of the speech recognition model training method of the present disclosure, the speech recognition model training method comprising the steps of:
Step 101, a speech sample set is obtained.
In this embodiment, the speech sample set is a set of speech samples of the language to be recognized. For example, the language to be recognized may be a language with few samples; the speech recognition model is used to recognize speech in this language and obtain text in this language, which can promote the propagation, learning, and research of this language.
In this embodiment, the executing body on which the speech recognition model training method runs may acquire the speech sample set in various manners. For example, the executing body may acquire the speech sample set stored in a database server through a wired or wireless connection. For another example, it may obtain a speech sample set collected by a terminal by communicating with that terminal.
Here, the set of speech samples may include at least one speech sample including: an audio feature sequence and an initial word unit sequence; the audio feature sequence is a sequence obtained after feature extraction of audio, such as mel features, fbank (FilterBank) features and the like.
In this embodiment, the initial word unit sequence, also called a token sequence, is obtained by dividing the text corresponding to the audio feature sequence into language units such as words, punctuation marks, numbers, or pure letters; the initial word unit sequence contains a language word unit representing the language of the text corresponding to the audio feature sequence.
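For illustration, the following is a minimal sketch of assembling one such speech sample with torchaudio; the 80-bin Fbank configuration and the externally tokenized transcript ids are assumptions for this sketch, not details fixed by the disclosure.

```python
import torch
import torchaudio

def build_speech_sample(wav_path, transcript_token_ids):
    """Return (audio feature sequence, initial word unit sequence)."""
    waveform, sample_rate = torchaudio.load(wav_path)
    # 80-dimensional Fbank features, one frame per 10 ms (a common choice).
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform, num_mel_bins=80, sample_frequency=sample_rate
    )  # shape: (num_frames, 80)
    # The transcript is assumed to be tokenized elsewhere into integer ids;
    # one of these ids is the language word unit (language token).
    token_ids = torch.tensor(transcript_token_ids, dtype=torch.long)
    return fbank, token_ids
```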
In the technical scheme of the disclosure, the related processing such as collection, storage, use, processing, transmission, provision, disclosure and the like of the voice sample set is performed after authorization, and accords with related laws and regulations.
Step 102, an initial speech recognition model is obtained.
In this embodiment, the speech recognition model is used to characterize the correspondence between the audio feature sequence and the predicted word unit sequence. Specifically, the speech recognition model may be a Whisper model, which is a standard Transformer model. The Whisper model takes fixed-length audio (30 seconds) as input: first, the Fbank features extracted from the audio are input into the encoder of the Whisper model to generate intermediate audio features; then, the decoder of the Whisper model decodes the intermediate features and outputs the predicted word unit sequence.
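As a sketch of this encode-decode pipeline, the snippet below uses the publicly released Whisper checkpoints through the Hugging Face transformers library; the disclosure does not specify this library or checkpoint, so both are assumptions here.

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

def transcribe(audio_16khz):
    """audio_16khz: 1-D float array of samples at 16 kHz."""
    # The processor pads/truncates to the fixed 30-second window and
    # extracts the log-mel features fed to the encoder.
    inputs = processor(audio_16khz, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        # generate() runs the encoder once, then decodes autoregressively.
        predicted_ids = model.generate(inputs.input_features)
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```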
Optionally, the speech recognition model may also be an MMS (Massively Multilingual Speech) model, which can recognize more than 4,000 spoken languages. Compared with the Whisper model, the MMS model has more parameters, more pre-training data, and stronger capabilities, but its parameter count is huge and its deployment cost is high.
Step 103, replacing the language word units in the initial word unit sequences in the speech sample set with the predicted word unit characterizing the language, to obtain a training sample set.
In this embodiment, the predicted word units are predicted word units in a predicted word unit sequence obtained by inputting a voice sample selected from a voice sample set into a voice recognition model.
In this embodiment, a speech sample is selected from the speech sample set, and its audio feature sequence is input into the speech recognition model to obtain the predicted word unit sequence output by the model. One predicted word unit in this sequence is a word unit characterizing a language; that predicted word unit replaces the language word unit in each initial word unit sequence in the speech sample set, so that the training sample set is more familiar to the speech recognition model.
In this embodiment, the language word unit in each initial word unit sequence is replaced with the predicted word unit characterizing the language to obtain a new word unit sequence, yielding a training sample set comprising the audio feature sequences and the new word unit sequences.
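A hedged sketch of this replacement step follows; `predict_token_ids` and `is_language_token` are hypothetical helpers standing in for the model's forward pass and for the test of whether a token is a language word unit.

```python
def build_training_set(speech_samples, predict_token_ids, is_language_token):
    """Replace each sample's language token with the model-predicted one."""
    probe_features, _ = speech_samples[0]          # a sample selected from the set
    predicted = predict_token_ids(probe_features)  # model's predicted token sequence
    predicted_lang = next(t for t in predicted if is_language_token(t))

    training_set = []
    for features, initial_tokens in speech_samples:
        # Swap the language word unit, keep all other word units as-is.
        new_tokens = [predicted_lang if is_language_token(t) else t
                      for t in initial_tokens]
        training_set.append((features, new_tokens))
    return training_set
```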
In this embodiment, the training samples include: an audio feature sequence X = (x_1, x_2, ..., x_k) and its corresponding token sequence Label = (l_1, l_2, ..., l_m). For the recognition task of a language with few samples, Label records the text information, in that language, corresponding to the current audio. The speech recognition model takes the audio sequence X as input and outputs the predicted token sequence Output = (o_1, o_2, ..., o_m). The difference between Label and Output is generally measured with a cross entropy loss function during training, so that Output stays as consistent with Label as possible. In the prediction process, the text in this language predicted by the model can be obtained from the output of the speech recognition model.
Speech recognition models may be divided into autoregressive and non-autoregressive models according to how Output is produced. A non-autoregressive model takes the audio sequence X as input and decodes Output in parallel; an autoregressive model takes the audio sequence X as input and decodes Output serially. Non-autoregressive decoding is fast, but easily ignores the context within the audio sequence, so the model effect is worse; autoregressive decoding is slow, but can build the front-to-back relations within the audio sequence, so the model effect is better. Because the number of samples for this language is small and training is relatively difficult, the present method adopts the autoregressive decoding mode to ensure the effect of the speech recognition model.
In this embodiment, the predicted word unit sequence Output by the speech recognition model is an array, for example [223, 2124, 2323, 45], where each number represents a different word, i.e., the Output above. Therefore, to improve the training effect of the speech recognition model, Output can be obtained in a single step during training (non-autoregressive), while during testing the predicted word units in Output are obtained one step at a time (autoregressive).
Step 104, training the voice recognition model based on the training sample set to obtain a trained voice recognition model.
In this embodiment, the executing body may select training samples from the training sample set obtained in step 103 and execute the following training steps A and B to complete the iterative training of the speech recognition model. The selection manner and the number of training samples selected from the training sample set are not limited in the present application, nor is the number of training iterations of the speech recognition model.
Step A, calculating a network loss value of the speech recognition model based on the training sample input to the speech recognition model.
In this embodiment, during each iterative training of the speech recognition model, a training sample is selected from the training sample set, the selected training sample is input into the speech recognition model, and a loss value of the speech recognition model is calculated based on a predicted word unit sequence output by the speech recognition model.
In this embodiment, the loss function of the speech recognition model for calculating the loss value of the speech recognition model may be a cross entropy loss function, which can measure the degree of difference between two different probability distributions in the same random variable, and is expressed as the difference between the true probability distribution and the predicted probability distribution in machine learning. The smaller the value of the cross entropy loss function, the better the predictive effect of the speech recognition model.
Alternatively, the loss function of the speech recognition model may also be a mean square error function, where the mean square error function is the expectation of the square difference between the predicted word unit sequence of the speech recognition model and the true value (the new word unit sequence in the training sample), and in the iterative training process of the speech recognition model, the loss function of the speech recognition model may be minimized by using a gradient descent algorithm, so as to iteratively optimize the network parameters of the speech recognition model.
The gradient is a vector indicating the direction along which the directional derivative of the loss function at a given point attains its maximum; that is, the loss function changes fastest along that direction at that point, with the greatest rate of change. In deep learning, the main task when training a neural network is to find the optimal network parameters (weights and biases), namely the parameters at which the loss function is minimal.
Step B, training the speech recognition model based on the loss value of the speech recognition model to obtain a trained speech recognition model.
In this embodiment, the trained speech recognition model is a trained speech recognition model obtained by inputting the selected speech sample into the speech recognition model for iterative training, and performing parameter tuning on the speech recognition model.
In this embodiment, whether the speech recognition model meets the training completion condition can be detected through the loss value of the speech recognition model, and after the speech recognition model meets the training completion condition, the trained speech recognition model is obtained.
In this embodiment, the training completion conditions include: the loss value of the speech recognition model is less than a first loss value threshold. Wherein the first loss threshold may be determined based on specific training requirements, e.g., the first loss threshold is 0.01.
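As a concrete rendering of steps A and B, here is a minimal PyTorch training-loop sketch; the model interface (returning logits of shape (batch, seq_len, vocab_size)), the AdamW optimizer, and the learning rate are assumptions for illustration.

```python
import torch
import torch.nn as nn

def train(model, training_set, loss_threshold=0.01, max_steps=10_000):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    for step in range(max_steps):
        features, labels = training_set[step % len(training_set)]  # select a sample
        logits = model(features.unsqueeze(0))      # (1, seq_len, vocab_size)
        # Step A: cross entropy between predicted and new word unit sequences.
        loss = criterion(logits.view(-1, logits.size(-1)), labels.view(-1))
        optimizer.zero_grad()
        loss.backward()        # gradients give the fastest-change direction
        optimizer.step()       # gradient-descent parameter update
        # Step B: training-completion condition, e.g. first loss threshold 0.01.
        if loss.item() < loss_threshold:
            break
    return model
```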
Optionally, in this embodiment, in response to the speech recognition model not meeting the training completion condition, the relevant parameters in the speech recognition model are adjusted so that the loss value of the speech recognition model converges, and the training steps a-B are continuously performed based on the adjusted speech recognition model.
In this embodiment, when the speech recognition model does not meet the training completion condition, the relevant parameters of the speech recognition model are adjusted, which is helpful to help the convergence of the loss value of the speech recognition model.
In this embodiment, when the Whisper model is used to recognize a language with few samples, the token sequence (word unit sequence) received by the Whisper model must include a language token (word unit), by which the model determines which language to use for speech recognition. However, since the original Whisper model was not trained on data of this language, a token for this language is missing. For this reason, the speech recognition model was trained many times, attempting to substitute tokens of other languages such as Chinese and English for the token of this language, or to assign a new language token to this language, but the model effect after training did not meet expectations. When the original Whisper model was used to test audio of this language, it was found that the model predicted most of the audio of this language as one other particular language (the predicted word unit characterizing the language); replacing the token of this language with the token of that predicted language achieves a good training effect.
In this embodiment, training the speech recognition model based on its loss value to obtain a trained speech recognition model includes: in response to the loss value of the speech recognition model being smaller than the loss threshold, determining that the speech recognition model meets the training completion condition, and taking the feature recognition sub-network and the classification sub-network of the speech recognition model that meets the training completion condition as the speech recognition model.
The embodiment of the disclosure provides a speech recognition model training method. First, a speech sample set is obtained, the speech sample set including at least one speech sample, the speech sample including an audio feature sequence and an initial word unit sequence. Second, an initial speech recognition model is obtained, the speech recognition model being used for characterizing the correspondence between audio feature sequences and predicted word unit sequences. Third, the language word units in the initial word unit sequences in the speech sample set are replaced with the predicted word unit characterizing the language, so as to obtain a training sample set, wherein the predicted word unit is taken from the predicted word unit sequence obtained by inputting a speech sample selected from the speech sample set into the speech recognition model. Finally, the speech recognition model is trained based on the training sample set to obtain a trained speech recognition model. Therefore, before the speech recognition model is trained, the language word units in the initial word unit sequences in the speech sample set are replaced with the predicted word unit characterizing the language, so that the model can predict the audio feature sequences in a familiar word unit environment, which improves the convergence rate of the model and the training effect of the model.
In some optional implementations of the disclosure, the speech recognition model includes: the system comprises an initial recognition sub-model and a keyword module connected with the initial recognition sub-model; based on the training sample set, training the speech recognition model, the obtaining the trained speech recognition model comprising: training an initial recognition sub-model based on the training sample set to obtain a trained recognition sub-model; training a keyword module based on the training sample set and the trained recognition sub-model to obtain a trained keyword module; and taking the trained recognition sub-model and the trained keyword module as the trained voice recognition model.
As shown in fig. 2, the speech recognition model includes: the recognition sub-model and keyword module inputs training samples in the training sample set into the recognition sub-model, and trains the recognition sub-model for multiple times, so that the trained recognition sub-model can be obtained.
According to the speech recognition model training method provided by this alternative implementation, an initial recognition sub-model is trained first to obtain a trained recognition sub-model; then the keyword module is trained based on the training sample set and the trained recognition sub-model to obtain a trained keyword module. Training the initial recognition sub-model and the keyword module step by step improves the reliability of the trained speech recognition model.
In some optional implementations of the disclosure, training the keyword module based on the training sample set and the trained recognition sub-model, where obtaining the trained keyword module includes: fixing parameters of the trained recognition sub-model; inputting training samples in the training sample set into the trained recognition sub-model to obtain a first probability distribution output by the trained recognition sub-model and a second probability distribution output by the keyword module; calculating a loss value of an initial word unit in the training sample based on the first probability distribution and the second probability distribution; and obtaining a trained keyword module based on the loss value.
In this alternative implementation, as shown in FIG. 2, the recognition sub-model is a Whisper model. To enhance the model's recognition of specific keywords (bias words), the keyword module of the present disclosure may use a TCPGen (Tree-Constrained Pointer Generator) module; the Whisper model (here, the trained recognition sub-model) can be assisted by the TCPGen module to enhance its recognition of keywords. First, a keyword sequence that needs focused recognition (a biasing list) is generated. The TCPGen module then builds a keyword prefix tree from the keyword sequence, generates a probability distribution P_ptr(y_i) over all tokens by traversing the tree, and generates a probability P_gen(y_i). Specifically, the TCPGen module first obtains the token sequence corresponding to each keyword by traversing the keyword prefix tree, then obtains the embedding corresponding to each token, and finally generates the probability distribution P_ptr(y_i) and the probability P_gen(y_i) using a fully connected layer and a softmax function, respectively. Finally, these are combined in a weighted sum with the token probability distribution P_mdl(y_i) generated by the Whisper model (although the Whisper model parameters are fixed, it still performs forward prediction), to obtain the final token probability distribution P(y_i), as shown in formula (1):
P(y_i) = P_mdl(y_i) * (1 - P_gen(y_i)) + P_ptr(y_i) * P_gen(y_i)    (1)
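Formula (1) reduces to a single tensor expression; the sketch below assumes all three probabilities have already been computed and share the same vocabulary indexing.

```python
import torch

def tcpgen_interpolate(p_mdl, p_ptr, p_gen):
    """Formula (1): P(y_i) = P_mdl(y_i)(1 - P_gen(y_i)) + P_ptr(y_i)P_gen(y_i).

    p_mdl, p_ptr: token probability distributions (tensors over the vocabulary);
    p_gen: generation probability, broadcast against the distributions.
    """
    return p_mdl * (1.0 - p_gen) + p_ptr * p_gen
```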
In this alternative implementation, the keyword module is trained after the recognition sub-model has finished training, which ensures the training effect of the keyword module. Specifically, the Whisper model needs the technique of shifting the word unit sequence right by one position during its training, but experiments show that this right-shift technique must not be used when the TCPGen module is trained independently; otherwise the training loss oscillates and the module cannot be trained. This is because the TCPGen module does not directly predict the token sequence, but corrects the predictions of the original Whisper model. Therefore, training the TCPGen module after the Whisper model training is completed, without the word-unit right-shift technique, improves the training effect of the TCPGen module.
According to the method for training the keyword module provided by this alternative implementation, the parameters of the trained recognition sub-model are fixed first; then, based on the training sample set, a first probability distribution output by the trained recognition sub-model and a second probability distribution output by the keyword module are computed; the loss value of the initial word units in the training samples is calculated based on the first probability distribution and the second probability distribution, and the trained keyword module is obtained based on the loss value, thereby improving the reliability of keyword module training.
In some optional implementations of the disclosure, training the initial recognition sub-model based on the training sample set, where obtaining the trained recognition sub-model includes: selecting a training sample from the training sample set to obtain a selected sample; right shifting a new word unit in the new word unit sequence in the selected sample by one bit to obtain a training word unit sequence; based on the audio feature sequence and the training word unit sequence in the selected sample, training the initial recognition sub-model to obtain a trained recognition sub-model.
In this alternative implementation, the new word unit sequence is a word unit sequence obtained by replacing a word unit of a language in the initial word unit sequence in the speech sample set with a predicted word unit of a characterization language, and the new word unit is a word unit in the new word unit sequence.
In this alternative implementation, the recognition sub-model may be a Whisper model. To match the autoregressive character of the Whisper model during training, the technique of shifting the word unit sequence right by one position (shift_tokens_right) is used in the training process; that is, the model must predict the next word of the token sequence, and to achieve this, the token sequence input to the model is shifted one unit to the right. However, experiments show that this right-shift technique should not be used when the TCPGen module is trained independently; otherwise the training loss oscillates and the module cannot be trained, because the TCPGen module does not directly predict the token sequence but corrects the predictions of the original Whisper model.
In this alternative implementation, the right shift of a new word unit in the new word unit sequence in the selected sample is performed for the purpose of enabling the initial recognition sub-model to effectively predict the next Token.
According to the method for training the initial recognition sub-model, the training sample is selected from the training sample set, the new word units in the new word unit sequence in the selected sample are shifted to the right by one bit to obtain the training word unit sequence, and the initial recognition sub-model is trained based on the audio feature sequence and the training word unit sequence in the selected sample, so that the reliability of the trained recognition sub-model is improved, and the training effect of the voice recognition model is improved.
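The right-shift-by-one operation described above can be sketched as follows, mirroring the common shift_tokens_right helper; the start-token id is an assumption supplied by the caller.

```python
import torch

def shift_tokens_right(token_ids, start_token_id):
    """Shift a (batch, seq_len) tensor of word unit ids right by one position."""
    shifted = token_ids.new_zeros(token_ids.shape)
    shifted[:, 1:] = token_ids[:, :-1]      # every unit moves one slot right
    shifted[:, 0] = start_token_id          # decoder start symbol fills slot 0
    return shifted
```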
In some optional implementations of the disclosure, the acquiring a set of speech samples includes: acquiring an initial data set comprising at least one initial voice; preprocessing an initial data set to obtain a processed data set; performing data enhancement on the processed data set to obtain a voice data set; based on the speech data set, a speech sample set is obtained.
In this alternative implementation, the initial speech is speech data of a language, where the initial speech may be unprocessed speech data, and the speech sample set is obtained by processing the initial speech in the initial data set. For example, where the initial speech is speech in a rare language, the resulting speech sample set is a sample set of the rare language from which a speech recognition model may be trained.
In this optional implementation, data preprocessing is the process of preparing the raw data and adapting it to the machine learning model; through preprocessing, the processed data set can be adapted to machine learning model training. Here, preprocessing the initial data set to obtain the processed data set includes: removing unrecognizable information, such as stray punctuation marks, from the initial data.
In this alternative implementation, the data enhancement is performed on the processed data set based on the problem that the sample information amount of the language corresponding to the initial data set is small, so as to increase the number of voice data in the voice data set.
In this optional implementation manner, obtaining a speech sample set based on the speech data set includes: extracting the speech features of each piece of speech data in the speech data set; fixing each speech feature to a defined length to obtain the audio feature of each piece of speech data; and determining the initial word units of each piece of speech data, so as to obtain a speech sample set comprising at least one speech sample. The speech feature may be a mel feature or an Fbank (Filter Banks) feature, and speech feature extraction is a mature method that is not described here again.
The method for acquiring the voice sample set provided by the alternative implementation mode comprises the steps of firstly preprocessing an initial data set to obtain a processed data set; performing data enhancement on the processed data set to obtain a voice data set; based on the voice data set, a voice sample set is obtained, and reliability of the voice sample set is improved.
In some optional implementations of the disclosure, the preprocessing the initial data set to obtain a processed data set includes at least one of: sampling all initial data in the initial data set at the same sampling rate; duplicate and unrecognizable initial data in the initial data set is deleted.
Optionally, preprocessing the initial data set to obtain a processed data set may further include: and processing all the initial data in the initial data set to enable all the initial data to be located in the same audio channel.
Optionally, preprocessing the initial data set to obtain a processed data set may further include: generating new initial data using speech synthesis technology and adding it to the processed data set. Because training samples for a language with few samples are scarce, supplementing synthesized audio of this language as training samples increases the scale of the data set. Experiments found that the synthetic portion should not be too large, otherwise it may harm the training effect, mainly because the quality of synthetic data is difficult to guarantee; for this reason, the amount of new initial data may be kept at about 30% of the total speech sample set.
The alternative implementation mode provides a method for preprocessing an initial data set, which comprises the steps of sampling the initial data set at the same sampling rate, deleting repeated and unrecognizable initial data in the initial data set, and therefore improving the diversity of voice preprocessing.
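A preprocessing sketch under stated assumptions (16 kHz target rate, mono mixdown, byte-level exact-duplicate detection); filtering of unrecognizable data would need language-specific checks not shown here.

```python
import torchaudio

def preprocess(wav_paths, target_sr=16000):
    seen, processed = set(), []
    for path in wav_paths:
        waveform, sr = torchaudio.load(path)
        waveform = waveform.mean(dim=0, keepdim=True)   # same audio channel (mono)
        if sr != target_sr:                             # same sampling rate
            waveform = torchaudio.functional.resample(waveform, sr, target_sr)
        key = hash(waveform.numpy().tobytes())          # crude exact-duplicate check
        if key in seen:
            continue                                    # delete duplicate data
        seen.add(key)
        processed.append(waveform)
    return processed
```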
In some optional implementations of the disclosure, the data enhancing the processed data set to obtain the speech data set includes at least one of: modifying the speech speed of each processing data in the processing data set to obtain an increasing data set, and increasing the increasing data set in the processing data set; adding reverberation to each processing data in the processing data set to obtain a reverberation data set, and adding the reverberation data set in the processing data set; adding noise to each processing data in the processing data set to obtain a noise data set, and adding the noise data set in the processing data set; and modifying the audio spectrogram of each processing data in the processing data set to obtain a spectrum data set, and adding the spectrum data set in the processing data set.
In this alternative implementation, modifying the speech rate of the processed data (audio) in the processed data set increases the size of the training set and makes the data more diverse; randomly adding reverberation to data in the processed data set simulates various acoustic scenes; randomly adding noise to data in the processed data set increases data complexity; and the SpecAugment technique (used for speech recognition enhancement) can be used to modify the audio spectrograms of the processed data, improving the quality of the audio features.
The method for enhancing the data of the processed data provided by the alternative implementation mode comprises the following steps: modifying the speech rate of the processed data in the processed data, increasing reverberation for the processed data, increasing noise for the processed data, modifying at least one or more of the audio spectrograms of the processed data, and improving the diversity of the speech data set.
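Two of the four listed enhancements, sketched with torchaudio; the noise scale and mask widths are illustrative assumptions, and speed and reverberation changes are omitted for brevity.

```python
import torch
import torchaudio

freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=15)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=35)

def add_noise(waveform, noise_scale=0.005):
    """Additive Gaussian noise on the raw waveform."""
    return waveform + noise_scale * torch.randn_like(waveform)

def spec_augment(spectrogram):
    """SpecAugment-style masking on a (channel, freq, time) spectrogram."""
    return time_mask(freq_mask(spectrogram))
```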
Further, based on the voice recognition model training method provided by the embodiment, the disclosure also provides an embodiment of a voice recognition model testing method, and the voice recognition model testing method disclosed by the disclosure combines the artificial intelligence fields of voice recognition, deep learning and the like.
Referring to fig. 3, a flow 300 of one embodiment of a speech recognition model testing method according to the present disclosure is shown, the speech recognition model testing method provided by the present embodiment includes the steps of:
Step 301, a test set and a trained speech recognition model are obtained.
In this embodiment, the test set includes at least one test sample, the test sample including: an audio feature sequence and a test word unit sequence.
In this embodiment, the test sample has the same data structure as a training sample; that is, the test sample includes an audio feature sequence and a new word unit sequence in which the language word unit of the initial word unit sequence has been replaced with the predicted word unit characterizing the language, the new word unit sequence including at least one new word unit. The test sample, however, is not used as a training sample for training the speech recognition model.
In this embodiment, the speech recognition model may be a trained speech recognition model obtained by training with the method described in the embodiment of FIG. 1; the specific training process is described in the related portions of the embodiment of FIG. 1 and is not repeated here.
In this embodiment, the speech recognition model includes: the system comprises an encoder, a decoder and a keyword module, wherein the encoder and the decoder form a trained recognition sub-model.
Step 302, a test sample is selected from the test set.
In this embodiment, the executing body may select a test sample from the test set obtained in step 301. The selection manner and the number of test samples selected from the test set are not limited in the present application.
Step 303, inputting the audio feature sequence in the test sample into an encoder to obtain audio intermediate features.
In this embodiment, as shown in FIG. 4, the audio feature sequence may be the audio mel features extracted from the audio; the audio mel features are input into the encoder of the trained recognition sub-model to obtain the audio intermediate features generated by the encoder. When the decoder predicts for the first time, a generated starting token sequence (not shown in FIG. 4) is used as the current word unit sequence; as testing proceeds and multiple predictions are made, the current word unit sequence is replaced by the sequence updated based on the predicted word units.
Step 304, inputting the audio intermediate feature and the current word unit sequence into a decoder to obtain a predicted word unit output by the speech recognition model.
In this embodiment, when the decoder predicts for the first time, the audio intermediate feature and the starting token sequence are sent into the decoder together for decoding to obtain a predicted token output by the decoder. It should be noted that one of the predicted word units output by the decoder may be the predicted word unit characterizing the language.
And step 305, in response to the predicted word unit being a non-ending symbol, updating the current word unit sequence based on the predicted word unit, and continuously inputting the audio intermediate feature and the current word unit sequence into the decoder until the predicted word unit output by the speech recognition model is the ending symbol, thereby obtaining all the predicted word units corresponding to the test sample.
As shown in FIG. 4, when the predicted word unit is a non-end symbol, i.e., not EOT, the current word unit sequence is updated based on the predicted word unit from the decoder; specifically, the current word unit sequence may be updated by appending the predicted token (predicted word unit) to the end of the current token sequence. When the decoder predicts for the first time, the predicted word unit output by the decoder is appended to the starting token sequence, and the updated sequence is used as the current word unit sequence for the decoder's next prediction. As shown in FIG. 4, when the predicted word unit is the end symbol, prediction over the audio feature sequence ends.

In this embodiment, after the decoder performs multiple predictions, if the predicted word unit output by the decoder is EOT, the decoder ends prediction at that point, and the predicted word units output from the first output up to (but not including) the EOT are combined to obtain all the predicted word units.
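A greedy sketch of the loop in steps 303 through 305; `encoder`, `decoder`, the start ids, and the EOT id are placeholders for the trained recognition sub-model's actual interfaces.

```python
import torch

def autoregressive_decode(encoder, decoder, audio_features,
                          start_ids, eot_id, max_len=448):
    """Greedy decoding loop over steps 303-305."""
    audio_intermediate = encoder(audio_features)        # step 303
    current = list(start_ids)                           # starting token sequence
    predicted_units = []
    while len(current) < max_len:
        logits = decoder(audio_intermediate, torch.tensor([current]))
        next_unit = int(logits[0, -1].argmax())         # step 304: next word unit
        if next_unit == eot_id:                         # end symbol reached
            break
        predicted_units.append(next_unit)
        current.append(next_unit)                       # step 305: update sequence
    return predicted_units
```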
Step 306, detecting whether the speech recognition model is qualified or not based on all the predicted word units corresponding to the test sample and the test word unit sequences in the test sample.
In this embodiment, all the predicted word units may be arranged together according to the prediction order of the decoder to obtain a predicted word unit sequence, and the predicted word unit sequence is compared with the test word unit sequence in the test sample; in response to the predicted word unit sequence being the same as the test word unit sequence, determining that the speech recognition model is tested to be qualified, and directly applying the speech recognition model to the prediction of the text of the language; and in response to the fact that the predicted word unit sequence is different from the test word unit sequence, determining that the speech recognition model is not qualified in test, and cannot be directly applied to the prediction of the text of the language.
According to the speech recognition model testing method provided by the embodiment of the disclosure, a test set and a speech recognition model are obtained, a test sample is selected from the test set, and the audio feature sequence in the test sample is input into the encoder; as long as the speech recognition model has not output the end symbol, the current word unit sequence keeps being replaced by the updated sequence. Thus, the speech recognition model can keep predicting over the audio intermediate features of the test sample to obtain their predicted word unit sequence, and whether the speech recognition model passes the test is detected through the predicted word unit sequence, improving the reliability of speech recognition model testing.
In some optional implementations of the disclosure, detecting whether the speech recognition model is qualified based on all the predicted word units corresponding to the test sample and the sequence of the test word units in the test sample includes: sequencing all the predicted word units corresponding to the test sample to obtain a predicted word unit sequence; calculating the word error rate of the test sample based on the predicted word unit sequence and the test word unit sequence in the test sample; and determining that the speech recognition model is qualified in response to the word error rate being less than the error threshold.
In the alternative implementation mode, the predicted word unit sequence is compared with the test word unit sequence, the number of word units in the predicted word unit sequence, which is different from the test word unit sequence, is determined, and the ratio of the number of word units to the total number of the test word units in the test word unit sequence is calculated to obtain the word error rate.
In this alternative implementation, the error threshold may be determined based on the test requirement, for example, the error threshold is 20%, that is, less than 20% of the predicted word units in the predicted word unit sequence are different from the test word units in the test word unit sequence, so as to determine that the speech recognition model is qualified.
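The qualification check, sketched literally from the description above as a position-wise mismatch ratio (standard WER implementations instead use edit distance; the threshold is the illustrative 20%):

```python
def is_model_qualified(predicted_units, test_units, error_threshold=0.20):
    """Position-wise word error rate test, per the description above."""
    length = max(len(predicted_units), len(test_units))
    mismatches = sum(
        1 for i in range(length)
        if i >= len(predicted_units)
        or i >= len(test_units)
        or predicted_units[i] != test_units[i]
    )
    word_error_rate = mismatches / len(test_units)
    return word_error_rate < error_threshold
```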
According to the method for detecting whether the voice recognition model is qualified or not, the word error rate of the test sample is calculated through the predicted word unit sequence and the test word unit sequence corresponding to the test sample, a reliable implementation means is provided for testing the voice recognition model, and the reliability of testing the voice recognition model is improved.
Further, based on the speech recognition model training method provided by the foregoing embodiments, the disclosure also provides an embodiment of a speech recognition method; the speech recognition method of the disclosure involves artificial intelligence fields such as speech recognition and deep learning.
Referring to fig. 5, a flow 500 of one embodiment of the speech recognition method of the disclosure is shown, which includes the following steps:
Step 501, a voice to be recognized is obtained.
In this embodiment, the execution subject of the speech recognition method may acquire the voice to be recognized in various ways. For example, the execution subject may acquire, through a wired or wireless connection, voice to be recognized that is stored in a database server. As another example, the execution subject may receive voice to be recognized collected in real time by a terminal or other device.
In this embodiment, the voice to be recognized is voice information that needs to be converted from speech to text. The voice to be recognized may be speech in a rare language, for example, a language with few samples.
Step 502, processing the voice to be recognized to obtain audio feature data.
In this embodiment, the execution subject may process the voice to be recognized obtained in step 501 to obtain the audio feature data.
In this embodiment, the audio feature data is the data obtained by performing feature extraction (e.g., Mel or Fbank features) on the voice to be recognized. Specifically, step 502 includes fixing the voice to be recognized to a set length (for example, 30 seconds) and extracting the Mel or Fbank features of the voice at that length, to obtain audio feature data of fixed length.
In this embodiment, the audio feature data has the same data structure as the audio feature sequence.
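A minimal sketch of this processing, assuming a 16 kHz sample rate, an 80-bin log-Mel spectrogram, and librosa as the feature library; all three are illustrative choices rather than requirements of the disclosure.

```python
import numpy as np
import librosa

def extract_features(path, sr=16000, duration_s=30, n_mels=80):
    """Fix the utterance length, then extract log-Mel features (a minimal sketch)."""
    y, _ = librosa.load(path, sr=sr)
    target = sr * duration_s
    if len(y) < target:
        y = np.pad(y, (0, target - len(y)))  # pad short audio with trailing silence
    else:
        y = y[:target]                       # trim long audio to the fixed length
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)          # fixed-size (n_mels, frames) matrix
```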
Step 503, inputting the audio feature data into a speech recognition model to obtain a predicted word unit sequence of the speech to be recognized.
In this embodiment, the speech recognition model may be a trained speech recognition model obtained by training with the method described in the embodiment of fig. 1; for the specific training process, reference may be made to the description of the embodiment of fig. 1, which is not repeated here.
Step 504, obtaining text data of the voice to be recognized based on the predicted word unit sequence.
In this embodiment, restoring a word unit sequence to text data is a mature technique, and converting the predicted word unit sequence into text data of the voice to be recognized likewise uses mature means, which are not described again here.
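For illustration only, when the word units are subword tokens from a vocabulary, this conversion can be as simple as a lookup and join. The `id2unit` mapping and the sentencepiece-style `▁` word-boundary marker below are assumptions of the sketch, not part of the disclosure.

```python
def units_to_text(unit_ids, id2unit):
    """Map predicted word unit ids back to tokens and join them into text."""
    tokens = [id2unit[i] for i in unit_ids]
    # Assumes sentencepiece-style subwords in which '▁' marks a word boundary.
    return "".join(tokens).replace("▁", " ").strip()

# e.g. units_to_text([5, 9], {5: "▁hel", 9: "lo"}) -> "hello"
```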
The speech recognition method provided by the disclosure can be applied to social, learning, working, and other scenarios involving languages with few samples. First, it can be applied in regions where such a language is the main language, helping people interact with intelligent devices more conveniently. Second, the model can be used in the education field to assist students in learning the pronunciation and listening comprehension of the language. Furthermore, in the business field, it can be used for customer service and market research to meet the needs of users of the language.
According to the speech recognition method provided by this embodiment of the disclosure, the voice to be recognized is acquired, the voice to be recognized is processed to obtain audio feature data, the audio feature data is input into the speech recognition model generated by the speech recognition model training method to obtain the predicted word unit sequence of the voice to be recognized, and the text data of the voice to be recognized is obtained from the predicted word unit sequence. A speech recognition result is thus generated by the speech recognition model, improving the reliability and accuracy of speech recognition.
With further reference to fig. 6, as an implementation of the method illustrated in the foregoing figures, the present disclosure provides an embodiment of a speech recognition model training apparatus, which corresponds to the method embodiment illustrated in fig. 1, and which is particularly applicable in a variety of electronic devices.
As shown in fig. 6, the speech recognition model training apparatus 600 provided in this embodiment includes: a sample acquisition unit 601, a model acquisition unit 602, a replacing unit 603, and a training unit 604. The sample acquisition unit 601 may be configured to acquire a speech sample set, where the speech sample set includes at least one speech sample, and the speech sample includes: an audio feature sequence and an initial word unit sequence. The model acquisition unit 602 may be configured to acquire an initial speech recognition model, where the speech recognition model is used to characterize the correspondence between the audio feature sequence and the predicted word unit sequence. The replacing unit 603 may be configured to replace the language word units in the initial word unit sequence in the speech sample set with predicted word units characterizing the language to obtain a training sample set, where the predicted word units are predicted word units in the predicted word unit sequence obtained by inputting a speech sample selected from the speech sample set into the speech recognition model. The training unit 604 may be configured to train the speech recognition model based on the training sample set to obtain a trained speech recognition model.
In the present embodiment, in the speech recognition model training apparatus 600: the specific processing of the sample acquiring unit 601, the model acquiring unit 602, the replacing unit 603, and the training unit 604 and the technical effects thereof may refer to the relevant descriptions of the steps 101, 102, 103, and 104 in the corresponding embodiment of fig. 1, and are not repeated herein.
In some optional implementations of this embodiment, the speech recognition model includes: an initial recognition sub-model and a keyword module connected to the initial recognition sub-model; the training unit 604 is configured to: training the initial recognition sub-model based on the training sample set to obtain a trained recognition sub-model; training the keyword module based on the training sample set and the trained recognition sub-model to obtain a trained keyword module; and taking the trained recognition sub-model and the trained keyword module as the trained speech recognition model.
In some optional implementations of this embodiment, the training unit 604 is further configured to: fixing parameters of the trained recognition sub-model; inputting training samples in the training sample set into the trained recognition sub-model to obtain a first probability distribution output by the trained recognition sub-model and a second probability distribution output by the keyword module; calculating a loss value of an initial word unit in the training sample based on the first probability distribution and the second probability distribution; and obtaining a trained keyword module based on the loss value.
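A minimal sketch of this stage, assuming PyTorch and several hypothetical interfaces: a recognizer that returns hidden states alongside its log-probability distribution, a keyword module that maps those hidden states to a second distribution, and a loss that interpolates the two distributions before negative log-likelihood. The disclosure does not specify these details; only the frozen-parameter scheme is taken from the text.

```python
import torch
import torch.nn.functional as F

def train_keyword_module(recognizer, keyword_module, loader, optimizer):
    """Train the keyword module while the recognition sub-model stays frozen."""
    for p in recognizer.parameters():
        p.requires_grad = False                    # fix the trained sub-model's parameters
    half = torch.log(torch.tensor(0.5))
    for features, targets in loader:
        hidden, log_p1 = recognizer(features)      # hypothetical: hidden states + first distribution
        log_p2 = keyword_module(hidden)            # second distribution from the keyword module
        # Hypothetical loss over the word units, computed from both distributions
        # by interpolating them in log space before the NLL loss.
        log_mix = torch.logaddexp(log_p1 + half, log_p2 + half)
        loss = F.nll_loss(log_mix.transpose(1, 2), targets)
        optimizer.zero_grad()                      # optimizer holds only keyword_module.parameters()
        loss.backward()
        optimizer.step()
```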
In some optional implementations of this embodiment, the training unit 604 is further configured to: selecting a training sample from the training sample set to obtain a selected sample; right-shifting the new word units in the new word unit sequence in the selected sample by one position to obtain a training word unit sequence; and training the initial recognition sub-model based on the audio feature sequence and the training word unit sequence in the selected sample to obtain a trained recognition sub-model.
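The one-position right shift is the standard teacher-forcing arrangement: the decoder input at each step is the previous target unit. A sketch, assuming integer unit ids, that the vacated first position is filled with a begin symbol `bos_id`, and that the last unit is dropped:

```python
def right_shift(new_units, bos_id):
    """Shift the new word unit sequence right by one position for decoder input."""
    # Decoder input [BOS, u1, ..., u_{n-1}] is paired with target [u1, ..., un].
    return [bos_id] + new_units[:-1]

train_inputs = right_shift([17, 42, 7], bos_id=0)  # -> [0, 17, 42]
```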
In some optional implementations of the present embodiment, the sample acquisition unit 601 is configured to: acquiring an initial data set comprising at least one initial voice; preprocessing the initial data set to obtain a processed data set; performing data enhancement on the processed data set to obtain a speech data set; and obtaining a speech sample set based on the speech data set.
In some optional implementations of the present embodiment, the sample acquisition unit 601 is further configured to implement at least one of the following: sampling all initial data in the initial data set at the same sampling rate; and deleting duplicate and unrecognizable initial data from the initial data set.
In some optional implementations of the present embodiment, the sample acquisition unit 601 is further configured to implement at least one of the following: modifying the speech speed of each piece of processed data in the processed data set to obtain a speed-perturbed data set, and adding the speed-perturbed data set to the processed data set; adding reverberation to each piece of processed data in the processed data set to obtain a reverberation data set, and adding the reverberation data set to the processed data set; adding noise to each piece of processed data in the processed data set to obtain a noise data set, and adding the noise data set to the processed data set; and modifying the audio spectrogram of each piece of processed data in the processed data set to obtain a spectrum data set, and adding the spectrum data set to the processed data set.
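A sketch of the four augmentations, assuming waveform arrays at a known sample rate. librosa's time stretch stands in for speed modification, convolution with a placeholder room impulse response `rir` stands in for reverberation, and additive Gaussian noise and frequency masking stand in for the noise and spectrogram operations; the disclosure does not fix any of these particular techniques.

```python
import numpy as np
import librosa

def augment(y, sr, rir, noise_scale=0.005):
    """Produce speed-perturbed, reverberated, noisy, and spectrogram-masked variants."""
    sped = librosa.effects.time_stretch(y, rate=1.1)      # modify the speech speed
    reverbed = np.convolve(y, rir)[: len(y)]              # add reverberation via an impulse response
    noisy = y + noise_scale * np.random.randn(len(y))     # add Gaussian noise
    mel = librosa.feature.melspectrogram(y=y, sr=sr)
    f0 = np.random.randint(0, mel.shape[0] - 8)
    mel[f0 : f0 + 8, :] = 0.0                             # mask a frequency band of the spectrogram
    return sped, reverbed, noisy, mel
```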
The embodiment of the present disclosure provides a speech recognition model training apparatus. First, the sample acquisition unit 601 acquires a speech sample set, where the speech sample set includes at least one speech sample, and the speech sample includes: an audio feature sequence and an initial word unit sequence; second, the model acquisition unit 602 acquires an initial speech recognition model, where the speech recognition model is used to characterize the correspondence between the audio feature sequence and the predicted word unit sequence; third, the replacing unit 603 replaces the language word units in the initial word unit sequence in the speech sample set with predicted word units characterizing the language to obtain a training sample set, where the predicted word units are predicted word units in the predicted word unit sequence obtained by inputting a speech sample selected from the speech sample set into the speech recognition model; finally, the training unit 604 trains the speech recognition model based on the training sample set to obtain a trained speech recognition model. Thus, before the speech recognition model is trained, the language word units in the initial word unit sequence in the speech sample set are replaced with predicted word units characterizing the language, so that the model can predict the audio feature sequence in a familiar word-unit environment, which increases the convergence rate of the model and improves the training effect.
With further reference to fig. 7, as an implementation of the method shown in the foregoing figures, the present disclosure provides an embodiment of a speech recognition model testing apparatus, which corresponds to the method embodiment shown in fig. 3, and which is particularly applicable to various electronic devices.
As shown in fig. 7, the speech recognition model testing apparatus 700 provided in this embodiment includes: an information acquisition unit 701, a selection unit 702, an input unit 703, an acquisition unit 704, an update unit 705, and a test unit 706. Wherein the information obtaining unit 701 may be configured to obtain a test set and a trained speech recognition model, where the test set includes at least one test sample, and the test sample includes: an audio feature sequence and a test word unit sequence; the trained speech recognition model is obtained by training the speech recognition model training device, and the speech recognition model comprises: encoder, decoder and keyword module. The selection unit 702 may be configured to select a test sample from a test set. The input unit 703 may be configured to input the audio feature sequence in the test sample to the encoder, resulting in audio intermediate features. The obtaining unit 704 may be configured to input the audio intermediate feature and the current word unit sequence into a decoder to obtain a predicted word unit output by the speech recognition model. The updating unit 705 may be configured to update the current word unit sequence based on the predicted word unit in response to the predicted word unit being a non-ending symbol, and continue to input the audio intermediate feature and the current word unit sequence into the decoder until the predicted word unit output by the speech recognition model is an ending symbol, thereby obtaining all the predicted word units corresponding to the test sample. The test unit 706 may be configured to detect whether the speech recognition model is qualified based on all the predicted word units corresponding to the test sample and the sequence of the test word units in the test sample.
In the present embodiment, in the speech recognition model testing apparatus 700: the specific processing and technical effects of the information acquisition unit 701, the selection unit 702, the input unit 703, the obtaining unit 704, the updating unit 705, and the test unit 706 may refer to the descriptions of steps 301 to 306 in the corresponding embodiment of fig. 3, which are not repeated here.
In some optional implementations of the present embodiment, the test unit 706 is further configured to: sequencing all the predicted word units corresponding to the test sample to obtain a predicted word unit sequence; calculating the word error rate of the test sample based on the predicted word unit sequence and the test word unit sequence in the test sample; and determining that the speech recognition model is qualified in response to the word error rate being less than the error threshold.
With further reference to fig. 8, as an implementation of the method shown in the foregoing figures, the present disclosure provides an embodiment of a speech recognition apparatus, which corresponds to the method embodiment shown in fig. 5, and which is particularly applicable in various electronic devices.
As shown in fig. 8, the voice recognition apparatus 800 provided in the present embodiment includes: a voice acquisition unit 801, a processing unit 802, a recognition unit 803, and a conversion unit 804. The voice acquiring unit 801 may be configured to acquire a voice to be recognized. The processing unit 802 may be configured to process the speech to be recognized to obtain audio feature data. The above-mentioned recognition unit 803 may be configured to input the audio feature data into a speech recognition model, to obtain a predicted word unit sequence of the speech to be recognized, where the speech recognition model is a trained speech recognition model obtained by the speech recognition model training device of the above-mentioned embodiment. The conversion unit 804 may be configured to obtain text data of the speech to be recognized based on the predicted word unit sequence.
In the present embodiment, in the voice recognition apparatus 800: the specific processing of the voice obtaining unit 801, the processing unit 802, the identifying unit 803, and the converting unit 804 and the technical effects thereof may refer to the descriptions related to step 501, step 502, step 503, and step 504 in the corresponding embodiment of fig. 5, and are not repeated herein.
In the technical solution of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of users' personal information comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 9 shows a schematic block diagram of an example electronic device 900 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the device 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
Various components in device 900 are connected to I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, or the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, an optical disk, or the like; and a communication unit 909 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the respective methods and processes described above, such as a speech recognition model training method or a speech recognition model testing method or a speech recognition method. For example, in some embodiments, the speech recognition model training method or the speech recognition model testing method or the speech recognition method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the above-described speech recognition model training method or speech recognition model testing method or speech recognition method may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform a speech recognition model training method or a speech recognition model testing method or a speech recognition method in any other suitable way (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs, where the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable speech recognition model training device or speech recognition model testing device or speech recognition device such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.
Claims (23)
1. A method of training a speech recognition model, the method comprising:
obtaining a set of speech samples, the set of speech samples comprising at least one speech sample, the speech sample comprising: an audio feature sequence and an initial word unit sequence;
acquiring an initial speech recognition model, wherein the speech recognition model is used for characterizing a correspondence between an audio feature sequence and a predicted word unit sequence;
replacing language word units in the initial word unit sequence in the set of speech samples with predicted word units characterizing the language to obtain a training sample set, wherein the predicted word units are predicted word units in a predicted word unit sequence obtained by inputting a speech sample selected from the set of speech samples into the speech recognition model;
and training the speech recognition model based on the training sample set to obtain a trained speech recognition model.
2. The method of claim 1, wherein the speech recognition model comprises: an initial recognition sub-model and a keyword module connected to the initial recognition sub-model; and wherein training the speech recognition model based on the training sample set to obtain a trained speech recognition model comprises:
training the initial recognition sub-model based on the training sample set to obtain a trained recognition sub-model;
training the keyword module based on the training sample set and the trained recognition sub-model to obtain a trained keyword module;
and taking the trained recognition sub-model and the trained keyword module as the trained speech recognition model.
3. The method of claim 2, wherein training the keyword module based on the training sample set and the trained recognition sub-model to obtain the trained keyword module comprises:
fixing parameters of the trained recognition sub-model;
inputting training samples in the training sample set into the trained recognition sub-model to obtain a first probability distribution output by the trained recognition sub-model and a second probability distribution output by the keyword module;
calculating a loss value of an initial word unit in a training sample based on the first probability distribution and the second probability distribution;
and obtaining the trained keyword module based on the loss value.
4. The method of claim 2, wherein training the initial recognition sub-model based on the training sample set to obtain a trained recognition sub-model comprises:
selecting a training sample from the training sample set to obtain a selected sample;
right-shifting the new word units in the new word unit sequence in the selected sample by one position to obtain a training word unit sequence;
and training the initial recognition sub-model based on the audio feature sequence and the training word unit sequence in the selected sample to obtain a trained recognition sub-model.
5. The method of claim 1, wherein obtaining the set of speech samples comprises:
acquiring an initial data set comprising at least one initial voice;
preprocessing the initial data set to obtain a processed data set;
performing data enhancement on the processed data set to obtain a speech data set;
and obtaining a voice sample set based on the voice data set.
6. The method of claim 5, wherein preprocessing the initial data set to obtain a processed data set comprises at least one of:
sampling all initial data in the initial data set at the same sampling rate;
deleting duplicate and unrecognizable initial data from the initial data set.
7. The method of claim 5, wherein performing data enhancement on the processed data set to obtain a speech data set comprises at least one of:
modifying the speech speed of each piece of processed data in the processed data set to obtain a speed-perturbed data set, and adding the speed-perturbed data set to the processed data set;
adding reverberation to each piece of processed data in the processed data set to obtain a reverberation data set, and adding the reverberation data set to the processed data set;
adding noise to each piece of processed data in the processed data set to obtain a noise data set, and adding the noise data set to the processed data set;
and modifying the audio spectrogram of each piece of processed data in the processed data set to obtain a spectrum data set, and adding the spectrum data set to the processed data set.
8. A speech recognition model testing method, the method comprising:
obtaining a test set and a trained speech recognition model, the test set comprising at least one test sample, the test sample comprising: an audio feature sequence and a test word unit sequence; the trained speech recognition model being obtained by training with the speech recognition model training method according to any one of claims 1 to 7, and the speech recognition model comprising: an encoder, a decoder, and a keyword module;
selecting a test sample from the test set;
inputting the audio feature sequence in the test sample into the encoder to obtain audio intermediate features;
inputting the audio intermediate features and a current word unit sequence into the decoder to obtain a predicted word unit output by the speech recognition model;
in response to the predicted word unit being a non-ending symbol, updating the current word unit sequence based on the predicted word unit, and continuing to input the audio intermediate features and the current word unit sequence into the decoder until the predicted word unit output by the speech recognition model is the ending symbol, to obtain all the predicted word units corresponding to the test sample;
and detecting whether the speech recognition model is qualified based on all the predicted word units corresponding to the test sample and the test word unit sequence in the test sample.
9. The method of claim 8, wherein the detecting whether the speech recognition model is qualified based on all the predicted word units corresponding to the test sample and the sequence of the test word units in the test sample comprises:
sequencing all the predicted word units corresponding to the test sample to obtain a predicted word unit sequence;
calculating the word error rate of the test sample based on the predicted word unit sequence and the test word unit sequence in the test sample;
and determining that the speech recognition model is qualified in response to the word error rate being less than an error threshold.
10. A method of speech recognition, the method comprising:
acquiring a voice to be recognized;
processing the voice to be recognized to obtain audio feature data;
inputting the audio feature data into a speech recognition model to obtain a predicted word unit sequence of the voice to be recognized, wherein the speech recognition model is a trained speech recognition model obtained with the speech recognition model training method according to any one of claims 1-7;
and obtaining text data of the voice to be recognized based on the predicted word unit sequence.
11. A speech recognition model training apparatus, the apparatus comprising:
A sample acquisition unit configured to acquire a set of speech samples, the set of speech samples including at least one speech sample, the speech sample comprising: an audio feature sequence and an initial word unit sequence;
the model acquisition unit is configured to acquire an initial voice recognition model, wherein the voice recognition model is used for representing the corresponding relation between the audio feature sequence and the predicted word unit sequence;
a replacing unit configured to replace language word units in the initial word unit sequence in the set of speech samples with predicted word units characterizing the language to obtain a training sample set, wherein the predicted word units are predicted word units in the predicted word unit sequence obtained by inputting a speech sample selected from the set of speech samples into the speech recognition model;
and the training unit is configured to train the voice recognition model based on the training sample set to obtain a trained voice recognition model.
12. The apparatus of claim 11, wherein the speech recognition model comprises: an initial recognition sub-model and a keyword module connected to the initial recognition sub-model; the training unit is configured to: training the initial recognition sub-model based on the training sample set to obtain a trained recognition sub-model; training the keyword module based on the training sample set and the trained recognition sub-model to obtain a trained keyword module; and taking the trained recognition sub-model and the trained keyword module as the trained speech recognition model.
13. The apparatus of claim 12, wherein the training unit is further configured to: fixing parameters of the trained recognition sub-model; inputting training samples in a training sample set into the trained recognition sub-model to obtain a first probability distribution output by the trained recognition sub-model and a second probability distribution output by the keyword module; calculating a loss value of an initial word unit in a training sample based on the first probability distribution and the second probability distribution; and obtaining a trained keyword module based on the loss value.
14. The apparatus of claim 12, wherein the training unit is further configured to: selecting a training sample from the training sample set to obtain a selected sample; right-shifting the new word units in the new word unit sequence in the selected sample by one position to obtain a training word unit sequence; and training the initial recognition sub-model based on the audio feature sequence and the training word unit sequence in the selected sample to obtain a trained recognition sub-model.
15. The apparatus of claim 11, wherein the sample acquisition unit is configured to: acquiring an initial data set comprising at least one initial voice; preprocessing the initial data set to obtain a processed data set; performing data enhancement on the processed data set to obtain a voice data set; and obtaining a voice sample set based on the voice data set.
16. The apparatus of claim 15, wherein the sample acquisition unit is further configured to perform at least one of:
sampling all initial data in the initial data set at the same sampling rate;
deleting duplicate and unrecognizable initial data from the initial data set.
17. The apparatus of claim 15, wherein the sample acquisition unit is further configured to perform at least one of:
modifying the speech speed of each piece of processed data in the processed data set to obtain a speed-perturbed data set, and adding the speed-perturbed data set to the processed data set;
adding reverberation to each piece of processed data in the processed data set to obtain a reverberation data set, and adding the reverberation data set to the processed data set;
adding noise to each piece of processed data in the processed data set to obtain a noise data set, and adding the noise data set to the processed data set;
and modifying the audio spectrogram of each piece of processed data in the processed data set to obtain a spectrum data set, and adding the spectrum data set to the processed data set.
18. A speech recognition model testing apparatus, the apparatus comprising:
An information acquisition unit configured to acquire a test set and a trained speech recognition model, the test set including at least one test sample, the test sample including: an audio feature sequence and a test word unit sequence; the trained speech recognition model being obtained by training with the speech recognition model training apparatus according to any one of claims 11 to 17, and the speech recognition model comprising: an encoder, a decoder, and a keyword module;
a selecting unit configured to select a test sample from the test set;
an input unit configured to input the audio feature sequence in the test sample to the encoder to obtain audio intermediate features;
The obtaining unit is configured to input the audio intermediate characteristics and the current word unit sequence into the decoder to obtain a predicted word unit output by the voice recognition model;
an updating unit configured to, in response to the predicted word unit being a non-ending symbol, update the current word unit sequence based on the predicted word unit, and continue to input the audio intermediate features and the current word unit sequence into the decoder until the predicted word unit output by the speech recognition model is the ending symbol, to obtain all the predicted word units corresponding to the test sample;
and a test unit configured to detect whether the speech recognition model is qualified based on all the predicted word units corresponding to the test sample and the test word unit sequence in the test sample.
19. The apparatus of claim 18, wherein the test unit is further configured to: sequencing all the predicted word units corresponding to the test sample to obtain a predicted word unit sequence; calculating the word error rate of the test sample based on the predicted word unit sequence and the test word unit sequence in the test sample; and determining that the speech recognition model is qualified in response to the word error rate being less than an error threshold.
20. A speech recognition device, the device comprising:
A voice acquisition unit configured to acquire a voice to be recognized;
the processing unit is configured to process the voice to be recognized to obtain audio characteristic data;
a recognition unit configured to input the audio feature data into a speech recognition model to obtain a predicted word unit sequence of the speech to be recognized, wherein the speech recognition model is a trained speech recognition model obtained with the speech recognition model training apparatus according to any one of claims 11 to 17;
and the conversion unit is configured to obtain text data of the voice to be recognized based on the predicted word unit sequence.
21. An electronic device, comprising:
at least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
22. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-10.
23. A computer program product comprising a computer program which, when executed by a processor, implements the method of any of claims 1-10.