CN114360510A - Voice recognition method and related device - Google Patents
- Publication number
- CN114360510A (application number CN202210042387.1A)
- Authority
- CN
- China
- Prior art keywords
- voice
- speech
- frame
- data
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The embodiments of the present application disclose a voice recognition method and a related apparatus, relating at least to speech recognition technology within artificial intelligence. Voice data to be recognized is used as input data of a time-delay neural network in an acoustic model. When syllables are identified through the output layer, the syllable information before and after a speech frame can be combined, and the syllable of the speech frame can be judged with the assistance of pronunciation rules, so that a more accurate syllable probability distribution is output. Moreover, because a syllable is generally composed of one or more phonemes, the method has higher fault tolerance: it not only obtains a more accurate speech recognition result based on the syllable probability distribution, but also places low requirements on the quality of the voice data to be recognized, effectively expanding the application scenarios of speech recognition technology.
Description
Technical Field
The present application relates to the field of speech recognition, and in particular, to a speech recognition method and related apparatus.
Background
A voice content recognition service can be provided to users through speech recognition technology, and the technology can be applied to various scenarios, such as speech-to-text, voice wake-up, human-computer interaction, and the like. In a specific implementation, the acoustic features of the speech data to be recognized may be extracted through an acoustic model, and the corresponding speech recognition result may be determined based on the acoustic features.
The related art mainly uses the phoneme (phone) as the modeling unit of the acoustic model. A phoneme is the smallest speech unit divided according to the natural attributes of speech; it is analyzed according to the pronunciation actions within a syllable, and one action constitutes one phoneme.
However, the modeling granularity of phonemes is fine. This fine-grained speech recognition approach places high requirements on the quality of the speech data to be recognized, and even slight pronunciation errors may directly affect the recognition result, making it difficult for speech recognition technology to adapt to some speech recognition scenarios.
Disclosure of Invention
In order to solve the above technical problems, the present application provides a speech recognition method and a related apparatus, which are used to improve the accuracy of a speech recognition result and expand the use scenario of a speech recognition technology.
In one aspect, an embodiment of the present application provides a speech recognition method, where the method includes:
acquiring an acoustic model and voice data to be recognized, wherein the acoustic model comprises a time delay neural network, and an output layer of the time delay neural network comprises acoustic modeling units respectively corresponding to a plurality of syllables;
using the voice data as input data of the time delay neural network, and determining syllable probability distribution corresponding to the voice frame included in the voice data through the time delay neural network, wherein the syllable probability distribution is used for identifying the probability that the voice frame corresponds to each of the plurality of syllables;
and determining a voice recognition result corresponding to the voice data according to the syllable probability distribution.
On the other hand, the embodiment of the application provides a voice recognition device, which comprises an acquisition unit, a syllable probability distribution determination unit and a voice recognition result determination unit;
the acquiring unit is used for acquiring an acoustic model and voice data to be recognized, the acoustic model comprises a time delay neural network, and an output layer of the time delay neural network comprises acoustic modeling units respectively corresponding to a plurality of syllables;
the syllable probability distribution determining unit is configured to use the speech data as input data of the time-delay neural network, and determine, by the time-delay neural network, syllable probability distributions corresponding to speech frames included in the speech data, where the syllable probability distributions are used to identify probabilities that the speech frames correspond to the plurality of syllables, respectively;
and the voice recognition result determining unit is used for determining the voice recognition result corresponding to the voice data according to the syllable probability distribution.
In another aspect, an embodiment of the present application provides a computer device, where the computer device includes a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of the above aspect according to instructions in the program code.
In another aspect, the present application provides a computer-readable storage medium for storing a computer program for executing the method of the above aspect.
In another aspect, embodiments of the present application provide a computer program product or a computer program, which includes computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method of the above aspect.
According to the above technical scheme, an acoustic model including a time-delay neural network is obtained for the voice data to be recognized. The voice data is used as input data of the time-delay neural network, and the output layer of the time-delay neural network includes acoustic modeling units respectively corresponding to a plurality of syllables, so the time-delay neural network can use syllables as the recognition granularity and obtain, through the acoustic modeling units of the output layer, the syllable probability distributions respectively corresponding to the speech frames included in the voice data. During speech feature extraction and transmission, the time-delay neural network carries the rich context information of a speech frame in the voice data all the way to the output layer, and this context information reflects the information of the syllables before and after the speech frame in the voice data. Therefore, when the output layer performs recognition at the syllable level, it can combine the information of the syllables before and after the speech frame and use pronunciation rules to assist in judging the syllable of the speech frame, so as to output a more accurate syllable probability distribution. Moreover, because a syllable is generally composed of one or more phonemes, the fault tolerance is higher: even if individual phonemes within a syllable in the voice data are mispronounced, the influence on the recognition result of the whole syllable is smaller than in the phoneme case. Therefore, even if the quality of the voice data to be recognized is not high, a more accurate speech recognition result can be determined based on the syllable probability distribution, which effectively expands the application scenarios of speech recognition technology.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic view of an application scenario of a speech recognition method according to an embodiment of the present application;
fig. 2 is a flowchart of a speech recognition method according to an embodiment of the present application;
fig. 3 is a schematic diagram of a TDNN provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of an acoustic model provided by an embodiment of the present application;
fig. 5 is a schematic diagram of an embodiment of an application scenario of a speech recognition system according to an embodiment of the present application;
FIG. 6 is a schematic diagram of an acoustic model provided by an embodiment of the present application;
fig. 7 is a schematic diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 8 is a structural diagram of a terminal device according to an embodiment of the present application;
fig. 9 is a block diagram of a server according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
In recognizing speech data, the related art uses the phoneme as the modeling unit of the acoustic model. For example, the phoneme sequence corresponding to the keyword "hello" is "n iy3 hh aw3", and the keyword "hello" can be recognized only if the phoneme sequence "n iy3 hh aw3" is recognized in the speech data. Since the modeling granularity of a phoneme is too fine, it places high requirements on the quality of the speech data. If even one phoneme in the keyword is pronounced in a non-standard way, keyword recognition fails, so the accuracy of the speech recognition result is low. Moreover, in some complex speech recognition scenarios, such as noisy background sound or non-standard user speech, the robustness of speech recognition is low, so the applicable scenarios of speech recognition technology are few.
Based on this, the embodiment of the application provides a voice recognition method, which not only improves the accuracy of the voice recognition result, but also has low requirement on the quality of voice data, and effectively expands the application scenes of the voice recognition technology.
The speech recognition method provided by the embodiments of the present application is implemented based on Artificial Intelligence (AI). Artificial intelligence is a theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision making.
Artificial intelligence technology is a comprehensive discipline involving a wide range of fields, including both hardware-level and software-level technologies. The basic artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, intelligent transportation, and the like.
In the embodiment of the present application, the artificial intelligence technology mainly involved includes the above-mentioned voice processing technology and the like.
The voice recognition method provided by the present application can be applied to voice recognition devices with data processing capability, such as terminal devices and servers. The terminal device involved in the present application may specifically be a smart phone, a tablet computer, a notebook computer, a palmtop computer, a personal computer, a smart watch, a smart speaker, a vehicle-mounted device, a wearable device, an intelligent voice interaction device, a smart home appliance, a vehicle-mounted terminal, and the like, but is not limited thereto. The server involved in the present application may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, Content Delivery Network (CDN), big data, and artificial intelligence platforms. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the present application. The numbers of servers and terminal devices are likewise not limited.
The aforementioned speech recognition device may be provided with speech processing technology. The key technologies of Speech Technology are automatic speech recognition, speech synthesis, and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and voice is expected to become one of the most promising human-computer interaction modes in the future.
In the speech recognition method provided by the embodiment of the application, the adopted artificial intelligence model mainly relates to the application of the speech recognition technology, and the speech recognition result corresponding to the speech data to be recognized is determined through the acoustic model.
In order to facilitate understanding of the technical solution of the present application, the following describes a speech recognition method provided in the embodiments of the present application with a terminal device as a speech recognition device in combination with an actual application scenario.
Referring to fig. 1, the figure is a schematic view of an application scenario of a speech recognition method according to an embodiment of the present application. In the application scenario shown in fig. 1, the terminal device is a smart speaker 100, and is configured to identify a keyword "hello" of voice data to be identified.
After the smart speaker 100 acquires the voice data to be recognized, the voice data is input into the time-delay neural network. The time-delay neural network includes multiple layers, and each layer has a strong ability to abstract voice features. The time-delay neural network has a wide context field of view and can capture wider context information, and this context information reflects the information of the syllables before and after a speech frame in the voice data. Therefore, when the voice data is recognized at the syllable level by the output layer, the syllable information before and after the speech frame can be combined, and the syllable of the speech frame can be judged with the assistance of pronunciation rules, so the output probability distribution for the syllable sequence "ni3 hao3" is more accurate.
The smart speaker 100 then determines a more accurate speech recognition result corresponding to the voice data according to this more accurate syllable probability distribution. Therefore, the accuracy of the speech recognition result is improved, the requirement on the quality of the voice data is reduced, and the application scenarios of speech recognition technology are effectively expanded.
A speech recognition method provided in an embodiment of the present application is described below with reference to the drawings, where a server is used as a speech recognition device.
Referring to fig. 2, the figure is a flowchart of a speech recognition method according to an embodiment of the present application. As shown in fig. 2, the speech recognition method includes the steps of:
S201: An acoustic model and speech data to be recognized are obtained.
In the related art, a Gaussian Mixture Model (GMM)-Hidden Markov Model (HMM) using phonemes as the modeling unit, i.e., the GMM-HMM model, became the mainstream system for the Automatic Speech Recognition (ASR) task. However, with the accumulation of labeled corpora and the rapid increase of computing power, the effects of many newly proposed deep neural network models have greatly surpassed those of the GMM-HMM model and are still being further improved. Due to the powerful modeling capabilities of these deep neural networks, the output labels of the acoustic model no longer need to be subdivided into phonemes. Much research has found that acoustic models can also use a coarser modeling granularity, in most cases with better effect.
Based on this, the embodiments of the present application no longer use the phoneme as the acoustic modeling unit, but use the syllable. A syllable mainly consists of an initial, a final, and a tone, and each syllable can be decomposed into one or more phonemes. Referring to Table 1, the differences between syllables and phonemes are illustrated by the keywords "hello" and "world".
Table 1. Differences between syllables and phonemes

Keyword | Syllable sequence | Phoneme sequence
hello (你好) | ni3 hao3 | n iy3 hh aw3
world (世界) | shi4 jie4 | sh iy4 j iy4 eh4
Compared with phonemes, syllables can carry more context information, so that the context learning capability of the acoustic model is improved, and the coverage rate of the speech keywords is improved. In addition, syllables are more consistent with the human speech process and are easier to understand. Moreover, since syllables are generally composed of one or more phonemes, the fault tolerance is higher, even if pronunciation errors of individual phonemes occur in one syllable in the voice data, the influence on the recognition result of the whole syllable is smaller compared with the phonemes, so that the requirement on the quality of the voice data is reduced, and the application scene of the voice recognition technology is effectively expanded.
The Acoustic Model (AM) includes a Time Delay Neural Network (TDNN), which will be described later. The voice data is also called a sound file, and is data recorded by voice, for example, the sound for waking up the smart speaker may be called voice data.
It is understood that in the specific implementation of the present application, the user-related data such as voice data is referred to, when the above embodiments of the present application are applied to specific products or technologies, user permission or consent needs to be obtained, and the collection, use and processing of the related data need to comply with relevant laws and regulations and standards of relevant countries and regions.
It should be noted that the acoustic model and the voice data to be recognized may be obtained simultaneously, the voice data to be recognized may be obtained after the acoustic model is obtained in advance, or the acoustic model may be obtained after the voice data to be recognized is obtained, which is not specifically limited in the present application.
S202: and determining syllable probability distribution corresponding to the voice frames included in the voice data through the time delay neural network by taking the voice data as input data of the time delay neural network.
The acoustic model may convert the speech input into an acoustically represented syllable output, and in the embodiment of the present application, for each speech frame, a probability distribution of the syllable to which it belongs is calculated. The acoustic model includes a TDNN, as shown in fig. 3, which is a schematic diagram of a TDNN provided in an embodiment of the present application.
As shown in fig. 3, the TDNN includes multiple layers of networks, each layer of which has a strong abstraction capability for voice features. And the TDNN has a wider context view, can capture wider context information and has stronger modeling capability on the voice time sequence dependent information.
In the embodiment of the present application, the output layer of the TDNN includes acoustic modeling units corresponding to a plurality of syllables, and after the speech data is input to the TDNN, the TDNN may determine the probability distribution of the syllables corresponding to the speech frames included in the speech data by using the syllables as the recognition granularity. Wherein the syllable probability distribution is used for identifying the probability that the voice frame corresponds to each of the plurality of syllables. As shown in fig. 3, in the output layer of TDNN, the probability corresponding to each syllable can be output. Fig. 3 shows only 4 syllables as an example, but is not limited to 4 syllables. For example, all syllables required in chinese may be counted and represented by an output layer, thereby determining a recognition result corresponding to the speech data to be recognized input by the chinese user.
During speech feature extraction and transmission, the time-delay neural network carries the rich context information of a speech frame in the voice data all the way to the output layer, and this context information reflects the information of the syllables before and after the speech frame in the voice data. Therefore, when the output layer performs recognition at the syllable level, it can combine the information of the syllables before and after the speech frame and use pronunciation rules to assist in judging the syllable of the speech frame, so as to output a more accurate syllable probability distribution.
S203: and determining a voice recognition result corresponding to the voice data according to the syllable probability distribution.
The syllable probability distribution is used to identify the probability that the speech frame corresponds to each of the plurality of syllables. For example, if the output layer of the TDNN outputs probabilities corresponding to 100 syllables, the syllable probability distribution obtained by the TDNN is a 1 × 100 row vector, each element in the row vector represents the probability of the corresponding syllable, and the probabilities of the 100 elements sum to 1. It can be understood that the greater the probability of a syllable, the greater the likelihood that the actual pronunciation of the speech frame is the same as the pronunciation of that syllable. Continuing with the above example, if the syllable probability distribution obtained for speech frame A is [0.1, 0.8, 0.1, 0, …, 0] (95 zeros omitted), the actual pronunciation of speech frame A is likely to be among the first three syllables, and in particular is most likely the pronunciation of the second syllable. The present application does not specifically limit this, and those skilled in the art can set it according to the actual application scenario.
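For illustration, the following is a minimal sketch (not taken from the patent) of how such a frame-level syllable distribution can be produced and read, assuming a softmax over output-layer logits with one entry per syllable unit; the array sizes and function names are illustrative.

```python
import numpy as np

def syllable_distribution(logits):
    """Softmax over the output-layer logits: one probability per syllable unit.

    `logits` is assumed to be a 1-D array with one entry per acoustic
    modeling unit (syllable) in the TDNN output layer.
    """
    e = np.exp(logits - logits.max())   # shift for numerical stability
    return e / e.sum()                  # probabilities sum to 1

# Toy example with 100 syllable units, mirroring the 1x100 row vector above.
rng = np.random.default_rng(0)
probs = syllable_distribution(rng.normal(size=100))
assert abs(probs.sum() - 1.0) < 1e-9
best = int(probs.argmax())              # most likely syllable index for this frame
print(best, probs[best])
```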
After obtaining the syllable probability distribution corresponding to each voice frame in the voice data, the voice recognition result corresponding to the voice data can be further determined according to the syllable probability distribution corresponding to each of the plurality of voice frames. If a user inputs a section of voice, the voice is used as voice data to be recognized, syllable probability distribution corresponding to voice frames included in the voice data can be obtained through TDNN, and then a voice recognition result is determined.
According to the above technical scheme, an acoustic model including a time-delay neural network is obtained for the voice data to be recognized. The voice data is used as input data of the time-delay neural network, and the output layer of the time-delay neural network includes acoustic modeling units respectively corresponding to a plurality of syllables, so the time-delay neural network can use syllables as the recognition granularity and obtain, through the acoustic modeling units of the output layer, the syllable probability distributions respectively corresponding to the speech frames included in the voice data. During speech feature extraction and transmission, the time-delay neural network carries the rich context information of a speech frame in the voice data all the way to the output layer, and this context information reflects the information of the syllables before and after the speech frame in the voice data, so when the output layer performs recognition at the syllable level, it can combine the information of the syllables before and after the speech frame and use pronunciation rules to assist in judging the syllable of the speech frame, outputting a more accurate syllable probability distribution. Moreover, because a syllable is generally composed of one or more phonemes, the fault tolerance is higher: even if individual phonemes within a syllable in the voice data are mispronounced, the influence on the recognition result of the whole syllable is smaller than in the phoneme case. Therefore, even if the quality of the voice data to be recognized is not high, a more accurate speech recognition result can be determined based on the syllable probability distribution, which effectively expands the application scenarios of speech recognition technology.
It should be noted that the method and apparatus can be applied to voice keyword detection (Keyword Spotting, KWS) tasks such as wake-up interaction in smart devices and speech keyword detection in audio and video files. Due to the limitations of labeled corpora and computing power, acoustic models such as the Gaussian Mixture Model (GMM), Deep Neural Networks (DNN), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN) are not effective enough, which results in poor performance on smart devices. In order to meet the limited computing resources of smart devices and the processing requirements of massive audio/video files, keyword detection needs to be fast enough, and the detection result is expected to be accurate enough.
Based on this, refer to fig. 4, which is a schematic diagram of an acoustic model provided in the embodiments of the present application. As shown in fig. 4, the acoustic model includes not only TDNN but also a decoder. Because TDNN has a wider context field of view and can capture wider context information, the method can further increase the modeling capability of an acoustic model on the voice time sequence information, namely, TDNN can process the time sequence information more efficiently, thereby improving the accuracy of keyword detection. The decoder is used for determining whether the voice data comprises the voice recognition result of the keyword according to the syllable probability distribution so as to determine whether to awaken the corresponding terminal equipment according to the voice recognition result.
For convenience of description, the following description takes an example of a wake-up interaction scenario of the smart device, see S2031 to S2033:
s2031: and determining a keyword corresponding to the awakening scene according to the awakening scene corresponding to the voice data.
It can be understood that, since the wake-up scenario is different, the corresponding keywords may be different, and those skilled in the art can set the keywords according to actual needs. For example, a keyword corresponding to the smart home wake-up scene is "open XX (home device)", a keyword corresponding to the virtual assistant wake-up scene is "XXX (virtual assistant name)", and the like, and according to the setting of the keyword by a person skilled in the art, after acquiring voice data to be recognized, the keyword corresponding to the wake-up scene is determined according to the corresponding wake-up scene.
It should be noted that S2031 may be executed after obtaining the syllable probability distribution, or may be executed after acquiring the speech data to be recognized, which is not specifically limited in this application.
S2032: a speech recognition result for identifying whether the keyword is included in the speech data is determined by the decoder based on the syllable probability distribution.
The decoder is used to decode the syllable probability distribution into the speech recognition result. Therefore, after the syllable probability distribution is obtained through the TDNN, the speech recognition result corresponding to the voice data can be determined by the decoder according to the syllable probability distribution; a speech recognition result identifying whether the keyword is included in the voice data can also be determined by the decoder.
As a possible implementation manner, for a given keyword, after decoding by a decoder based on Weighted Finite State Transducer (WFST), a speech recognition result of whether the keyword is included or not may be directly output. The WFST provides a unified framework, information from an acoustic model, a pronunciation dictionary and a language model can be fused in a unified mode, an optimal path is searched in a huge search space, and therefore the search speed of the model can be improved.
As a possible implementation manner, in the decoding process, a Lattice static decoding strategy based on WFST can be further adopted to search an optimal path in a search space, so that the search speed of the model is improved.
As a possible implementation manner, in the decoding process, the decoding speed can also be increased by methods such as frame pruning and the like. For example, the generated syllable probability distribution is a vector of 1 × 100, the first three probabilities are 0.1, 0.8 and 0.1, respectively, and the last 97 probabilities are all 0, the last 97 are pruned by means of frame pruning, and the vector of 1 × 100 is changed into a vector of 1 × 3, so that only the vector of 1 × 3 needs to be decoded, and the decoding speed is further improved.
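A minimal sketch of the frame-pruning idea described above, assuming a simple probability threshold; the threshold value and function names are illustrative rather than from the patent.

```python
import numpy as np

def prune_frame(probs, threshold=0.05):
    """Keep only syllables whose probability exceeds `threshold`.

    Returns (indices, pruned_probs): e.g. a 1x100 distribution whose mass sits
    on three syllables collapses to a length-3 vector plus its indices, so the
    decoder only needs to expand those candidates.
    """
    keep = np.flatnonzero(probs > threshold)
    return keep, probs[keep]

probs = np.zeros(100)
probs[:3] = [0.1, 0.8, 0.1]          # the example distribution from the text
idx, pruned = prune_frame(probs)
print(idx, pruned)                   # -> [0 1 2] [0.1 0.8 0.1]
```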
S2033: and if the voice recognition result indicates that the voice data comprises the keyword, awakening the terminal equipment corresponding to the awakening scene.
According to the above technical scheme, syllables are used instead of phonemes as the modeling unit of the acoustic model; the granularity is coarser, the occupied memory is smaller, and fewer computing resources are required. For speech keywords, syllables provide higher coverage, higher fault tolerance, and higher accuracy. Meanwhile, for speech data to be recognized of the same length, syllable-based keyword detection is faster. Under limited computing resources and the processing requirements of massive audio and video files, because the acoustic model includes the TDNN and the decoder, the TDNN further increases the modeling capability of the acoustic model for speech timing information, and the decoder determines the speech recognition result of whether the keyword is included in the voice data, thereby improving the accuracy of keyword detection. Therefore, the acoustic model provided by the embodiments of the present application not only detects keywords faster but also produces more accurate detection results; performance and speed are well balanced, and the effect on smart devices is better.
With respect to the foregoing S2032, the embodiment of the present application does not specifically limit the manner in which the speech recognition result is determined by the decoder, and for example, the speech recognition result used for identifying whether the keyword is included in the speech data may be determined by the decoder according to the syllable probability distribution and the matching word list.
For example, in the virtual assistant wake-up scenario, the corresponding keyword is "hello world", and a corresponding keyword table is established according to the keyword. After a user inputs a piece of speech, the speech is used as the voice data to be recognized, the syllable probability distribution is determined through the acoustic model provided by the embodiments of the present application, and the decoder looks up the keyword table corresponding to "hello world" and determines, in combination with the syllable probability distribution, whether the voice data input by the user includes "hello world". If the voice data includes "hello world", the decoder may output the result "keyword included"; further, the virtual assistant may be woken up according to the speech recognition result to provide services to the user, as shown in S2033. The terminal device corresponding to the wake-up scene may be a terminal device such as a smart speaker, a virtual assistant, or a smart home appliance, which is not specifically limited in this application.
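As a simplified illustration only (the actual system decodes with a WFST as described above), deciding keyword presence can be thought of as checking whether any accepted pronunciation in the keyword table appears as a contiguous run in the decoded syllable sequence; the function and example values below are hypothetical.

```python
def contains_keyword(decoded_syllables, keyword_table):
    """Return True if any accepted pronunciation appears in the decoded sequence.

    `decoded_syllables` is the frame-collapsed syllable sequence; the substring
    check is a simplification and could be tightened with explicit boundary
    padding in a real system.
    """
    text = " ".join(decoded_syllables)
    return any(pron in text for pron in keyword_table)

decoded = ["jin1", "tian1", "ni1", "hao3", "ma5"]
table = {"ni3 hao3", "ni1 hao3", "ni3 hao4", "ni1 hao4"}
print(contains_keyword(decoded, table))   # -> True
```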
The keyword list is established by taking the keywords as a core and comprises possibly corresponding pronunciations of the keywords. The embodiment of the present application does not specifically limit the determining manner of the keyword list, for example, a matching word list for the keyword may be constructed by using syllables as the division granularity according to similar syllables and/or polyphone syllables and syllables corresponding to the keyword. Through abundant matching word lists of similar syllables and/or polyphone syllables, the coverage degree of the keywords is effectively improved, missing awakening is further avoided, and the use feeling of a user is improved. Similar syllables and polyphonic syllables are described below.
Similar syllables: similar syllables are syllables that are close in pronunciation to syllables included in the keyword. For example, the syllable sequence of the keyword "hello" is "ni3 hao3"; when the keyword is drawn out in pronunciation, "ni" may be read with the first tone instead of the third tone, so the syllable sequence "ni1 hao3" can be added to the keyword matching table as a similar syllable.
Polyphone syllables: a polyphonic character is a character with more than one pronunciation. If the keyword contains a polyphonic character, the polyphone syllables corresponding to it are determined, and all pronunciations of the polyphonic character should be included in the keyword list to prevent missed wake-ups caused by the user's accent. Continuing with the keyword "hello" as an example, the syllable sequence "ni3 hao4" may be added to the keyword matching table as a polyphone syllable.
As a possible implementation, some error-prone pronunciations may also be added to the keyword list. For example, the keyword "elegant" is easily read with the syllable "dang" in the third tone, so the syllable "dang3" is included in the keyword list corresponding to the keyword "elegant".
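A hedged sketch of how such a matching word list might be assembled from the keyword's canonical syllables plus similar-syllable and polyphone variants; the variant tables and helper names below are illustrative assumptions, not the patent's actual data.

```python
from itertools import product

# Hand-written variant tables; in practice these would come from pronunciation
# statistics and a dictionary of polyphonic characters (values are illustrative).
SIMILAR = {"ni3": ["ni1"]}          # tone drift when the word is drawn out
POLYPHONE = {"hao3": ["hao4"]}      # alternative readings of a polyphonic character

def build_keyword_table(canonical):
    """Expand a keyword's canonical syllable sequence into a matching table."""
    options = []
    for syl in canonical:
        options.append([syl] + SIMILAR.get(syl, []) + POLYPHONE.get(syl, []))
    # Cartesian product of per-syllable variants -> all accepted pronunciations.
    return {" ".join(seq) for seq in product(*options)}

table = build_keyword_table(["ni3", "hao3"])
print(table)   # {'ni3 hao3', 'ni1 hao3', 'ni3 hao4', 'ni1 hao4'}
```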
As a possible implementation manner, the TDNN may not only combine the front and back syllable information of the speech frame, but also combine the speech frame characteristics corresponding to at least one frame before and after the i-th speech frame in the speech data, so as to further improve the accuracy of syllable probability distribution. Taking the TDNN including N layers of feature extraction layers as an example, S202 may be implemented by S2021 to S2023 for an i-th frame speech frame in speech frames included in speech data, which is specifically as follows:
s2021: and determining the speech frame characteristics of the ith frame speech frame through the jth layer characteristic extraction layer according to the output characteristics of the jth-1 layer characteristic extraction layer aiming at the ith frame speech frame.
Each layer of the network included in the TDNN may be referred to as a feature extraction layer, and each feature extraction layer performs feature extraction on each speech frame included in the speech data to be recognized. In the following, for the i-th speech frame, two feature extraction layers, namely the (j-1)-th feature extraction layer and the j-th feature extraction layer, are arbitrarily selected from the N feature extraction layers, where 1 < j ≤ N.
The (j-1)-th feature extraction layer performs feature extraction for the i-th speech frame and outputs an output feature for the i-th speech frame; this output feature is input to the j-th feature extraction layer, which performs feature extraction and determines the speech frame feature of the i-th speech frame.
S2022: and determining the output characteristic of the ith frame of voice frame at the jth layer characteristic extraction layer according to the voice frame characteristic and the voice frame characteristic corresponding to at least one frame before and after the ith frame of voice frame in the voice data.
In order to further improve the context learning capability of the acoustic model, after the speech frame feature of the i-th speech frame is determined, the output feature of the i-th speech frame at the j-th feature extraction layer can be determined by combining not only the syllable information before and after the speech frame, but also the speech frame features corresponding to at least one frame before and after the i-th speech frame in the voice data, so that the context information of the speech frame carries even more information.
Continuing to refer to Fig. 3, each upper-layer node in the figure is connected to three lower-layer nodes, and each node corresponds to one frame of speech frame features. In S2022, the output feature of the i-th speech frame at the j-th feature extraction layer may be determined according to the speech frame feature, the speech frame feature corresponding to the previous frame of the i-th speech frame in the voice data, and the speech frame feature corresponding to the next frame. It can be understood that the speech frame features corresponding to the previous three frames, the next frame, and so on can also be considered; those skilled in the art can set this according to actual needs, and this application does not specifically limit it.
As a possible implementation manner, the N layers of feature extraction layers included in the TDNN may all implement the above S2021-S2022, until all feature extraction layers perform feature extraction on the i-th frame speech frame, and further determine the syllable probability distribution corresponding to the i-th frame speech frame.
S2023: and determining syllable probability distribution corresponding to the ith frame of voice frame according to the output characteristics of the ith frame of voice frame in the jth layer of characteristic extraction layer.
Therefore, after the voice data to be recognized is input into the TDNN, the syllable probability distribution corresponding to the voice frame of the ith frame can be obtained, so that the syllable probability distribution corresponding to the voice frame included in the voice data is obtained, and the voice recognition result corresponding to the voice data is further determined.
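As an illustration of S2021-S2022, the following sketch shows one feature extraction layer that splices the features of the previous, current, and next frame before an affine transform; the layer sizes, ReLU activation, and padding-by-repetition at the edges are assumptions rather than the patent's exact configuration.

```python
import numpy as np

def tdnn_layer(prev_layer_out, W, b, context=(-1, 0, 1)):
    """One feature-extraction layer of a TDNN (sketch).

    `prev_layer_out` is a (T, D) matrix: the (j-1)-th layer's output feature
    for each of the T speech frames. For frame i, the features of frames
    i-1, i, i+1 are concatenated (edges padded by repetition) and passed
    through an affine transform plus ReLU, giving the layer's output feature
    for frame i. Shapes of W (len(context)*D, H) and b (H,) are assumptions.
    """
    T, D = prev_layer_out.shape
    outputs = []
    for i in range(T):
        spliced = np.concatenate(
            [prev_layer_out[min(max(i + c, 0), T - 1)] for c in context]
        )
        outputs.append(np.maximum(spliced @ W + b, 0.0))   # ReLU
    return np.stack(outputs)                                # (T, H)

rng = np.random.default_rng(0)
frames = rng.normal(size=(6, 40))                           # 6 frames, 40-dim features
out = tdnn_layer(frames, rng.normal(size=(120, 64)) * 0.1, np.zeros(64))
print(out.shape)                                            # (6, 64)
```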
As a possible implementation manner, a training process of the acoustic model provided in the embodiment of the present application is described below.
S301: and acquiring a voice sample belonging to the same language as the voice data.
The pronunciation rules of the voice data belonging to the same language are basically the same, and the corresponding syllables are basically the same, for example, the syllables corresponding to the Chinese language are different from the syllables corresponding to the English language, so that the voice sample can be obtained based on the language to which the voice data belongs, and the acoustic model can be trained.
S302: and obtaining a prediction result through an initial time delay neural network included by the initial acoustic model according to the voice sample as the input data of the initial acoustic model.
The voice sample is input into the initial time-delay neural network included in the initial acoustic model to obtain a prediction result. Compared against the sample labels of the voice samples, the prediction results may include two types: correct prediction results and incorrect prediction results. It should be noted that a correct prediction result is one whose error relative to the corresponding sample label is smaller than a threshold, and an incorrect prediction result is one whose error is greater than or equal to the threshold. The threshold is not specifically limited in the embodiments of the present application and can be determined by those skilled in the art according to actual needs.
S303: and determining a loss function according to the prediction result and the sample label of the voice sample.
The loss function comprises a guidance weight, and the guidance weight is used for improving the influence of the identification path of the correct prediction result in the initial delay neural network and reducing the influence of the identification path of the wrong prediction result in the initial delay neural network.
As a possible implementation, multiple speech samples may be input into the initial TDNN. For example, if there are 100 speech samples, 20 may be input into the initial TDNN as a batch, over 5 batches, to determine the loss function. Suppose that when 20 speech samples are input into the initial TDNN, 5 correct prediction results and 15 incorrect prediction results are obtained. When the loss function is determined, the influence of the recognition paths where the 5 correct prediction results are located is increased through the guidance weight, and the influence of the recognition paths where the 15 incorrect prediction results are located is reduced, so that subsequent training trusts the recognition paths with higher influence more; the influence of the recognition paths corresponding to correct prediction results becomes higher and higher, the influence of the recognition paths corresponding to incorrect prediction results becomes lower and lower, and falling into a local optimum is avoided.
The initial TDNN comprises multiple layers, each layer comprises multiple nodes, and each node obtains different output features from the previous layer, so the contents learned are different, finally forming multiple recognition paths. As shown in Fig. 3, if the TDNN model is an initial TDNN model, the first node of the second layer may obtain different output features from the first, second, and third nodes of the first layer (counting from the left), thereby forming three paths; this continues upward until the last layer, so that multiple recognition paths from the first layer to the nodes of the last layer are formed. Different recognition paths may correspond to different final prediction results, and further, according to the prediction results and the sample labels, the influence corresponding to different recognition paths differs; for example, different recognition paths correspond to different weights.
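A minimal sketch of one possible reading of the guidance weight, in which samples whose prediction is already correct are up-weighted and mispredicted samples are down-weighted in a frame-level cross-entropy. A real system would weight whole recognition paths (for example in an LF-MMI lattice, as described next), and the weight values here are illustrative assumptions only.

```python
import numpy as np

def guided_cross_entropy(probs, labels, up=1.5, down=0.5):
    """Cross-entropy with a simple 'guidance weight' (one possible reading).

    `probs` is (B, S): predicted syllable distributions for B samples;
    `labels` is (B,): ground-truth syllable indices. Samples whose argmax
    already matches the label are up-weighted (their path is trusted more),
    mispredicted samples are down-weighted. `up`/`down` are illustrative.
    """
    picked = probs[np.arange(len(labels)), labels]
    correct = probs.argmax(axis=1) == labels
    weights = np.where(correct, up, down)
    return float(np.mean(-weights * np.log(picked + 1e-12)))

probs = np.array([[0.8, 0.1, 0.1],
                  [0.3, 0.6, 0.1]])
labels = np.array([0, 2])            # first prediction correct, second wrong
print(guided_cross_entropy(probs, labels))
```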
As a possible implementation manner, a lattice-free Maximum Mutual Information (LF-MMI) loss function can be added on the basis of a frame-level cross-entropy (CE) loss function, i.e., a mono-phone-based LF-MMI model is adopted instead of a mono-phone-based nnet3 model to guide the error back-propagation of the initial TDNN model. The LF-MMI model introduces a blank symbol (blank) to absorb uncertain boundaries, i.e., to absorb repeated or meaningless outputs, such as "ni3 hao3 hao3" or "ni3 ni3 hao3".
S304: and adjusting model parameters of the initial time delay neural network in the initial acoustic model according to the loss function to obtain the acoustic model.
And adjusting model parameters of the initial delay neural network in the initial acoustic model according to the loss function, thereby obtaining the acoustic model comprising the adjusted delay neural network.
As a possible implementation manner, if the initial acoustic model includes the initial delay neural network and the initial decoder, model parameters of the initial delay neural network and the initial decoder in the initial acoustic model may be adjusted according to a loss function, so as to obtain the acoustic model including the adjusted delay neural network and the adjusted decoder.
Therefore, through the guidance weight included in the loss function, the influence of the recognition paths of correct prediction results in the initial time-delay neural network is increased and the influence of the recognition paths of incorrect prediction results is reduced, so that through discriminative training the recognition probability of the correct prediction result becomes higher and other probabilities become as small as possible, thereby improving the recognition rate of the trained acoustic model and making the training of the acoustic model faster and more stable.
As a possible implementation manner, if the number of voice samples used for training is small, the accuracy of the acoustic model may be affected. Therefore, for scenarios with an insufficient number of voice samples, such as minority languages, the number of voice samples may be increased based on data augmentation. The details are as follows:
s3031: and acquiring a voice sample to be processed which belongs to the same language as the voice data.
S3032: and carrying out data amplification according to the voice sample to be processed to obtain an amplified sample.
And determining the sample label of the augmented sample based on the sample label of the corresponding voice sample to be processed. The present embodiment is not particularly limited to the manner of data expansion, and three types of embodiments are described below as examples.
The first method: by means of audio re-encoding, augmented samples with different sampling rates, different channels, and different encoding formats are generated for the voice sample to be processed.
The second method: a multiplication factor is randomly generated within a preset interval by means of speech speed and volume perturbation to change the speed and volume of the voice sample to be processed, thereby obtaining an augmented sample. The preset interval may be between 0.9 and 1.1; those skilled in the art may set it according to actual needs, and this application does not specifically limit it.
The third method: background music and noise are added, that is, various background music and various scene noises are added to a relatively clean voice sample to obtain an augmented sample.
It should be noted that the augmented sample may be generated by combining one or more of the above three ways.
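A sketch of the second and third augmentation methods, assuming a perturbation factor drawn from [0.9, 1.1] as described above; the naive interpolation-based resampling and the SNR value are simplifying assumptions rather than the patent's exact procedure.

```python
import numpy as np

def augment(waveform, noise=None, lo=0.9, hi=1.1, snr_db=15.0, rng=None):
    """Speed/volume perturbation and noise mixing (sketch).

    `waveform` and `noise` are 1-D float arrays at the same sampling rate;
    the perturbation interval [0.9, 1.1] follows the text, while the SNR value
    and the simple resampling-by-interpolation are assumptions.
    """
    rng = rng or np.random.default_rng()
    speed = rng.uniform(lo, hi)                                 # speed factor
    idx = np.arange(0, len(waveform), speed)                    # naive resample
    out = np.interp(idx, np.arange(len(waveform)), waveform)
    out = out * rng.uniform(lo, hi)                             # volume factor
    if noise is not None:                                       # add background noise
        n = np.resize(noise, out.shape)
        gain = np.sqrt((out**2).mean() / ((n**2).mean() * 10 ** (snr_db / 10) + 1e-12))
        out = out + gain * n
    return out

clean = np.sin(np.linspace(0, 200, 16000))                      # toy 1-second signal
augmented = augment(clean, noise=np.random.default_rng(1).normal(size=8000))
print(augmented.shape)
```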
S3033: and obtaining the voice sample according to the voice sample to be processed and the augmentation sample.
According to the above technical scheme, because there are great differences among voices, such as differences in the sampling rate, encoding rate, and channels of audio and video, as well as the speaker's speed, tone, environmental interference, and the like, the difficulty of voice keyword detection is greatly increased. Collecting and labeling a large amount of voice data for a certain scenario in order to obtain a large number of voice samples is very costly, and in some low-resource or minority-language scenarios the number of voice samples is even smaller. Therefore, in scenarios with few voice samples, expanding the number of voice samples in the manner of S3031-S3033 can improve the robustness of the acoustic model in different scenarios under the condition of limited voice samples.
When training an acoustic model, each frame needs an output as its label. However, after data augmentation, the speech frames of the augmented samples no longer correspond to the sample labels. Based on this, the speech frames of the augmented samples and the sample labels can be put into one-to-one correspondence by means of alignment, that is, a model is used to make the output sequence correspond one-to-one with the input features, as follows:
taking a sample label of the voice sample to be processed corresponding to the augmented sample as an undetermined label of the augmented sample;
and carrying out label alignment treatment according to the voice frame of the augmented sample and the undetermined label to obtain a sample label of the augmented sample.
In the related art, this is usually implemented by a GMM, which calculates the score of each frame on potential labels and outputs the final alignment labels in combination with dynamic programming. However, as labeled corpora and computing power increase, the effect of deep neural network models has gradually surpassed the GMM model. For a better effect, a chain-model-based acoustic model can be used instead of the GMM model, so that the accuracy of the output label sequence is higher.
Next, with reference to fig. 5 and fig. 6, a speech recognition method provided in an embodiment of the present application will be described with respect to a keyword recognition scenario, taking an example in which an acoustic model includes a TDNN and a decoder.
Referring to fig. 5, the figure is a schematic view of an embodiment of an application scenario of a speech recognition system according to an embodiment of the present application. The speech recognition system includes a feature extraction module 501, a TDNN module 502, a decoding module 503, and a vocabulary generation module 504, which are described below.
The feature extraction module 501 converts continuous speech data to be recognized into discrete vectors through a digital signal processing algorithm, and the vectors can effectively represent speech features corresponding to the speech data to be recognized, thereby facilitating subsequent speech tasks.
As a possible implementation manner, in the speech feature extraction process, fast fixed-point processing may also be performed, that is, floating-point numbers are simulated by integers. Specifically, floating-point values are mapped to integer values, such as int8, int16, or int32, by a linear mapping according to the dynamic range of the data; testing shows this is nearly twice as fast as the Kaldi toolkit (an open-source speech recognition tool).
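A sketch of such a linear float-to-integer mapping driven by the data's dynamic range; the exact quantization scheme used in the described system is not specified beyond "linear mapping", so the details below are assumptions.

```python
import numpy as np

def quantize(x, dtype=np.int16):
    """Linearly map a float feature array to integers by its dynamic range.

    Returns the integer tensor plus the scale needed to recover approximate
    float values.
    """
    qmax = np.iinfo(dtype).max
    scale = np.abs(x).max() / qmax if np.abs(x).max() > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(dtype)
    return q, scale

features = np.random.default_rng(0).normal(size=(100, 40)).astype(np.float32)
q, scale = quantize(features, np.int8)
recovered = q.astype(np.float32) * scale
print(np.abs(recovered - features).max())   # quantization error stays small
```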
The TDNN module 502 is configured to determine syllable probability distributions corresponding to voice frames included in the voice data according to the voice characteristics.
The decoding module 503 determines a voice recognition result for identifying whether a keyword is included in the voice data.
The decoder and the TDNN may form an acoustic model as shown in fig. 6: the extracted speech features are input into the TDNN to obtain the syllable probability distribution, which is then input into the decoder. Meanwhile, the speech features are data-aligned and input to the decoder, and the decoder determines, according to the syllable probability distribution and the matching word list, a speech recognition result identifying whether the speech data includes the keyword.
The vocabulary generating module 504 is configured to generate a syllable sequence corresponding to the keyword according to the dictionary.
Therefore, by cascading the four modules, the voice recognition system can efficiently detect all keywords and start and stop positions thereof in the voice data to be recognized.
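The sketch below illustrates, in a deliberately simplified greedy form, how frame-level syllable posteriors and one syllable sequence from the matching word list can yield a keyword decision together with start and stop frames; the real decoder described above would search the word list properly (e.g. with dynamic programming), and the threshold and gap values here are assumptions.

```python
import numpy as np

def spot_keyword(syllable_probs, keyword_syllables, threshold=0.5, max_gap=20):
    """Greedy sketch of keyword detection over per-frame syllable posteriors.

    syllable_probs: (num_frames, num_syllables) distributions from the TDNN.
    keyword_syllables: list of syllable indices from the matching word list.
    Returns (found, start_frame, end_frame).
    """
    target = 0
    start = None
    for t, frame in enumerate(syllable_probs):
        if frame[keyword_syllables[target]] >= threshold:
            if target == 0:
                start = t
            target += 1
            if target == len(keyword_syllables):
                return True, start, t
        elif start is not None and t - start > max_gap:
            target, start = 0, None   # give up on this partial match
    return False, None, None

# Toy posteriors where syllables 5, 9 and 3 fire in order around frames 40-60.
probs = np.full((100, 16), 0.01)
probs[40, 5] = probs[50, 9] = probs[60, 3] = 0.9
print(spot_keyword(probs, [5, 9, 3]))  # (True, 40, 60)
```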
In this keyword recognition scenario, speech data to be recognized with a total duration of 27.5 hours was used as a test set, and the effects of the two keyword detection methods are listed in Table 2. From the results, compared with the traditional keyword/filler-based keyword detection system, the keyword detection provided by the embodiment of the application improves both accuracy and recall markedly, and the overall F1 rises from 65.99% to 75.57%. This result demonstrates that the speech recognition method provided by the embodiment of the application can effectively increase the number of keyword recalls while more effectively reducing the number of keyword false alarms.
Table 2 results of performance comparison experiments
Method | Accuracy | Recall | F1 |
---|---|---|---|
Keyword/filler-based keyword detection | 86.76% | 53.24% | 65.99% |
Keyword detection of the embodiment of the application | 92.00% | 64.12% | 75.57% |
Improvement | +5.23% | +10.89% | +9.58% |
Here F1 is an index used in statistics to measure the accuracy of a binary classification model. It takes both the precision and the recall of the model into account, and can be viewed as the harmonic mean of the model's precision and recall, with a maximum of 1 and a minimum of 0.
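In terms of precision P and recall R, the F1 score is

$$F_1 = \frac{2PR}{P + R},$$

and plugging in the embodiment's row of Table 2 gives $\frac{2 \times 0.9200 \times 0.6412}{0.9200 + 0.6412} \approx 0.7557$, i.e. the 75.57% reported above.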
Aiming at the voice recognition method provided by the embodiment, the embodiment of the application also provides a voice recognition device.
Referring to fig. 7, the figure is a schematic diagram of a speech recognition apparatus according to an embodiment of the present application. As shown in fig. 7, the speech recognition apparatus 700 includes: an acquisition unit 701, a syllable probability distribution determination unit 702, and a voice recognition result determination unit 703;
the obtaining unit 701 is configured to obtain an acoustic model and voice data to be recognized, where the acoustic model includes a time-delay neural network, and an output layer of the time-delay neural network includes acoustic modeling units corresponding to a plurality of syllables, respectively;
the syllable probability distribution determining unit 702 is configured to use the speech data as input data of the time-delay neural network, and determine, by the time-delay neural network, syllable probability distributions corresponding to speech frames included in the speech data, where the syllable probability distributions are used to identify probabilities that the speech frames correspond to the plurality of syllables, respectively;
the voice recognition result determining unit 703 is configured to determine a voice recognition result corresponding to the voice data according to the syllable probability distribution.
As a possible implementation manner, the acoustic model further includes a decoder, and the speech recognition result determining unit 703 is configured to:
determining a keyword corresponding to the awakening scene according to the awakening scene corresponding to the voice data;
determining, by the decoder, a speech recognition result for identifying whether the keyword is included in the speech data according to the syllable probability distribution;
the apparatus is further configured to perform the following step:
and if the voice recognition result indicates that the voice data comprises the keyword, awakening the terminal equipment corresponding to the awakening scene.
As a possible implementation manner, the apparatus further includes a matching word list constructing unit, configured to:
establishing a matching word list aiming at the key words by taking syllables as division granularity;
the speech recognition result determining unit 703 is configured to:
and determining a voice recognition result for identifying whether the keyword is included in the voice data through the decoder according to the syllable probability distribution and the matching word list.
As a possible implementation manner, the time-delay neural network includes N layers of feature extraction layers, j ∈ N, and for an ith speech frame in the speech frames included in the speech data, the syllable probability distribution determining unit 702 is configured to:
determining the speech frame characteristics of the ith frame speech frame through a jth layer characteristic extraction layer according to the output characteristics of the (j-1)th layer characteristic extraction layer for the ith frame speech frame;
determining the output characteristics of the ith frame of voice frame at the jth layer characteristic extraction layer according to the voice frame characteristics and the voice frame characteristics corresponding to at least one frame before and after the ith frame of voice frame in the voice data;
and determining syllable probability distribution corresponding to the ith frame of voice frame according to the output characteristics of the ith frame of voice frame on the jth layer of characteristic extraction layer.
As a possible implementation manner, the matching word list constructing unit is configured to:
determining similar syllables of syllables included in the keyword according to the pronunciation similarity;
if the keyword has polyphone, determining polyphone syllables corresponding to the polyphone;
and according to at least one of the similar syllables and the polyphone syllables and the syllables corresponding to the keyword, constructing a matching word list aiming at the keyword by taking the syllables as the division granularity.
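A toy sketch of this construction is shown below: each character of the keyword is expanded into its dictionary syllables, its polyphonic readings and pronunciation-similar syllables, and the Cartesian product gives the matching word list at syllable granularity. The tiny lexicon and the n/l confusion pair are invented purely for illustration.

```python
from itertools import product

# Toy pronunciation dictionary and similar-syllable pairs (assumed; a real
# system would derive these from a lexicon and pronunciation similarity).
LEXICON = {"你": ["ni3"], "好": ["hao3", "hao4"]}   # 好 is polyphonic
SIMILAR = {"ni3": ["li3"]}                           # common n/l confusion

def build_matching_word_list(keyword):
    """Enumerate syllable sequences for a keyword, covering polyphone readings
    and pronunciation-similar syllables, with the syllable as the unit."""
    per_char = []
    for ch in keyword:
        readings = set(LEXICON.get(ch, []))
        for syl in list(readings):
            readings.update(SIMILAR.get(syl, []))
        per_char.append(sorted(readings))
    return [list(seq) for seq in product(*per_char)]

print(build_matching_word_list("你好"))
# [['li3', 'hao3'], ['li3', 'hao4'], ['ni3', 'hao3'], ['ni3', 'hao4']]
```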
As a possible implementation manner, the apparatus further includes a training unit configured to:
acquiring a voice sample belonging to the same language as the voice data;
obtaining a prediction result through an initial time delay neural network included in an initial acoustic model according to the voice sample as input data of the initial acoustic model;
determining a loss function according to the prediction result and the sample label of the voice sample, wherein the loss function comprises a guidance weight, and the guidance weight is used for improving the influence of the identification path of the correct prediction result in the initial time delay neural network and reducing the influence of the identification path of the wrong prediction result in the initial time delay neural network;
and adjusting model parameters of an initial time delay neural network in the initial acoustic model according to the loss function to obtain the acoustic model.
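One simple frame-level reading of such a guidance weight is sketched below as a weighted cross entropy in which correctly predicted frames contribute more and wrongly predicted frames contribute less; this is an assumption about how the weight could be applied, not the embodiment's exact loss function, which may operate on recognition paths in a lattice.

```python
import torch
import torch.nn.functional as F

def guided_ce_loss(logits, labels, guide_weight=0.5):
    """Frame-level cross entropy with a guidance weight: frames whose
    prediction is already correct are up-weighted and wrongly predicted
    frames are down-weighted.

    logits: (num_frames, num_syllables); labels: (num_frames,)
    """
    per_frame = F.cross_entropy(logits, labels, reduction="none")
    correct = (logits.argmax(dim=-1) == labels).float()
    weights = 1.0 + guide_weight * (2.0 * correct - 1.0)  # 1.5 if correct, 0.5 if wrong
    return (weights * per_frame).mean()

print(float(guided_ce_loss(torch.randn(200, 1300), torch.randint(0, 1300, (200,)))))
```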
As a possible implementation, the training unit is configured to:
acquiring a voice sample to be processed which belongs to the same language as the voice data;
performing data amplification according to the voice sample to be processed to obtain an amplification sample, wherein a sample label of the amplification sample is determined based on a sample label of the corresponding voice sample to be processed;
and obtaining the voice sample according to the voice sample to be processed and the augmentation sample.
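The data amplification step could, for example, look like the sketch below, which derives one augmented waveform from a to-be-processed sample via speed perturbation, volume scaling and additive noise; these particular perturbations and parameter values are assumed illustrations rather than the specific operations of S3031-S3033.

```python
import numpy as np

def augment_waveform(wave, speed=1.1, gain_db=-3.0, snr_db=20.0):
    """Produce one augmented sample from a to-be-processed speech sample.
    Speed perturbation, volume scaling and additive noise are common choices
    used here purely to illustrate the data amplification step."""
    # Speed perturbation by resampling the time axis.
    n_out = int(round(len(wave) / speed))
    src_pos = np.linspace(0, len(wave) - 1, n_out)
    out = np.interp(src_pos, np.arange(len(wave)), wave)
    # Volume scaling.
    out = out * (10.0 ** (gain_db / 20.0))
    # Additive white noise at the requested signal-to-noise ratio.
    signal_power = np.mean(out ** 2) + 1e-12
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    out = out + np.random.randn(len(out)) * np.sqrt(noise_power)
    return out.astype(np.float32)

augmented = augment_waveform(np.random.randn(16000).astype(np.float32))
print(len(augmented))  # ~14545 samples after a 1.1x speed-up
```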
As a possible implementation manner, the training unit is further configured to:
taking a sample label of the voice sample to be processed corresponding to the augmented sample as an undetermined label of the augmented sample;
and carrying out label alignment treatment according to the voice frame of the augmented sample and the undetermined label to obtain a sample label of the augmented sample.
According to the technical scheme, an acoustic model including a time-delay neural network is obtained for the voice data to be recognized. The voice data is used as input data of the time-delay neural network, and because the output layer of the time-delay neural network includes acoustic modeling units respectively corresponding to a plurality of syllables, the network can take the syllable as the recognition granularity and obtain, through each acoustic modeling unit of the output layer, the syllable probability distribution corresponding to each voice frame included in the voice data. During feature extraction and transmission, the time-delay neural network carries the rich context information of the voice frame in the voice data through to the output layer; this context information reflects the syllables before and after the voice frame, so when the output layer performs syllable recognition on the voice frame, the syllables of the voice frame can be judged with the aid of pronunciation rules in combination with the preceding and following syllable information, and a more accurate syllable probability distribution can be output. Moreover, because a syllable is generally composed of one or more phonemes, its fault tolerance is higher: even if individual phonemes of a syllable in the voice data are mispronounced, the influence on the recognition result of the whole syllable is smaller than with phoneme-level modeling. Therefore, even if the quality of the voice data to be recognized is not high, a relatively accurate voice recognition result can still be determined based on the syllable probability distribution, effectively expanding the application scenes of the voice recognition technology.
The voice recognition apparatus may be a computer device; the computer device may be a server or a terminal device, and the voice recognition apparatus may be embedded in the server or the terminal device. The computer device provided in the embodiment of the present application is described below from the perspective of hardware implementation: fig. 8 is a schematic structural diagram of a server, and fig. 9 is a schematic structural diagram of a terminal device.
Referring to fig. 8, fig. 8 is a schematic diagram of a server 1400 provided by an embodiment of the present application. The server 1400 may vary considerably in configuration or performance, and may include one or more Central Processing Units (CPUs) 1422 (e.g., one or more processors), memory 1432, and one or more storage media 1430 (e.g., one or more mass storage devices) storing applications 1442 or data 1444. The memory 1432 and the storage media 1430 may be transient or persistent storage. The program stored on the storage medium 1430 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Further, the CPU 1422 may be configured to communicate with the storage medium 1430 to perform, on the server 1400, the series of instruction operations in the storage medium 1430.
The server 1400 may also include one or more power supplies 1426, one or more wired or wireless network interfaces 1450, one or more input-output interfaces 1458, and/or one or more operating systems 1441, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 8.
The CPU 1422 is configured to perform the following steps:
acquiring an acoustic model and voice data to be recognized, wherein the acoustic model comprises a time delay neural network, and an output layer of the time delay neural network comprises acoustic modeling units respectively corresponding to a plurality of syllables;
using the voice data as input data of the time delay neural network, and determining syllable probability distribution corresponding to the voice frame included in the voice data through the time delay neural network, wherein the syllable probability distribution is used for identifying the probability that the voice frame corresponds to each of the plurality of syllables;
and determining a voice recognition result corresponding to the voice data according to the syllable probability distribution.
Optionally, the CPU 1422 may further perform method steps of any specific implementation manner of the speech recognition method in the embodiment of the present application.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a terminal device according to an embodiment of the present application. Fig. 9 is a block diagram illustrating a partial structure of a smartphone related to a terminal device provided in an embodiment of the present application, where the smartphone includes: a Radio Frequency (RF) circuit 1510, a memory 1520, an input unit 1530, a display unit 1540, a sensor 1550, an audio circuit 1560, a Wireless Fidelity (WiFi) module 1570, a processor 1580, and a power supply 1590. Those skilled in the art will appreciate that the smartphone configuration shown in fig. 9 is not limiting and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.
The following specifically describes each component of the smartphone with reference to fig. 9:
the RF circuit 1510 may be configured to receive and transmit signals during information transmission and reception or during a call, and in particular, receive downlink information of a base station and then process the received downlink information to the processor 1580; in addition, the data for designing uplink is transmitted to the base station. In general, RF circuit 1510 includes, but is not limited to, an antenna, at least one Amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, RF circuit 1510 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), and the like.
The memory 1520 may be used to store software programs and modules, and the processor 1580 implements various functional applications and data processing of the smart phone by operating the software programs and modules stored in the memory 1520. The memory 1520 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the smartphone, and the like. Further, the memory 1520 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The input unit 1530 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the smartphone. Specifically, the input unit 1530 may include a touch panel 1531 and other input devices 1532. The touch panel 1531, also referred to as a touch screen, can collect touch operations of a user (e.g., operations of the user on or near the touch panel 1531 using any suitable object or accessory such as a finger or a stylus) and drive corresponding connection devices according to a preset program. Alternatively, the touch panel 1531 may include two parts, a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, and sends the touch point coordinates to the processor 1580, and can receive and execute commands sent by the processor 1580. In addition, the touch panel 1531 may be implemented by various types such as a resistive type, a capacitive type, an infrared ray, and a surface acoustic wave. The input unit 1530 may include other input devices 1532 in addition to the touch panel 1531. In particular, other input devices 1532 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 1540 may be used to display information input by the user or information provided to the user and various menus of the smartphone. The Display unit 1540 may include a Display panel 1541, and optionally, the Display panel 1541 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch panel 1531 may cover the display panel 1541, and when the touch panel 1531 detects a touch operation on or near the touch panel 1531, the touch operation is transmitted to the processor 1580 to determine the type of the touch event, and then the processor 1580 provides a corresponding visual output on the display panel 1541 according to the type of the touch event. Although in fig. 9, the touch panel 1531 and the display panel 1541 are two separate components to implement the input and output functions of the smartphone, in some embodiments, the touch panel 1531 and the display panel 1541 may be integrated to implement the input and output functions of the smartphone.
The smartphone may also include at least one sensor 1550, such as light sensors, motion sensors, and other sensors. Specifically, the light sensor may include an ambient light sensor that may adjust the brightness of the display panel 1541 according to the brightness of ambient light and a proximity sensor that may turn off the display panel 1541 and/or backlight when the smartphone is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications (such as horizontal and vertical screen switching, related games, magnetometer attitude calibration) for recognizing the attitude of the smartphone, and related functions (such as pedometer and tapping) for vibration recognition; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the smart phone, further description is omitted here.
WiFi belongs to short-distance wireless transmission technology, and the smart phone can help a user to receive and send e-mails, browse webpages, access streaming media and the like through a WiFi module 1570, and provides wireless broadband internet access for the user. Although fig. 9 shows WiFi module 1570, it is understood that it does not belong to the essential components of the smartphone and may be omitted entirely as needed within the scope not changing the essence of the invention.
The processor 1580 is a control center of the smartphone, connects various parts of the entire smartphone using various interfaces and lines, and performs various functions of the smartphone and processes data by operating or executing software programs and/or modules stored in the memory 1520 and calling data stored in the memory 1520. Optionally, the processor 1580 may include one or more processing units; preferably, the processor 1580 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, and the like, and a modem processor, which mainly handles wireless communications. It is to be appreciated that the modem processor may not be integrated into the processor 1580.
The smartphone also includes a power supply 1590 (e.g., a battery) for powering the various components, which may preferably be logically connected to the processor 1580 via a power management system, so as to manage charging, discharging, and power consumption management functions via the power management system.
Although not shown, the smart phone may further include a camera, a bluetooth module, and the like, which are not described herein.
In an embodiment of the application, the smartphone includes a memory 1520 that can store program code and transmit the program code to the processor.
The processor 1580 included in the smart phone may execute the voice recognition method provided in the foregoing embodiment according to an instruction in the program code.
The embodiment of the present application further provides a computer-readable storage medium for storing a computer program, where the computer program is used to execute the speech recognition method provided by the foregoing embodiment.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the speech recognition method provided in the various alternative implementations of the above aspects.
Those of ordinary skill in the art will understand that all or part of the steps for implementing the above method embodiments can be completed by hardware related to program instructions. The program can be stored in a computer-readable storage medium, and when executed, it performs the steps of the above method embodiments; and the aforementioned storage medium may be at least one of the following media: various media that can store program code, such as Read-Only Memory (ROM), RAM, magnetic disk, or optical disk.
It should be noted that, in the present specification, all the embodiments are described in a progressive manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus and system embodiments, since they are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for related points. The above-described embodiments of the apparatus and system are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only one specific embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered by the scope of the present application. Moreover, the present application can be further combined to provide more implementations on the basis of the implementations provided by the above aspects. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (12)
1. A method of speech recognition, the method comprising:
acquiring an acoustic model and voice data to be recognized, wherein the acoustic model comprises a time delay neural network, and an output layer of the time delay neural network comprises acoustic modeling units respectively corresponding to a plurality of syllables;
using the voice data as input data of the time delay neural network, and determining syllable probability distribution corresponding to the voice frame included in the voice data through the time delay neural network, wherein the syllable probability distribution is used for identifying the probability that the voice frame corresponds to each of the plurality of syllables;
and determining a voice recognition result corresponding to the voice data according to the syllable probability distribution.
2. The method of claim 1, wherein the acoustic model further comprises a decoder, and wherein determining the speech recognition result corresponding to the speech data according to the syllable probability distribution comprises:
determining a keyword corresponding to the awakening scene according to the awakening scene corresponding to the voice data;
determining, by the decoder, a speech recognition result for identifying whether the keyword is included in the speech data according to the syllable probability distribution;
the method further comprises the following steps:
and if the voice recognition result indicates that the voice data comprises the keyword, awakening the terminal equipment corresponding to the awakening scene.
3. The method of claim 2, further comprising:
establishing a matching word list aiming at the key words by taking syllables as division granularity;
determining, by the decoder, a speech recognition result for identifying whether the keyword is included in the speech data according to the syllable probability distribution, including:
and determining a voice recognition result for identifying whether the keyword is included in the voice data through the decoder according to the syllable probability distribution and the matching word list.
4. The method of claim 1, wherein the time-delay neural network comprises N layers of feature extraction layers, j ∈ N, and for an ith speech frame in the speech frames included in the speech data, the determining, by using the speech data as input data of the time-delay neural network, syllable probability distributions respectively corresponding to the speech frames included in the speech data through the time-delay neural network comprises:
determining the speech frame characteristics of the ith frame speech frame through a jth layer characteristic extraction layer according to the output characteristics of the (j-1)th layer characteristic extraction layer for the ith frame speech frame;
determining the output characteristics of the ith frame of voice frame at the jth layer characteristic extraction layer according to the voice frame characteristics and the voice frame characteristics corresponding to at least one frame before and after the ith frame of voice frame in the voice data;
and determining syllable probability distribution corresponding to the ith frame of voice frame according to the output characteristics of the ith frame of voice frame on the jth layer of characteristic extraction layer.
5. The method of claim 3, further comprising:
determining similar syllables of syllables included in the keyword according to the pronunciation similarity;
if the keyword has polyphone, determining polyphone syllables corresponding to the polyphone;
the establishing of the matching word list aiming at the key words by taking the syllables as the division granularity comprises the following steps:
and according to at least one of the similar syllables and the polyphone syllables and the syllables corresponding to the keyword, constructing a matching word list aiming at the keyword by taking the syllables as the division granularity.
6. The method of claim 1, further comprising:
acquiring a voice sample belonging to the same language as the voice data;
obtaining a prediction result through an initial time delay neural network included in an initial acoustic model according to the voice sample as input data of the initial acoustic model;
determining a loss function according to the prediction result and the sample label of the voice sample, wherein the loss function comprises a guidance weight, and the guidance weight is used for improving the influence of the identification path of the correct prediction result in the initial time delay neural network and reducing the influence of the identification path of the wrong prediction result in the initial time delay neural network;
and adjusting model parameters of an initial time delay neural network in the initial acoustic model according to the loss function to obtain the acoustic model.
7. The method of claim 6, wherein obtaining the speech sample in the same language as the speech data comprises:
acquiring a voice sample to be processed which belongs to the same language as the voice data;
performing data amplification according to the voice sample to be processed to obtain an amplification sample, wherein a sample label of the amplification sample is determined based on a sample label of the corresponding voice sample to be processed;
and obtaining the voice sample according to the voice sample to be processed and the augmentation sample.
8. The method of claim 7, further comprising:
taking a sample label of the voice sample to be processed corresponding to the augmented sample as an undetermined label of the augmented sample;
and carrying out label alignment treatment according to the voice frame of the augmented sample and the undetermined label to obtain a sample label of the augmented sample.
9. A speech recognition apparatus is characterized by comprising an acquisition unit, a syllable probability distribution determination unit and a speech recognition result determination unit;
the acquiring unit is used for acquiring an acoustic model and voice data to be recognized, the acoustic model comprises a time delay neural network, and an output layer of the time delay neural network comprises acoustic modeling units respectively corresponding to a plurality of syllables;
the syllable probability distribution determining unit is configured to use the speech data as input data of the time-delay neural network, and determine, by the time-delay neural network, syllable probability distributions corresponding to speech frames included in the speech data, where the syllable probability distributions are used to identify probabilities that the speech frames correspond to the plurality of syllables, respectively;
and the voice recognition result determining unit is used for determining the voice recognition result corresponding to the voice data according to the syllable probability distribution.
10. A computer device, the computer device comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the speech recognition method of any one of claims 1-8 according to instructions in the program code.
11. A computer-readable storage medium for storing a computer program for executing the speech recognition method of any one of claims 1-8.
12. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the speech recognition method of any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210042387.1A CN114360510A (en) | 2022-01-14 | 2022-01-14 | Voice recognition method and related device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114360510A true CN114360510A (en) | 2022-04-15 |
Family
ID=81091399
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115798465A (en) * | 2023-02-07 | 2023-03-14 | 天创光电工程有限公司 | Voice input method, system and readable storage medium |
CN116612783A (en) * | 2023-07-17 | 2023-08-18 | 联想新视界(北京)科技有限公司 | Voice recognition method and device, electronic equipment and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110211588A (en) * | 2019-06-03 | 2019-09-06 | 北京达佳互联信息技术有限公司 | Audio recognition method, device and electronic equipment |
CN110838289A (en) * | 2019-11-14 | 2020-02-25 | 腾讯科技(深圳)有限公司 | Awakening word detection method, device, equipment and medium based on artificial intelligence |
CN110970036A (en) * | 2019-12-24 | 2020-04-07 | 网易(杭州)网络有限公司 | Voiceprint recognition method and device, computer storage medium and electronic equipment |
CN111402893A (en) * | 2020-03-23 | 2020-07-10 | 北京达佳互联信息技术有限公司 | Voice recognition model determining method, voice recognition method and device and electronic equipment |
CN111681661A (en) * | 2020-06-08 | 2020-09-18 | 北京有竹居网络技术有限公司 | Method, device, electronic equipment and computer readable medium for voice recognition |
CN111916058A (en) * | 2020-06-24 | 2020-11-10 | 西安交通大学 | Voice recognition method and system based on incremental word graph re-scoring |
CN112863485A (en) * | 2020-12-31 | 2021-05-28 | 平安科技(深圳)有限公司 | Accent voice recognition method, apparatus, device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||