CN114170997A - Pronunciation skill detection method, device, storage medium and electronic device

Pronunciation skill detection method, device, storage medium and electronic device

Info

Publication number
CN114170997A
Authority
CN
China
Prior art keywords
matrix
feature
phoneme
pronunciation
acoustic
Prior art date
Legal status
Pending
Application number
CN202111620731.2A
Other languages
Chinese (zh)
Inventor
李芳足
吴奎
金海
李浩
盛志超
竺博
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN202111620731.2A
Publication of CN114170997A
Legal status: Pending


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/69 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for evaluating synthetic or decoded voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

A pronunciation skill detection method, device, storage medium, and electronic device. The method includes: acquiring a text to be detected and converting it into a corresponding phoneme sequence; acquiring the audio to be detected, produced by a speaker speaking the text, and extracting acoustic features of that audio; and inputting the phoneme sequence and the acoustic features into a trained pronunciation skill detection model for pronunciation skill detection processing to obtain a first detection result and a second detection result. The first detection result indicates whether the text to be detected should be spoken with a pronunciation skill, and the second detection result indicates whether the speaker actually spoke the text with that skill. The present application can improve the accuracy of pronunciation skill detection.

Description

Pronunciation Skill Detection Method, Device, Storage Medium and Electronic Device

Technical Field

The present application relates to the technical field of speech recognition, and in particular to a pronunciation skill detection method, device, storage medium, and electronic device.

Background

At present, for any language, whether Chinese or English, spoken proficiency is central to mastering it. For English learners, for example, oral pronunciation is often both the focus and the weak point of their studies; whether pronunciation skills such as liaison (linking), loss of plosion, and voicing are applied accurately reflects a learner's spoken ability. In the related art, a speaker's pronunciation ability is usually assessed by human listening tests; however, factors such as subjective judgment and auditory fatigue affect the accuracy of the resulting pronunciation skill assessments.

Summary of the Invention

The present application provides a pronunciation skill detection method, device, storage medium, and electronic device, which can improve the accuracy of pronunciation skill detection.

The pronunciation skill detection method provided by the present application includes:

acquiring a text to be detected, and converting the text to be detected into a corresponding phoneme sequence;

acquiring the audio to be detected, obtained when a speaker speaks the text to be detected, and extracting acoustic features of the audio to be detected;

inputting the phoneme sequence and the acoustic features into a trained pronunciation skill detection model for pronunciation skill detection processing to obtain a first detection result and a second detection result;

wherein the first detection result indicates whether the text to be detected should be spoken with a pronunciation skill, and the second detection result indicates whether the speaker actually spoke the text to be detected with that pronunciation skill.

The pronunciation skill detection device provided by the present application includes:

a first acquisition module, configured to acquire a text to be detected and convert the text to be detected into a corresponding phoneme sequence;

a second acquisition module, configured to acquire the audio to be detected, obtained when a speaker speaks the text to be detected, and to extract acoustic features of the audio to be detected;

a detection module, configured to input the phoneme sequence and the acoustic features into a trained pronunciation skill detection model for pronunciation skill detection processing to obtain a first detection result and a second detection result;

wherein the first detection result indicates whether the text to be detected should be spoken with a pronunciation skill, and the second detection result indicates whether the speaker actually spoke the text to be detected with that pronunciation skill.

The storage medium provided by the present application stores a computer program which, when loaded by a processor, executes the steps of the pronunciation skill detection method provided by the present application.

The electronic device provided by the present application includes a processor and a memory; the memory stores a computer program, and the processor loads the computer program to execute the steps of the pronunciation skill detection method provided by the present application.

In the present application, a text to be detected is acquired, together with the audio produced by a speaker speaking that text, and the speaker's pronunciation is detected using both. Specifically, the text to be detected is converted into a corresponding phoneme sequence, acoustic features are extracted from the audio to be detected, and the phoneme sequence and acoustic features are then input into a trained pronunciation skill detection model for pronunciation skill detection processing, yielding a first detection result and a second detection result. The first detection result indicates whether the text should be spoken with a pronunciation skill, and the second indicates whether the speaker actually used that skill. Compared with the related art, the present application replaces traditional human listening tests with an artificial-intelligence-based detection approach, avoiding subjective judgment and auditory fatigue and thereby improving the accuracy of pronunciation skill detection.

Description of Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the following briefly introduces the drawings used in the description of the embodiments. Obviously, the drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of a scenario of the pronunciation skill detection system provided by an embodiment of the present application.

FIG. 2 is a schematic flowchart of the pronunciation skill detection method provided by an embodiment of the present application.

FIG. 3 is an example diagram of extracting acoustic features in an embodiment of the present application.

FIG. 4 is a structural block diagram of the pronunciation skill detection model provided by an embodiment of the present application.

FIG. 5 is a structural block diagram of the phoneme feature extraction network in the pronunciation skill detection model.

FIG. 6 is a structural block diagram of a phoneme feature extraction sub-module inside the phoneme feature extraction module of the phoneme feature extraction network.

FIG. 7 is another structural block diagram of the phoneme feature extraction network in the pronunciation skill detection model.

FIG. 8 is a structural block diagram of the feature encoding module inside the acoustic feature enhancement network of the pronunciation skill detection model.

FIG. 9 is a detailed structural block diagram of the acoustic feature enhancement module inside the acoustic feature enhancement network of the pronunciation skill detection model.

FIG. 10 is a structural block diagram of the feature fusion network in the pronunciation skill detection model.

FIG. 11 is a structural block diagram of the first pronunciation skill detection network in the pronunciation skill detection model.

FIG. 12 is a structural block diagram of a branch detection network inside the second pronunciation skill detection network of the pronunciation skill detection model.

FIG. 13 is a structural block diagram of the pronunciation skill detection apparatus provided by an embodiment of the present application.

FIG. 14 is a structural block diagram of the electronic device provided by an embodiment of the present application.

Detailed Description

It should be noted that the principles of the present application are illustrated as being implemented in a suitable computing environment. The following description is based on the illustrated specific embodiments of the present application and should not be construed as limiting other specific embodiments not detailed herein. All other embodiments obtained by those skilled in the art based on the embodiments in the present application without creative work fall within the protection scope of the present application.

Relational terms such as "first" and "second" in the following embodiments are used only to distinguish one object or operation from another, and do not imply any actual ordering relationship between those objects or operations. In the description of the embodiments of the present application, "plurality" means two or more, unless otherwise expressly and specifically defined.

Artificial intelligence (AI) is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, endowing them with the capabilities of perception, reasoning, and decision-making.

Artificial intelligence is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. AI software technology mainly includes machine learning (ML), within which deep learning (DL) is a newer research direction introduced to bring machine learning closer to its original goal, namely artificial intelligence. At present, deep learning is mainly applied in fields such as computer vision and natural language processing.

Deep learning learns the inherent regularities and representation levels of sample data, and the information obtained in the process greatly helps the interpretation of data such as text, images, and sound. Using deep learning techniques and corresponding training data sets, network models implementing different functions can be trained; for example, a deep learning network for gender classification can be trained on one training data set, and a deep learning network for image enhancement can be trained on another.

To improve the efficiency of pronunciation skill detection, the present application introduces deep learning into pronunciation skill detection and accordingly provides a pronunciation skill detection method, a pronunciation skill detection apparatus, a storage medium, and an electronic device, where the pronunciation skill detection method can be executed by an electronic device.

Referring to FIG. 1, the present application further provides a pronunciation skill detection system. As shown in FIG. 1, the system includes an electronic device 100. For example, the electronic device may acquire a text to be detected for pronunciation skill detection and convert it into a corresponding phoneme sequence; when the electronic device is also equipped with a microphone, it may capture audio while the speaker speaks the text, thereby obtaining the audio to be detected, and extract its acoustic features. It then inputs the obtained phoneme sequence and acoustic features into a trained pronunciation skill detection model for pronunciation skill detection processing to obtain a first detection result and a second detection result, where the first detection result indicates whether the text to be detected should be spoken with a pronunciation skill and the second detection result indicates whether the speaker actually spoke it with that skill.

The electronic device 100 may be any device equipped with a processor and having processing capability, such as a mobile electronic device with a processor (a smartphone, tablet computer, palmtop computer, or notebook computer) or a stationary electronic device with a processor (a desktop computer, television, or server).

In addition, as shown in FIG. 1, the pronunciation skill detection system may further include a storage device 200 for storing data, including but not limited to the raw data, intermediate data, and result data produced during pronunciation skill detection. For example, the electronic device 100 may store in the storage device 200 the acquired text to be detected and audio to be detected, the phoneme sequence converted from the text, the acoustic features extracted from the audio, and the first and second detection results output by the pronunciation skill detection model.

It should be noted that the scenario diagram of the pronunciation skill detection system shown in FIG. 1 is only an example; the pronunciation skill detection system and scenario described in the embodiments of the present application are intended to explain the technical solutions of those embodiments more clearly and do not limit them. Those of ordinary skill in the art will appreciate that, as the pronunciation skill detection system evolves and new business scenarios emerge, the technical solutions provided in the embodiments of the present application remain equally applicable to similar technical problems.

Referring to FIG. 2, FIG. 2 is a schematic flowchart of the pronunciation skill detection method provided by an embodiment of the present application. As shown in FIG. 2, the flow of the method may be as follows:

In S310, a text to be detected is acquired and converted into a corresponding phoneme sequence.

Here, the text to be detected refers to the text used for pronunciation skill detection. Pronunciation skill detection includes detecting whether the text to be detected should be spoken with a pronunciation skill, and detecting whether the speaker actually spoke the text with that skill.

It should be noted that different languages, such as Chinese and English, each have their own pronunciation skills. Taking English as an example, there are skills such as liaison (linking), loss of plosion, and voicing.

Liaison means that the final phoneme of one word and the initial phoneme of the next word are blended together naturally, without a pause in between.

Loss of plosion means that when two plosives (such as p, d, t, k, g) are adjacent, the first plosive is articulated only to the point of forming the closure at its place of articulation but is not released; after a slight pause, the following consonant is pronounced. The first plosive is then said to lose its plosion, as in goo(d)bye.

Voicing occurs when the sound before a voiceless consonant is /s/, the voiceless consonant has a corresponding voiced consonant, and a vowel follows it; in that case the voiceless consonant is pronounced as its voiced counterpart. Take "speak" as an example: the voiceless consonant /p/ is preceded by /s/, its voiced counterpart is /b/, and it is followed by the vowel /iː/, so the original /spiːk/ is pronounced /sbiːk/.

As above, the pronunciation skill detection method provided by the present application can be used to detect pronunciation skills in any language; accordingly, depending on actual detection needs, the text to be detected may be in any language. After acquiring the text to be detected, the electronic device converts it into a corresponding phoneme sequence, for example according to a pronunciation dictionary.

In an optional embodiment, converting the text to be detected into a corresponding phoneme sequence includes:

removing non-pronounced text units from the text to be detected to obtain a new text to be detected;

converting each text unit in the new text to be detected into corresponding phoneme units to obtain the phoneme sequence.

It can be understood that, for any text, not every text unit needs to be pronounced when the text is spoken; punctuation marks in the text, for example, are not pronounced.

Therefore, to eliminate interference from non-pronounced text units and improve the accuracy of pronunciation skill detection, when converting the text to be detected into the corresponding phoneme sequence, the electronic device first removes the non-pronounced text units (such as punctuation marks and emoticons) from the text to obtain a new text to be detected, and then converts each text unit of the new text into corresponding phoneme units according to the pronunciation dictionary.

For example, when English pronunciation skills are to be detected and the electronic device acquires the English text to be detected "Please turn on the light.", the punctuation mark "." in that text is a non-pronounced text unit. After removing it, the new text to be detected is "Please turn on the light"; each text unit of the new text is then converted into corresponding phoneme units according to the pronunciation dictionary, yielding a phoneme sequence of the form pliːz tɜːn ɒn ðə laɪt.

In addition, to delimit the phoneme sequence more clearly, the electronic device may also add a start marker before the phoneme sequence and an end marker after it, the start marker indicating where the sequence begins and the end marker where it ends. The specific form of the start and end markers is not limited here and can be configured by those skilled in the art according to actual needs.

For example, the start marker may be configured as "<bos>" and the end marker as "<eos>"; after the markers are added, the above phoneme sequence becomes <bos> pliːz tɜːn ɒn ðə laɪt <eos>.
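As a minimal illustration of this conversion step, the following Python sketch removes the non-pronounced units, looks each remaining word up in a toy pronunciation dictionary, and adds the start and end markers. The dictionary entries, the ASCII phoneme symbols, and the function name are illustrative assumptions, not taken from the patent:

```python
import re

# Toy pronunciation dictionary; a real system would use a full lexicon.
# All entries and ASCII phoneme symbols here are illustrative.
PRONUNCIATION_DICT = {
    "please": ["p", "l", "i:", "z"],
    "turn":   ["t", "3:", "n"],
    "on":     ["Q", "n"],
    "the":    ["D", "@"],
    "light":  ["l", "aI", "t"],
}

def text_to_phonemes(text: str) -> list[str]:
    # Step 1: remove non-pronounced text units (punctuation, symbols).
    cleaned = re.sub(r"[^\w\s]", "", text)
    # Step 2: convert each remaining text unit via the dictionary.
    phonemes = []
    for word in cleaned.lower().split():
        phonemes.extend(PRONUNCIATION_DICT[word])
    # Step 3: delimit the sequence with start and end markers.
    return ["<bos>"] + phonemes + ["<eos>"]

print(text_to_phonemes("Please turn on the light."))
# ['<bos>', 'p', 'l', 'i:', 'z', 't', '3:', 'n', 'Q', 'n',
#  'D', '@', 'l', 'aI', 't', '<eos>']
```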

In S320, the audio to be detected, obtained when the speaker speaks the text to be detected, is acquired, and acoustic features of the audio to be detected are extracted.

In this embodiment, besides converting the text to be detected into the corresponding phoneme sequence, the electronic device also acquires the audio produced by the speaker speaking the text. The data format of the audio to be detected is not limited here and can be configured by those skilled in the art according to actual detection needs.

The speaker may be a real person or a virtual one.

For example, when the speaker is a real person, the electronic device may capture that person's reading of the text through a configured audio capture device (built-in or external) and use the captured audio as the audio to be detected; alternatively, the electronic device may obtain, from another electronic device, audio that the other device has already captured of a real person speaking the text. Using the audio thus acquired, the electronic device can apply the pronunciation skill detection method provided by the present application to assess that person's pronunciation ability.

For another example, when the speaker is virtual, such as AI-based speech synthesis software, the electronic device may feed the text to be detected directly into the software, have the software synthesize speech, and use the output audio as the audio to be detected. Using the audio thus acquired, the electronic device can apply the pronunciation skill detection method provided by the present application to assess the speech synthesis capability of the software.

As above, after acquiring the audio to be detected, the electronic device also extracts its acoustic features. Acoustic features are physical quantities that represent the acoustic characteristics of speech, a collective term for the acoustic manifestations of its elements: the energy concentration regions, formant frequencies, formant intensities, and bandwidths that characterize timbre, as well as the duration, fundamental frequency, and average speech power that characterize prosody.

In an optional embodiment, to further improve the accuracy of pronunciation skill detection, extracting the acoustic features of the audio to be detected includes:

extracting Filterbank features, fundamental frequency features, and energy features of the audio to be detected;

fusing the Filterbank features, fundamental frequency features, and energy features to obtain the acoustic features.

In this embodiment, the Filterbank, fundamental frequency, and energy features are treated as the acoustic features relevant to pronunciation skills; accordingly, when extracting acoustic features for pronunciation skill detection, the electronic device extracts these three feature types from the audio to be detected. The dimensionality of the extracted Filterbank features is not limited here and can be configured by those skilled in the art according to actual needs; for example, in this embodiment the electronic device may extract 40-dimensional Filterbank features.

As above, after extracting the Filterbank, fundamental frequency, and energy features of the audio to be detected, the electronic device fuses them according to a configured fusion strategy and uses the fused features as the acoustic features for pronunciation skill detection. The fusion strategy is not limited here and can be configured by those skilled in the art according to actual needs.

For example, referring to FIG. 3, the fusion strategy configured in this embodiment is to splice the Filterbank, fundamental frequency, and energy features along the time dimension, thereby obtaining the acoustic features for pronunciation skill detection.
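A sketch of this feature extraction under stated assumptions follows: librosa is used as the extractor, a 16 kHz sample rate and 10 ms frame shift are assumed, and splicing along the time dimension is read as aligning the three features frame by frame and stacking them per time step. None of these specifics are fixed by the patent:

```python
import librosa
import numpy as np

def extract_acoustic_features(wav_path: str, n_mels: int = 40) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=16000)
    hop = 160  # 10 ms frame shift at 16 kHz (assumed)

    # 40-dimensional log-Mel Filterbank features, shape (T, 40).
    fbank = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=400, hop_length=hop, n_mels=n_mels)
    fbank = np.log(fbank + 1e-6).T

    # Fundamental frequency (F0) per frame, shape (T,).
    f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr,
                     frame_length=400, hop_length=hop)

    # Per-frame energy (RMS), shape (T,).
    energy = librosa.feature.rms(y=y, frame_length=400, hop_length=hop)[0]

    # Align frame counts, then stack the three features per time step,
    # giving one fused feature vector per frame: shape (T, 42).
    t = min(len(fbank), len(f0), len(energy))
    return np.concatenate(
        [fbank[:t], f0[:t, None], energy[:t, None]], axis=1)
```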

In S330, the phoneme sequence and the acoustic features are input into the trained pronunciation skill detection model for pronunciation skill detection processing to obtain a first detection result and a second detection result.

It should be noted that the present application pre-trains a corresponding pronunciation skill detection model for each language: for Chinese, a model for detecting Chinese pronunciation skills is pre-trained, and for English, a model for detecting English pronunciation skills. The structure and training method of the pronunciation skill detection model are not limited here and can be selected by those skilled in the art according to actual needs.

The pronunciation skill detection model is configured to take as input the acoustic features of the audio to be detected, produced by the speaker speaking the text, together with the phoneme sequence derived from the text, and to output a detection result indicating whether the text should be spoken with a pronunciation skill and a detection result indicating whether the speaker actually spoke it with that skill.

Accordingly, in this embodiment, after obtaining the phoneme sequence and acoustic features described above, the electronic device inputs them into the trained pronunciation skill detection model matching the language of the text to be detected for pronunciation skill detection processing, and obtains the first detection result and second detection result output by the model. The first detection result indicates whether the text should be spoken with a pronunciation skill, and the second indicates whether the speaker actually used that skill.

Taking the text to be detected "Please turn on the light" as an example, expert knowledge tells us that the phonemes "n" and "ɑ" form a liaison pair, so "turn" and "on" should be linked. For the phoneme sequence and acoustic features of this text, the first detection result output by the model will indicate that the pronunciation skill of liaison should be used, and, depending on whether the speaker actually links the two words, the model will output the corresponding second detection result.

In addition, the phoneme sequence may be input in its original form, or it may first be digitally encoded, converting it into a numeric phoneme sequence in which each phoneme is represented by a number. Correspondingly, if the phoneme sequence is input in numeric form, the pronunciation skill detection model must be trained on phoneme sequence samples in numeric form.
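A minimal sketch of this numeric encoding, assuming a hypothetical phoneme-to-id vocabulary (the ids and symbols below are illustrative):

```python
# Each phoneme and each special marker is mapped to an integer id.
PHONEME_VOCAB = {"<pad>": 0, "<bos>": 1, "<eos>": 2,
                 "p": 3, "l": 4, "i:": 5, "z": 6, "t": 7,
                 "3:": 8, "n": 9, "Q": 10, "D": 11, "@": 12, "aI": 13}

def encode_phonemes(phonemes: list[str]) -> list[int]:
    return [PHONEME_VOCAB[p] for p in phonemes]

print(encode_phonemes(["<bos>", "t", "3:", "n", "Q", "n", "<eos>"]))
# [1, 7, 8, 9, 10, 9, 2]
```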

In an optional embodiment, the pronunciation skill detection model includes a phoneme feature extraction network, an acoustic feature enhancement network, a feature fusion network, a first pronunciation skill detection network, and a second pronunciation skill detection network, and inputting the phoneme sequence and the acoustic features into the trained pronunciation skill detection model for pronunciation skill detection processing to obtain the first detection result and the second detection result includes:

inputting the phoneme sequence into the phoneme feature extraction network for feature extraction processing to obtain a phoneme feature matrix;

inputting the phoneme feature matrix into the first pronunciation skill detection network for pronunciation skill detection processing to obtain the first detection result;

if the first detection result indicates that the text to be detected should be spoken with a pronunciation skill, inputting the acoustic features into the acoustic feature enhancement network for feature enhancement processing to obtain an enhanced acoustic feature matrix;

inputting the enhanced acoustic feature matrix and the phoneme feature matrix into the feature fusion network for feature fusion processing to obtain a fused feature matrix;

inputting the fused feature matrix into the second pronunciation skill detection network for pronunciation skill detection processing to obtain the second detection result.

Referring to FIG. 4, the pronunciation skill detection model provided by this embodiment consists of five parts: the phoneme feature extraction network, the acoustic feature enhancement network, the feature fusion network, the first pronunciation skill detection network, and the second pronunciation skill detection network.

The phoneme feature extraction network is configured to perform feature extraction on the input phoneme sequence to obtain a phoneme feature matrix reflecting the relationships among the phonemes in the sequence.

The acoustic feature enhancement network is configured to perform feature enhancement processing on the input acoustic features, strengthening those features most relevant to pronunciation skills, to obtain the enhanced acoustic feature matrix.

The feature fusion network is configured to fuse the input phoneme feature matrix and the enhanced acoustic feature matrix, capturing the interaction between the phoneme and acoustic information, to obtain the fused feature matrix.

The first pronunciation skill detection network is configured to perform pronunciation skill detection processing on the input phoneme feature matrix and to output the first detection result, which indicates whether the text to be detected should be spoken with a pronunciation skill.

The second pronunciation skill detection network is configured to perform pronunciation skill detection processing on the input fused feature matrix and to output the second detection result, which indicates whether the speaker actually spoke the text to be detected with that pronunciation skill.

Accordingly, in this embodiment, when inputting the phoneme sequence and the acoustic features into the trained pronunciation skill detection model for pronunciation skill detection processing, the electronic device may input the phoneme sequence into the phoneme feature extraction network for feature extraction processing to obtain the phoneme feature matrix, and then input the phoneme feature matrix into the first pronunciation skill detection network for pronunciation skill detection processing to obtain the first detection result.

At the same time, the electronic device inputs the acoustic features into the acoustic feature enhancement network for feature enhancement processing to obtain the enhanced acoustic feature matrix, inputs the enhanced acoustic feature matrix and the phoneme feature matrix into the feature fusion network for feature fusion processing to obtain the fused feature matrix, and inputs the fused feature matrix into the second pronunciation skill detection network for pronunciation skill detection processing to obtain the second detection result.

In addition, the electronic device may decide, based on the first detection result, whether to output the second detection result. After obtaining the first and second detection results produced by the pronunciation skill detection model, the electronic device determines from the first detection result whether the text should be spoken with a pronunciation skill. If so, it outputs both results at once: the first detection result indicating that a pronunciation skill should be used, and the second indicating whether the speaker actually used it. If not, the electronic device may discard the second detection result and output only the first.

In other embodiments, only if the first detection result indicates that the text to be detected should be spoken with a pronunciation skill does the electronic device input the acoustic features into the acoustic feature enhancement network for feature enhancement processing to obtain the enhanced acoustic feature matrix. The electronic device then inputs the enhanced acoustic feature matrix and the phoneme feature matrix into the feature fusion network for feature fusion processing to obtain the fused feature matrix, and finally inputs the fused feature matrix into the second pronunciation skill detection network for pronunciation skill detection processing to obtain the second detection result.

Moreover, if the first detection result indicates that no pronunciation skill is needed to speak the text to be detected, no further pronunciation skill detection is necessary; in that case the electronic device no longer uses the acoustic features for detection and may output only the first detection result.
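The following PyTorch sketch shows only the dataflow among the five parts, including the conditional use of the second branch described above. Each sub-network is a deliberately simplified single-layer stand-in, since the patent leaves the internal structures to later embodiments; the dimensions, the pooling-based fusion, and the convention that class 0 means "no skill required" are all assumptions:

```python
import torch
import torch.nn as nn

class PronunciationSkillModel(nn.Module):
    """Dataflow sketch of the five-part model of FIG. 4."""

    def __init__(self, vocab_size=100, d_model=256, acoustic_dim=42,
                 num_skills=3):
        super().__init__()
        self.phoneme_net = nn.Embedding(vocab_size, d_model)  # phoneme feature extraction network
        self.acoustic_net = nn.Linear(acoustic_dim, d_model)  # acoustic feature enhancement network
        self.fusion_net = nn.Linear(2 * d_model, d_model)     # feature fusion network
        self.first_head = nn.Linear(d_model, num_skills)      # first pronunciation skill detection network
        self.second_head = nn.Linear(d_model, num_skills)     # second pronunciation skill detection network

    def forward(self, phoneme_ids, acoustic_feats):
        phoneme_mat = self.phoneme_net(phoneme_ids)           # (B, Lp, D)
        first_logits = self.first_head(phoneme_mat.mean(1))   # should a skill be used?
        if (first_logits.argmax(-1) > 0).any():
            # Only when a skill is required: enhance, fuse, detect again.
            enhanced = self.acoustic_net(acoustic_feats)      # (B, La, D)
            fused = self.fusion_net(torch.cat(
                [phoneme_mat.mean(1), enhanced.mean(1)], dim=-1))
            return first_logits, self.second_head(fused)      # was the skill used?
        return first_logits, None

model = PronunciationSkillModel()
first, second = model(torch.randint(0, 100, (1, 7)),
                      torch.randn(1, 500, 42))
```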

In an optional embodiment, the phoneme feature extraction network includes a phoneme embedding module and a phoneme feature extraction module, and inputting the phoneme sequence into the phoneme feature extraction network for feature extraction processing to obtain the phoneme feature matrix includes:

inputting the phoneme sequence into the phoneme embedding module for embedding processing to obtain a phoneme vector matrix;

inputting the phoneme vector matrix into the phoneme feature extraction module for feature extraction processing to obtain the phoneme feature matrix.

Referring to FIG. 5, in this embodiment the phoneme feature extraction network consists of two parts, a phoneme embedding module and a phoneme feature extraction module. The phoneme embedding module is configured to perform embedding processing on the input phoneme sequence, vectorizing it into a phoneme vector matrix; the phoneme feature extraction module is configured to perform feature extraction on the input phoneme vector matrix to obtain a phoneme feature matrix reflecting the relationships among the phonemes in the sequence.

Accordingly, in this embodiment, when inputting the phoneme sequence into the phoneme feature extraction network for feature extraction processing, the electronic device first inputs the phoneme sequence into the phoneme embedding module for embedding processing to obtain the phoneme vector matrix, and then inputs the phoneme vector matrix into the phoneme feature extraction module for feature extraction processing to obtain the phoneme feature matrix.
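A minimal sketch of the phoneme embedding module, assuming the numeric phoneme ids from the earlier example and an illustrative vocabulary size and embedding dimension:

```python
import torch
import torch.nn as nn

# The phoneme embedding module maps each phoneme id to a dense vector.
embedding = nn.Embedding(num_embeddings=100, embedding_dim=256)

phoneme_ids = torch.tensor([[1, 7, 8, 9, 10, 9, 2]])  # (batch=1, length=7)
phoneme_vector_matrix = embedding(phoneme_ids)
print(phoneme_vector_matrix.shape)  # torch.Size([1, 7, 256])
```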

The phoneme feature extraction module includes at least one phoneme feature extraction sub-module, and inputting the phoneme vector matrix into the phoneme feature extraction module for feature extraction processing to obtain the phoneme feature matrix includes:

when there is one phoneme feature extraction sub-module, inputting the phoneme vector matrix into the phoneme feature extraction sub-module for feature extraction processing to obtain the phoneme feature matrix; or,

when there are N phoneme feature extraction sub-modules, inputting the phoneme vector matrix into the N phoneme feature extraction sub-modules for sequential feature extraction processing to obtain the phoneme feature matrix, N being an integer greater than 1. The value of N is not limited here and can be configured by those skilled in the art according to actual needs; for example, N may be set to 2.

In this embodiment, the phoneme feature extraction module may consist of a single phoneme feature extraction sub-module, or of N phoneme feature extraction sub-modules connected in sequence. When it consists of N sub-modules, each sub-module performs the same feature extraction processing. The feature extraction process of a single sub-module is described below as an example.

Referring to FIG. 6, a phoneme feature extraction sub-module consists of three sub-layers: a first matrix conversion layer, a first multi-head attention layer, and a first matrix fusion layer, wherein:

the first matrix conversion layer is configured to perform matrix conversion processing on the input matrix, converting it into a query matrix, a key matrix, and a value matrix;

the first multi-head attention layer is configured to perform attention enhancement processing on the input query, key, and value matrices to obtain an attention-enhanced matrix;

the first matrix fusion layer is configured to perform matrix fusion processing on the input matrix of the first matrix conversion layer and the output matrix of the first multi-head attention layer to obtain a fused matrix.

Accordingly, when there is one phoneme feature extraction sub-module, the electronic device may extract the phoneme feature matrix as follows:

inputting the phoneme vector matrix into the first matrix conversion layer for matrix conversion processing to obtain a query matrix, a key matrix, and a value matrix, denoted respectively the first query matrix, the first key matrix, and the first value matrix;

inputting the first query matrix, the first key matrix, and the first value matrix into the first multi-head attention layer for attention enhancement processing to obtain an attention-enhanced matrix, denoted the first attention-enhanced matrix;

inputting the first attention-enhanced matrix and the phoneme vector matrix into the first matrix fusion layer for matrix fusion processing to obtain a fused matrix, which is taken as the phoneme feature matrix.

It can be understood that, when there are N phoneme feature extraction sub-modules, the phoneme vector matrix only needs to be input into the first of the N sequentially connected sub-modules; the N sub-modules then perform feature extraction processing in turn in the manner above, and the fused matrix output by the N-th sub-module is taken as the phoneme feature matrix.

How the first matrix fusion layer fuses the matrices is not specifically limited in this embodiment and can be configured by those skilled in the art according to actual needs.

For example, the first matrix fusion layer may include two sub-layers, an addition layer and a layer normalization layer: when fusing the matrices, the addition layer first adds the two input matrices to obtain a sum matrix, and the layer normalization layer then applies layer normalization to the sum matrix to obtain the fused matrix.
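A sketch of one such sub-module in PyTorch follows, with the first matrix conversion layer realized as three linear projections, the first multi-head attention layer as nn.MultiheadAttention, and the first matrix fusion layer as addition plus layer normalization, matching the example above. The model dimension and head count are assumptions:

```python
import torch
import torch.nn as nn

class PhonemeFeatureExtractionSubmodule(nn.Module):
    def __init__(self, d_model=256, num_heads=4):
        super().__init__()
        # First matrix conversion layer: three linear maps produce the
        # query, key and value matrices from the input matrix.
        self.to_q = nn.Linear(d_model, d_model)
        self.to_k = nn.Linear(d_model, d_model)
        self.to_v = nn.Linear(d_model, d_model)
        # First multi-head attention layer.
        self.attn = nn.MultiheadAttention(d_model, num_heads,
                                          batch_first=True)
        # First matrix fusion layer: add the sub-module input to the
        # attention output, then layer-normalize.
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                    # x: (B, L, d_model)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        attn_out, _ = self.attn(q, k, v)     # attention-enhanced matrix
        return self.norm(x + attn_out)       # fused matrix

# N sub-modules connected in sequence (here N = 2):
extractor = nn.Sequential(PhonemeFeatureExtractionSubmodule(),
                          PhonemeFeatureExtractionSubmodule())
phoneme_feature_matrix = extractor(torch.randn(1, 7, 256))
```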

在一可选的实施例中,请参照图7,音素特征提取网络还包括第一位置编码模块和第二矩阵融合层,将音素向量矩阵输入音素特征提取模块进行特征提取处理,得到音素特征矩阵之前,还包括:In an optional embodiment, please refer to FIG. 7 , the phoneme feature extraction network also includes a first position encoding module and a second matrix fusion layer, and the phoneme vector matrix is input into the phoneme feature extraction module for feature extraction processing to obtain a phoneme feature matrix. Before, also included:

将音素向量矩阵输入第一位置编码模块进行位置编码处理,得到第一位置编码矩阵;Input the phoneme vector matrix into the first position encoding module for position encoding processing to obtain the first position encoding matrix;

将第一位置编码矩阵和音素向量矩阵输入第二矩阵融合层进行矩阵融合处理,得到音素位置融合矩阵;Inputting the first position encoding matrix and the phoneme vector matrix into the second matrix fusion layer to perform matrix fusion processing to obtain a phoneme position fusion matrix;

将音素向量矩阵输入音素特征提取模块进行特征提取处理,得到音素特征矩阵,包括:Input the phoneme vector matrix into the phoneme feature extraction module for feature extraction processing to obtain a phoneme feature matrix, including:

将音素位置融合矩阵输入音素特征提取模块进行特征提取处理,得到音素特征矩阵。The phoneme position fusion matrix is input into the phoneme feature extraction module for feature extraction processing, and the phoneme feature matrix is obtained.

本实施例中,为了进一步提升发音技巧检测的准确性,本实施例中并不将原始的音素向量矩阵输入音素特征提取模块进行特征提取,而是对其进行位置编码后,将携带了位置信息的音素向量矩阵输入音素特征提取模块进行特征提取。In this embodiment, in order to further improve the accuracy of pronunciation skill detection, in this embodiment, the original phoneme vector matrix is not input into the phoneme feature extraction module for feature extraction, but after position encoding, the position information will be carried. The phoneme vector matrix is input to the phoneme feature extraction module for feature extraction.

其中,电子设备首先将音素向量矩阵输入第一位置编码模块进行位置编码处理,得到位置编码矩阵,记为第一位置编码矩阵,该第一位置编码矩阵表征了音素向量矩阵中各矩阵单元的位置信息,可以是相对位置信息,也可以是绝对位置信息。Wherein, the electronic device first inputs the phoneme vector matrix into the first position coding module for position coding processing to obtain a position coding matrix, which is denoted as the first position coding matrix, and the first position coding matrix represents the position of each matrix unit in the phoneme vector matrix The information can be relative position information or absolute position information.

在获取到如上第一位置编码矩阵之后,电子设备即将第一位置编码矩阵和音素向量矩阵输入第二矩阵融合层进行矩阵融合处理,得到融合矩阵,记为音素位置融合矩阵。之后,电子设备进一步将该携带了位置信息的音素位置融合矩阵输入音素特征提取模块进行特征提取处理,得到音素特征矩阵。其中,对于音素特征提取模块如何进行特征提取处理,具体请参照以上实施例的相关说明,此处不再赘述。After obtaining the first position encoding matrix as above, the electronic device inputs the first position encoding matrix and the phoneme vector matrix into the second matrix fusion layer for matrix fusion processing to obtain a fusion matrix, which is recorded as a phoneme position fusion matrix. After that, the electronic device further inputs the phoneme position fusion matrix carrying the position information into the phoneme feature extraction module to perform feature extraction processing to obtain a phoneme feature matrix. Wherein, for how the phoneme feature extraction module performs the feature extraction process, please refer to the relevant description of the above embodiment for details, and details are not repeated here.

In addition, it should be noted that this embodiment does not specifically limit the matrix fusion manner of the second matrix fusion layer, which may be configured by those skilled in the art according to actual needs. For example, the second matrix fusion layer may be configured to add the two input matrices and output the resulting sum matrix as the fusion matrix.

In an optional embodiment, the acoustic feature enhancement network includes a feature encoding module and at least one acoustic feature enhancement module, and inputting the acoustic features into the acoustic feature enhancement network for feature enhancement processing to obtain the enhanced acoustic feature matrix includes:

inputting the acoustic features into the feature encoding module for feature encoding processing to obtain an acoustic feature matrix; and

when there is one acoustic feature enhancement module, inputting the acoustic feature matrix into that acoustic feature enhancement module for feature enhancement processing to obtain the enhanced acoustic feature matrix; or,

when there are M acoustic feature enhancement modules, inputting the acoustic feature matrix into the M acoustic feature enhancement modules for feature enhancement processing in sequence to obtain the enhanced acoustic feature matrix, where M is an integer greater than 1. The value of M is not specifically limited here and may be configured by those skilled in the art according to actual needs; for example, M may be set to 4.

It should be noted that the Filterbank features, fundamental frequency features, and energy features obtained in the above embodiments are all presented in the form of feature maps. Correspondingly, the acoustic features obtained by fusing the Filterbank, fundamental frequency, and energy features are also presented in the form of a feature map.

To enhance the acoustic features effectively, in this embodiment the acoustic feature enhancement network consists of one feature encoding module and at least one acoustic feature enhancement module. The feature encoding module is configured to encode the acoustic features and compress the feature dimensions to obtain the corresponding acoustic feature matrix; the acoustic feature enhancement module is configured to perform feature enhancement processing on the input acoustic feature matrix, strengthening the features related to pronunciation skills, to obtain the enhanced acoustic feature matrix.

Correspondingly, when inputting the acoustic features into the acoustic feature enhancement network for feature enhancement processing, the electronic device first inputs the acoustic features into the feature encoding module for feature encoding processing to obtain the acoustic feature matrix.

Referring to FIG. 8, the feature encoding module consists of four sub-layers: a first convolution layer, a first pooling layer, a second convolution layer, and a second pooling layer, where:

the first convolution layer is configured to convolve the input feature map to obtain a corresponding convolution result;

the first pooling layer is configured to pool the convolution result output by the first convolution layer to obtain a corresponding pooling result;

the second convolution layer is configured to convolve the pooling result output by the first pooling layer to obtain a corresponding convolution result; and

the second pooling layer is configured to pool the convolution result output by the second convolution layer to obtain a feature matrix corresponding to the feature map.

Correspondingly, the electronic device may input the acoustic features into the feature encoding module for feature encoding processing as follows:

inputting the acoustic features into the first convolution layer for convolution processing to obtain a convolution result, denoted as the first convolution result;

inputting the first convolution result into the first pooling layer for pooling processing to obtain a pooling result, denoted as the first pooling result;

inputting the first pooling result into the second convolution layer for convolution processing to obtain a convolution result, denoted as the second convolution result; and

inputting the second convolution result into the second pooling layer for pooling processing to obtain the acoustic feature matrix.

It should be noted that this embodiment places no specific restriction on the kernel size, stride, or padding of the first and second convolution layers, or on the pooling kernel size and stride of the first and second pooling layers; these may be configured by those skilled in the art according to actual needs.

For example, in this embodiment the first convolution layer is configured with a kernel size of [3,3], a stride of [1,1], and a padding of [1,1]; the second convolution layer is configured with a kernel size of [3,3], a stride of [1,1], and a padding of [1,1]; the first pooling layer is configured as max pooling with a kernel size of [2,2] and a stride of [1,1]; and the second pooling layer is configured as max pooling with a kernel size of [2,2] and a stride of [1,1].
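
By way of a non-limiting illustration, the feature encoding module with the example hyper-parameters above could be sketched in PyTorch as follows; the channel counts are assumptions not specified in this disclosure.

```python
# A PyTorch sketch of the feature encoding module with the example
# hyper-parameters above; the channel counts are assumptions.
import torch.nn as nn

class FeatureEncoder(nn.Module):
    def __init__(self, in_ch: int = 1, mid_ch: int = 16, out_ch: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=3, stride=1, padding=1),
            nn.MaxPool2d(kernel_size=2, stride=1),   # first max pooling
            nn.Conv2d(mid_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.MaxPool2d(kernel_size=2, stride=1),   # second max pooling
        )

    def forward(self, feature_map):   # (batch, channels, freq, time)
        return self.net(feature_map)  # acoustic feature matrix
```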

Further, when there is one acoustic feature enhancement module, the electronic device directly inputs the acoustic feature matrix into that module for feature enhancement processing to obtain the enhanced acoustic feature matrix.

The feature enhancement process of a single acoustic feature enhancement module is described below as an example.

Referring to FIG. 9, the acoustic feature enhancement module consists of six sub-layers: a second matrix conversion layer, a second multi-head attention layer, a third matrix fusion layer, a third convolution layer, a deconvolution layer, and a fourth matrix fusion layer, where:

the second matrix conversion layer is configured to perform matrix conversion processing on the input matrix, converting it into a query matrix, a key matrix, and a value matrix;

the second multi-head attention layer is configured to perform attention enhancement processing on the input query, key, and value matrices to obtain an attention enhancement matrix;

the third matrix fusion layer is configured to perform matrix fusion processing on the input matrix of the second matrix conversion layer and the output matrix of the second multi-head attention layer to obtain a fusion matrix;

the third convolution layer is configured to convolve the input fusion matrix to obtain a convolution result;

the deconvolution layer is configured to deconvolve the input convolution result to obtain a deconvolution result in matrix form; and

the fourth matrix fusion layer is configured to perform matrix fusion processing on the fusion matrix output by the third matrix fusion layer and the deconvolution result output by the deconvolution layer to obtain a fusion matrix.

Correspondingly, when there is one acoustic feature enhancement module, the electronic device may obtain the enhanced acoustic feature matrix as follows:

inputting the acoustic feature matrix into the second matrix conversion layer for matrix conversion processing to obtain a query matrix, a key matrix, and a value matrix, denoted as the second query matrix, the second key matrix, and the second value matrix, respectively;

inputting the second query matrix, the second key matrix, and the second value matrix into the second multi-head attention layer for attention enhancement processing to obtain an attention enhancement matrix, denoted as the second attention enhancement matrix;

inputting the second attention enhancement matrix and the acoustic feature matrix into the third matrix fusion layer for matrix fusion processing to obtain a fusion matrix, denoted as the acoustic fusion matrix;

inputting the acoustic fusion matrix into the third convolution layer for convolution processing to obtain a third convolution result;

inputting the third convolution result into the deconvolution layer for deconvolution processing to obtain a deconvolution result in matrix form; and

inputting the acoustic fusion matrix and the deconvolution result into the fourth matrix fusion layer for matrix fusion processing to obtain a fusion matrix, denoted as the enhanced acoustic feature matrix.

How the third and fourth matrix fusion layers fuse matrices is not specifically limited in this embodiment and may be configured by those skilled in the art according to actual needs.

For example, the third and fourth matrix fusion layers may share the same structure, each including two sub-layers: an addition layer and a layer normalization layer. When fusing matrices, the addition layer first adds the two input matrices to obtain a sum matrix, and the layer normalization layer then applies layer normalization to the sum matrix to obtain the fusion matrix.
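
A hedged PyTorch sketch of one such acoustic feature enhancement module is shown below; the feature dimension, head count, and kernel sizes are assumptions, and nn.MultiheadAttention stands in for the second matrix conversion layer plus the second multi-head attention layer, since it derives the query/key/value projections internally.

```python
# A hedged sketch of one acoustic feature enhancement module; dim, heads,
# and kernel sizes are illustrative assumptions.
import torch
import torch.nn as nn

class AcousticEnhancer(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)                     # third fusion layer
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.deconv = nn.ConvTranspose1d(dim, dim, kernel_size=3, padding=1)
        self.norm2 = nn.LayerNorm(dim)                     # fourth fusion layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, time, dim)
        att, _ = self.attn(x, x, x)                        # attention enhancement
        fused = self.norm1(x + att)                        # add-and-normalize fusion
        y = self.deconv(self.conv(fused.transpose(1, 2))).transpose(1, 2)
        return self.norm2(fused + y)                       # second add-and-normalize
```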

It should be noted that when there are M acoustic feature enhancement modules, each module performs the same feature enhancement processing. The acoustic feature matrix only needs to be input into the first of the M sequentially connected acoustic feature enhancement modules; the M modules then perform feature enhancement processing in sequence as above, and the fusion feature output by the M-th module is taken as the enhanced acoustic feature matrix.

In addition, this embodiment places no specific restriction on the kernel size, stride, or padding of the third convolution layer and the deconvolution layer; these may be chosen by those skilled in the art according to actual needs.

It can be understood that, by introducing convolution and deconvolution processing into the enhancement of the acoustic features, this embodiment can more effectively enhance acoustic features in feature map form and ultimately extract the features most relevant to pronunciation skills.
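
Continuing the sketch above, chaining M sequentially connected modules (M = 4 in the example) could be expressed as follows; AcousticEnhancer refers to the hypothetical class sketched earlier.

```python
# M = 4 sequentially connected enhancement modules, applied one after another.
import torch.nn as nn

enhancer_stack = nn.Sequential(*[AcousticEnhancer() for _ in range(4)])
# enhanced = enhancer_stack(acoustic_feature_matrix)
```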

In an optional embodiment, the acoustic feature enhancement network further includes a second position encoding module and a fifth matrix fusion layer. Before the acoustic feature matrix is input into the acoustic feature enhancement module for feature enhancement processing to obtain the enhanced acoustic feature matrix, the method further includes:

inputting the acoustic feature matrix into the second position encoding module for position encoding processing to obtain a second position encoding matrix; and

inputting the second position encoding matrix and the acoustic feature matrix into the fifth matrix fusion layer for matrix fusion processing to obtain an acoustic position fusion matrix.

Accordingly, inputting the acoustic feature matrix into the acoustic feature enhancement module for feature enhancement processing to obtain the enhanced acoustic feature matrix includes:

inputting the acoustic position fusion matrix into the acoustic feature enhancement module for feature enhancement processing to obtain the enhanced acoustic feature matrix.

In this embodiment, to further improve the accuracy of pronunciation skill detection, the original acoustic feature matrix is not fed directly into the acoustic feature enhancement module. Instead, it is first position-encoded, and the acoustic feature matrix carrying position information is then input into the acoustic feature enhancement module for feature enhancement.

Specifically, the electronic device first inputs the acoustic feature matrix into the second position encoding module for position encoding processing to obtain a position encoding matrix, denoted as the second position encoding matrix. The second position encoding matrix represents the position information of each matrix unit in the acoustic feature matrix, which may be relative position information or absolute position information.

After obtaining the second position encoding matrix, the electronic device inputs the second position encoding matrix and the acoustic feature matrix into the fifth matrix fusion layer for matrix fusion processing to obtain a fusion matrix, denoted as the acoustic position fusion matrix. The electronic device then inputs this acoustic position fusion matrix, which carries the position information, into the acoustic feature enhancement module for feature enhancement processing to obtain the enhanced acoustic feature matrix. For how the acoustic feature enhancement module performs feature enhancement, refer to the related description of the above embodiments, which is not repeated here.

In addition, it should be noted that this embodiment does not specifically limit the matrix fusion manner of the fifth matrix fusion layer, which may be configured by those skilled in the art according to actual needs. For example, the fifth matrix fusion layer may be configured to add the two input matrices and output the resulting sum matrix as the fusion matrix.

In an optional embodiment, referring to FIG. 10, the feature fusion network includes a third matrix conversion layer, a fourth matrix conversion layer, a third multi-head attention layer, a sixth matrix fusion layer, a feedforward network layer, and a seventh matrix fusion layer, where:

the third matrix conversion layer is configured to perform matrix conversion processing on the input matrix to obtain a key matrix and a value matrix;

the fourth matrix conversion layer is configured to perform matrix conversion processing on the input matrix to obtain a query matrix;

the third multi-head attention layer is configured to perform attention enhancement processing on the key and value matrices output by the third matrix conversion layer and the query matrix output by the fourth matrix conversion layer to obtain an attention enhancement matrix;

the sixth matrix fusion layer is configured to perform matrix fusion processing on the attention enhancement matrix output by the third multi-head attention layer to obtain a fusion matrix;

the feedforward network layer is configured to perform feedforward computation on the fusion matrix output by the sixth matrix fusion layer to obtain a feedforward matrix; and

the seventh matrix fusion layer is configured to perform matrix fusion processing on the feedforward matrix output by the feedforward network layer and the fusion matrix output by the sixth matrix fusion layer to obtain the fusion feature matrix.

Correspondingly, the electronic device may input the enhanced acoustic feature matrix and the phoneme feature matrix into the feature fusion network for feature fusion processing as follows:

inputting the enhanced acoustic feature matrix into the third matrix conversion layer for matrix conversion processing to obtain a key matrix and a value matrix, denoted as the third key matrix and the third value matrix, respectively;

inputting the phoneme feature matrix into the fourth matrix conversion layer for matrix conversion processing to obtain a query matrix, denoted as the third query matrix;

inputting the third query matrix, the third key matrix, and the third value matrix into the third multi-head attention layer for attention enhancement processing to obtain an attention enhancement matrix, denoted as the third attention enhancement matrix;

inputting the third attention enhancement matrix and the third query matrix into the sixth matrix fusion layer for matrix fusion processing to obtain a fusion matrix, denoted as the acoustic-phoneme fusion matrix;

inputting the acoustic-phoneme fusion matrix into the feedforward network layer for feedforward computation to obtain a feedforward matrix; and

inputting the feedforward matrix and the acoustic-phoneme fusion matrix into the seventh matrix fusion layer for matrix fusion processing to obtain a fusion matrix, denoted as the fusion feature matrix.

How the sixth and seventh matrix fusion layers fuse matrices is not specifically limited in this embodiment and may be configured by those skilled in the art according to actual needs.

For example, the sixth and seventh matrix fusion layers may share the same structure, each including two sub-layers: an addition layer and a layer normalization layer. When fusing matrices, the addition layer first adds the two input matrices to obtain a sum matrix, and the layer normalization layer then applies layer normalization to the sum matrix to obtain the fusion matrix.
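
By way of a non-limiting illustration, the feature fusion network could be sketched in PyTorch as below; for simplicity the phoneme feature matrix stands in for the third query matrix in the residual fusion, and the dimensions, head count, and hidden sizes are assumptions.

```python
# A hedged sketch of the feature fusion network: cross-attention where the
# phoneme features supply the query and the enhanced acoustic features
# supply the key and value, with two add-and-normalize fusions.
import torch
import torch.nn as nn

class FusionNetwork(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)                     # sixth fusion layer
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))  # feedforward layer
        self.norm2 = nn.LayerNorm(dim)                     # seventh fusion layer

    def forward(self, phoneme, acoustic):                  # (batch, time, dim) each
        att, _ = self.attn(phoneme, acoustic, acoustic)
        fused = self.norm1(phoneme + att)                  # acoustic-phoneme fusion
        return self.norm2(fused + self.ffn(fused))         # fusion feature matrix
```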

In an optional embodiment, referring to FIG. 11, the first pronunciation skill detection network includes a first fully connected layer and a first classification function layer, and inputting the phoneme feature matrix into the first pronunciation skill detection network for pronunciation skill detection processing to obtain the first detection result includes:

inputting the phoneme feature matrix into the first fully connected layer for fully connected processing to obtain a first fully connected result; and

inputting the first fully connected result into the first classification function layer for classification processing to obtain the first detection result.

It should be noted that, since this embodiment detects multiple classes of pronunciation skills, the first classification function layer may correspondingly adopt any multi-class classification function.

Taking the Softmax function as an example, the dimension of its output vector matches the number of pronunciation skills to be detected. For instance, taking English as an example, if the pronunciation skills to be detected include linking, loss of plosion, and voicing, the output vector of the Softmax function includes elements in four dimensions: one element indicates whether the text to be detected should be spoken with the pronunciation skill "linking", one indicates whether it should be spoken with "loss of plosion", one indicates whether it should be spoken with "voicing", and one indicates that no pronunciation skill is needed to speak the text to be detected.

Correspondingly, the first fully connected result is input into the Softmax function to obtain its 4-dimensional output vector, which is taken as the first detection result. From the first detection result, it can be determined whether a pronunciation skill is needed to speak the text to be detected and, if so, which pronunciation skill should be used.
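
A minimal sketch of such a detection head follows; the mean pooling over the sequence and the feature dimension are assumptions, since the disclosure does not specify how the feature matrix is reduced before the fully connected layer.

```python
# A hedged sketch of the first pronunciation skill detection network:
# a fully connected layer followed by Softmax over the skill classes.
import torch
import torch.nn as nn

class FirstDetectionHead(nn.Module):
    def __init__(self, dim: int = 256, num_classes: int = 4):
        super().__init__()
        # 4 classes in the English example: linking, loss of plosion,
        # voicing, and "no pronunciation skill needed".
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, phoneme_feats: torch.Tensor) -> torch.Tensor:
        pooled = phoneme_feats.mean(dim=1)             # (batch, dim); assumption
        return torch.softmax(self.fc(pooled), dim=-1)  # first detection result
```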

In an optional embodiment, the second pronunciation skill detection network includes L branch detection networks, each corresponding to a different pronunciation skill, where L is an integer greater than 1, and inputting the fusion feature matrix into the second pronunciation skill detection network for pronunciation skill detection processing to obtain the second detection result includes:

inputting the fusion feature matrix into each branch detection network for pronunciation skill detection to obtain the branch pronunciation skill detection result of each branch detection network, where the branch pronunciation skill detection result of each branch detection network indicates whether the speaker spoke the text to be detected using the pronunciation skill corresponding to that branch detection network; and

obtaining the second detection result from the branch pronunciation skill detection results of the branch detection networks.

Each branch detection network corresponds to one pronunciation skill and is configured to detect whether the speaker spoke the text to be detected using its corresponding pronunciation skill. Correspondingly, there are as many branch pronunciation skill detection results as there are branch detection networks.

For example, taking English as an example, when the pronunciation skills to be detected include linking, loss of plosion, and voicing, L takes the value 3; that is, the second pronunciation skill detection network includes three branch detection networks, one corresponding to "linking", one to "loss of plosion", and one to "voicing". Correspondingly, each of the three branch detection networks outputs one branch pronunciation skill detection result, and the three results together form the second detection result. The second detection result thus indicates whether the speaker spoke the text to be detected using a pronunciation skill and, if so, which pronunciation skill was used.

It should be noted that all branch detection networks share the same structure; one branch detection network is described below as an example. Referring to FIG. 12, a branch detection network includes a second fully connected layer and a second classification function layer, and inputting the fusion feature matrix into each branch detection network for pronunciation skill detection to obtain the branch pronunciation skill detection result of each branch detection network includes:

inputting the fusion feature matrix into the second fully connected layer for fully connected processing to obtain a second fully connected result; and

inputting the second fully connected result into the second classification function layer for classification processing to obtain the branch pronunciation skill detection result.

It should be noted that the second classification function layer may adopt any binary classification function.

Taking the sigmoid function as an example, its output value lies in [0,1]. After model training, the output of the sigmoid function can represent the probability that the speaker spoke the text to be detected using the pronunciation skill corresponding to the branch detection network in which it resides. For example, when the output value of the sigmoid function reaches a preset threshold (an empirical value that may be chosen by those skilled in the art according to actual needs), it can be determined that the speaker spoke the text to be detected using the pronunciation skill corresponding to that branch detection network.

Correspondingly, the second fully connected result is input into the sigmoid function to obtain its output value, which is taken as the branch pronunciation skill detection result. From this result, it can be determined whether the speaker spoke the text to be detected using the pronunciation skill corresponding to the branch detection network.
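
A hedged sketch of one branch detection network follows; the 0.5 threshold and the mean pooling are illustrative assumptions, since the disclosure only requires an empirical threshold.

```python
# A hedged sketch of one branch detection network: a fully connected layer
# followed by sigmoid, giving the probability that this branch's skill was used.
import torch
import torch.nn as nn

class BranchHead(nn.Module):
    def __init__(self, dim: int = 256, threshold: float = 0.5):
        super().__init__()
        self.fc = nn.Linear(dim, 1)
        self.threshold = threshold  # empirical value; 0.5 is illustrative

    def forward(self, fused_feats: torch.Tensor):
        prob = torch.sigmoid(self.fc(fused_feats.mean(dim=1)))  # P(skill used)
        return prob, prob >= self.threshold  # branch detection result
```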

In an optional embodiment, before acquiring the text to be detected and converting it into the corresponding phoneme sequence, the method further includes:

acquiring multiple classes of first sample texts known to require different pronunciation skills to speak, and converting each class of first sample texts into corresponding positive sample phoneme sequences;

acquiring first sample audio of sample users speaking each class of first sample texts with the different pronunciation skills, and extracting positive sample acoustic features from the first sample audio of each class of first sample texts;

acquiring second sample texts known to require no pronunciation skill to speak, and converting the second sample texts into corresponding negative sample phoneme sequences;

acquiring second sample audio of sample users speaking the second sample texts, and extracting negative sample acoustic features from the second sample audio; and

performing model training based on each class of positive sample phoneme sequences, each class of positive sample acoustic features, the negative sample phoneme sequences, and the negative sample acoustic features to obtain the pronunciation skill detection model.

In this embodiment, acoustic feature samples and phoneme sequence samples are not constructed manually; instead, following a data-driven approach, the model learns different pronunciation skills from a large amount of data. A specific language is taken as an example below.

For that language, the electronic device acquires multiple classes of first sample texts known to require different pronunciation skills to speak. The number of first sample texts acquired for each class of pronunciation skill is not specifically limited here and may be configured by those skilled in the art according to actual needs.

For the first sample texts acquired for each class of pronunciation skill (hereinafter, each class of first sample texts), the electronic device converts each class of first sample texts into corresponding phoneme sequences, denoted as positive sample phoneme sequences. For how to convert the first sample texts into phoneme sequences, refer to the way the text to be detected is converted into a phoneme sequence in the above embodiments, which is not repeated here.

The electronic device also acquires audio of sample users speaking each class of first sample texts with the different pronunciation skills, denoted as first sample audio, and extracts the acoustic features of the first sample audio of each class of first sample texts, denoted as positive sample acoustic features. A sample user may be a real person with the pronunciation skills or a virtual person with the pronunciation skills. Correspondingly, for how to acquire the first sample audio of each class of first sample texts and how to extract the positive sample acoustic features, refer to the way the audio to be detected is acquired and its acoustic features are extracted in the above embodiments, which is not repeated here.

In addition, the electronic device acquires texts known to require no pronunciation skill to speak, denoted as second sample texts, and converts the second sample texts into corresponding phoneme sequences, denoted as negative sample phoneme sequences. For how to convert the second sample texts into phoneme sequences, refer to the way the text to be detected is converted into a phoneme sequence in the above embodiments, which is not repeated here.

The electronic device also acquires audio of sample users speaking the second sample texts, denoted as second sample audio, and extracts the acoustic features of the second sample audio, denoted as negative sample acoustic features. For how to acquire the second sample audio of the second sample texts and how to extract the negative sample acoustic features, refer to the way the audio to be detected is acquired and its acoustic features are extracted in the above embodiments, which is not repeated here.

It should be noted that this embodiment places no specific restriction on the number of sample users, which may be configured by those skilled in the art according to actual needs; for example, in this embodiment 500 sample users are used to obtain the above positive sample phoneme sequences, positive sample acoustic features, negative sample phoneme sequences, and negative sample acoustic features.

After obtaining the above positive sample phoneme sequences, positive sample acoustic features, negative sample phoneme sequences, and negative sample acoustic features, the electronic device performs model training based on each class of positive sample phoneme sequences, each class of positive sample acoustic features, the negative sample phoneme sequences, and the negative sample acoustic features until a preset stop condition is met, obtaining the pronunciation skill detection model. The preset stop condition may be configured as the number of training iterations reaching a preset count, or as the model converging.
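
By way of a non-limiting illustration, the assembly of training pairs described above might look as follows; to_phonemes and extract_features are hypothetical placeholders for the conversion and feature extraction of the above embodiments, and the label scheme is likewise an assumption.

```python
# A hedged sketch of assembling the training pairs described above.
# to_phonemes and extract_features are hypothetical placeholders, not
# APIs defined by this disclosure; the label scheme is an assumption.
def build_training_set(pos_texts_by_skill, pos_audio_by_skill,
                       neg_texts, neg_audio, to_phonemes, extract_features):
    samples = []
    for skill, texts in pos_texts_by_skill.items():
        for text, audio in zip(texts, pos_audio_by_skill[skill]):
            # Positive sample: phoneme sequence + acoustic features + skill label.
            samples.append((to_phonemes(text), extract_features(audio), skill))
    for text, audio in zip(neg_texts, neg_audio):
        # Negative sample: spoken without any pronunciation skill.
        samples.append((to_phonemes(text), extract_features(audio), "no_skill"))
    return samples  # train until convergence or a preset iteration count
```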

Referring to FIG. 13, to better implement the pronunciation skill detection method provided by the present application, the present application further provides a pronunciation skill detection apparatus 400. As shown in FIG. 13, the pronunciation skill detection apparatus 400 includes:

a first acquisition module 410, configured to acquire a text to be detected and convert the text to be detected into a corresponding phoneme sequence;

a second acquisition module 420, configured to acquire audio to be detected obtained from a speaker speaking the text to be detected, and extract acoustic features of the audio to be detected; and

a detection module 430, configured to input the phoneme sequence and the acoustic features into a trained pronunciation skill detection model for pronunciation skill detection processing to obtain a first detection result and a second detection result;

where the first detection result indicates whether a pronunciation skill is needed to speak the text to be detected, and the second detection result indicates whether the speaker spoke the text to be detected using a pronunciation skill.

In an optional embodiment, the pronunciation skill detection model includes a phoneme feature extraction network, an acoustic feature enhancement network, a feature fusion network, a first pronunciation skill detection network, and a second pronunciation skill detection network, and the detection module 430 is configured to:

input the phoneme sequence into the phoneme feature extraction network for feature extraction processing to obtain a phoneme feature matrix;

input the phoneme feature matrix into the first pronunciation skill detection network for pronunciation skill detection processing to obtain the first detection result;

if the first detection result indicates that a pronunciation skill is needed to speak the text to be detected, input the acoustic features into the acoustic feature enhancement network for feature enhancement processing to obtain an enhanced acoustic feature matrix;

input the enhanced acoustic feature matrix and the phoneme feature matrix into the feature fusion network for feature fusion processing to obtain a fusion feature matrix; and

input the fusion feature matrix into the second pronunciation skill detection network for pronunciation skill detection processing to obtain the second detection result.

In an optional embodiment, the phoneme feature extraction network includes a phoneme embedding module and a phoneme feature extraction module, and the detection module 430 is configured to:

input the phoneme sequence into the phoneme embedding module for embedding processing to obtain a phoneme vector matrix; and

input the phoneme vector matrix into the phoneme feature extraction module for feature extraction processing to obtain the phoneme feature matrix.

In an optional embodiment, the phoneme feature extraction module includes at least one phoneme feature extraction sub-module, and the detection module 430 is configured to:

when there is one phoneme feature extraction sub-module, input the phoneme vector matrix into that sub-module for feature extraction processing to obtain the phoneme feature matrix; or,

when there are N phoneme feature extraction sub-modules, input the phoneme vector matrix into the N sub-modules for feature extraction processing in sequence to obtain the phoneme feature matrix, where N is an integer greater than 1.

In an optional embodiment, the phoneme feature extraction sub-module includes a first matrix conversion layer, a first multi-head attention layer, and a first matrix fusion layer, and the detection module 430 is configured to:

input the phoneme vector matrix into the first matrix conversion layer for matrix conversion processing to obtain a first query matrix, a first key matrix, and a first value matrix;

input the first query matrix, the first key matrix, and the first value matrix into the first multi-head attention layer for attention enhancement processing to obtain a first attention enhancement matrix; and

input the first attention enhancement matrix and the phoneme vector matrix into the first matrix fusion layer for matrix fusion processing to obtain the phoneme feature matrix.

In an optional embodiment, the phoneme feature extraction network further includes a first position encoding module and a second matrix fusion layer, and before the phoneme vector matrix is input into the phoneme feature extraction module for feature extraction processing to obtain the phoneme feature matrix, the detection module 430 is further configured to:

input the phoneme vector matrix into the first position encoding module for position encoding processing to obtain a first position encoding matrix; and

input the first position encoding matrix and the phoneme vector matrix into the second matrix fusion layer for matrix fusion processing to obtain a phoneme position fusion matrix;

when inputting the phoneme vector matrix into the phoneme feature extraction module for feature extraction processing to obtain the phoneme feature matrix, the detection module 430 is configured to input the phoneme position fusion matrix into the phoneme feature extraction module for feature extraction processing to obtain the phoneme feature matrix.

In an optional embodiment, the acoustic feature enhancement network includes a feature encoding module and at least one acoustic feature enhancement module, and the detection module 430 is configured to:

input the acoustic features into the feature encoding module for feature encoding processing to obtain an acoustic feature matrix;

when there is one acoustic feature enhancement module, input the acoustic feature matrix into that module for feature enhancement processing to obtain the enhanced acoustic feature matrix; or,

when there are M acoustic feature enhancement modules, input the acoustic feature matrix into the M modules for feature enhancement processing in sequence to obtain the enhanced acoustic feature matrix, where M is an integer greater than 1.

In an optional embodiment, the feature encoding module includes a first convolution layer, a first pooling layer, a second convolution layer, and a second pooling layer, and the detection module 430 is configured to:

input the acoustic features into the first convolution layer for convolution processing to obtain a first convolution result;

input the first convolution result into the first pooling layer for pooling processing to obtain a first pooling result;

input the first pooling result into the second convolution layer for convolution processing to obtain a second convolution result; and

input the second convolution result into the second pooling layer for pooling processing to obtain the acoustic feature matrix.

In an optional embodiment, the acoustic feature enhancement module includes a second matrix conversion layer, a second multi-head attention layer, a third matrix fusion layer, a third convolution layer, a deconvolution layer, and a fourth matrix fusion layer, and the detection module 430 is configured to:

input the acoustic feature matrix into the second matrix conversion layer for matrix conversion processing to obtain a second query matrix, a second key matrix, and a second value matrix;

input the second query matrix, the second key matrix, and the second value matrix into the second multi-head attention layer for attention enhancement processing to obtain a second attention enhancement matrix;

input the second attention enhancement matrix and the acoustic feature matrix into the third matrix fusion layer for matrix fusion processing to obtain an acoustic fusion matrix;

input the acoustic fusion matrix into the third convolution layer for convolution processing to obtain a third convolution result;

input the third convolution result into the deconvolution layer for deconvolution processing to obtain a deconvolution result; and

input the acoustic fusion matrix and the deconvolution result into the fourth matrix fusion layer for matrix fusion processing to obtain the enhanced acoustic feature matrix.

In an optional embodiment, the acoustic feature enhancement network further includes a second position encoding module and a fifth matrix fusion layer, and before the acoustic feature matrix is input into the acoustic feature enhancement module for feature enhancement processing to obtain the enhanced acoustic feature matrix, the detection module 430 is further configured to:

input the acoustic feature matrix into the second position encoding module for position encoding processing to obtain a second position encoding matrix; and

input the second position encoding matrix and the acoustic feature matrix into the fifth matrix fusion layer for matrix fusion processing to obtain an acoustic position fusion matrix;

when inputting the acoustic feature matrix into the acoustic feature enhancement module for feature enhancement processing to obtain the enhanced acoustic feature matrix, the detection module 430 is configured to:

input the acoustic position fusion matrix into the acoustic feature enhancement module for feature enhancement processing to obtain the enhanced acoustic feature matrix.

In an optional embodiment, the feature fusion network includes a third matrix conversion layer, a fourth matrix conversion layer, a third multi-head attention layer, a sixth matrix fusion layer, a feedforward network layer, and a seventh matrix fusion layer, and the detection module 430 is configured to:

input the enhanced acoustic feature matrix into the third matrix conversion layer for matrix conversion processing to obtain a third key matrix and a third value matrix;

input the phoneme feature matrix into the fourth matrix conversion layer for matrix conversion processing to obtain a third query matrix;

input the third query matrix, the third key matrix, and the third value matrix into the third multi-head attention layer for attention enhancement processing to obtain a third attention enhancement matrix;

input the third attention enhancement matrix and the third query matrix into the sixth matrix fusion layer for matrix fusion processing to obtain an acoustic-phoneme fusion matrix;

input the acoustic-phoneme fusion matrix into the feedforward network layer for feedforward computation to obtain a feedforward matrix; and

input the feedforward matrix and the acoustic-phoneme fusion matrix into the seventh matrix fusion layer for matrix fusion processing to obtain the fusion feature matrix.

In an optional embodiment, the first pronunciation skill detection network includes a first fully connected layer and a first classification function layer, and the detection module 430 is configured to:

input the phoneme feature matrix into the first fully connected layer for fully connected processing to obtain a first fully connected result; and

input the first fully connected result into the first classification function layer for classification processing to obtain the first detection result.

In an optional embodiment, the second pronunciation skill detection network includes L branch detection networks, each corresponding to a different pronunciation skill, where L is an integer greater than 1, and the detection module 430 is configured to:

input the fusion feature matrix into each branch detection network for pronunciation skill detection to obtain the branch pronunciation skill detection result of each branch detection network, where the branch pronunciation skill detection result of each branch detection network indicates whether the speaker spoke the text to be detected using the pronunciation skill corresponding to that branch detection network; and

obtain the second detection result from the branch pronunciation skill detection results of the branch detection networks.

In an optional embodiment, the branch detection network includes a second fully connected layer and a second classification function layer, and the detection module 430 is configured to:

input the fusion feature matrix into the second fully connected layer for fully connected processing to obtain a second fully connected result; and

input the second fully connected result into the second classification function layer for classification processing to obtain the branch pronunciation skill detection result.

In an optional embodiment, the pronunciation skill detection apparatus provided by the present application further includes a training module configured to:

acquire multiple classes of first sample texts known to require different pronunciation skills to speak, and convert each class of first sample texts into corresponding positive sample phoneme sequences;

acquire first sample audio of sample users speaking each class of first sample texts with the different pronunciation skills, and extract positive sample acoustic features from the first sample audio of each class of first sample texts;

acquire second sample texts known to require no pronunciation skill to speak, and convert the second sample texts into corresponding negative sample phoneme sequences;

acquire second sample audio of sample users speaking the second sample texts, and extract negative sample acoustic features from the second sample audio; and

perform model training based on each class of positive sample phoneme sequences, each class of positive sample acoustic features, the negative sample phoneme sequences, and the negative sample acoustic features to obtain the pronunciation skill detection model.

In an optional embodiment, the second acquisition module 420 is configured to:

extract Filterbank features, fundamental frequency features, and energy features of the audio to be detected; and

fuse the Filterbank features, fundamental frequency features, and energy features to obtain the acoustic features.

In an optional embodiment, the first acquisition module 410 is configured to:

remove unpronounced text units from the text to be detected to obtain a new text to be detected; and

convert each text unit in the new text to be detected into a corresponding phoneme unit to obtain the phoneme sequence.

It should be noted that the pronunciation skill detection apparatus 400 provided by the embodiments of the present application and the pronunciation skill detection method in the above embodiments belong to the same concept; for its specific implementation, refer to the above related embodiments, which is not repeated here.

An embodiment of the present application further provides an electronic device, including a memory and a processor, where the processor executes the steps in the pronunciation skill detection method provided by this embodiment by invoking a computer program stored in the memory.

Referring to FIG. 14, FIG. 14 is a schematic structural diagram of an electronic device 100 provided by an embodiment of the present application.

The electronic device 100 may include components such as a network interface 110, a memory 120, a processor 130, and a screen assembly. Those skilled in the art will understand that the structure of the electronic device 100 shown in FIG. 14 does not constitute a limitation on the electronic device 100, which may include more or fewer components than shown, combine certain components, or arrange the components differently.

The network interface 110 may be used to establish network connections between devices.

The memory 120 may be used to store computer programs and data. The computer programs stored in the memory 120 contain executable code and may be divided into various functional modules. The processor 130 executes various functional applications and data processing by running the computer programs stored in the memory 120.

The processor 130 is the control center of the electronic device 100. It connects the various parts of the entire electronic device 100 through various interfaces and lines, and performs the various functions of the electronic device 100 and processes data by running or executing the computer programs stored in the memory 120 and invoking the data stored in the memory 120, thereby controlling the electronic device 100 as a whole.

In this embodiment of the present application, the processor 130 in the electronic device 100 loads executable code corresponding to one or more computer programs into the memory 120 according to the following instructions, and the processor 130 executes the steps in the pronunciation skill detection method provided by the present application, for example:

acquiring a text to be detected, and converting the text to be detected into a corresponding phoneme sequence;

acquiring the audio to be detected obtained by a speaker speaking the text to be detected, and extracting acoustic features of the audio to be detected;

inputting the phoneme sequence and the acoustic features into a trained pronunciation skill detection model for pronunciation skill detection processing to obtain a first detection result and a second detection result;

wherein the first detection result indicates whether the text to be detected needs to be spoken using a pronunciation skill, and the second detection result indicates whether the speaker spoke the text to be detected using a pronunciation skill.
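Putting the pieces together, a hypothetical end-to-end sketch of this flow, reusing the helper functions sketched above; the model interface and tensor shapes are assumptions:

```python
import torch

def detect_pronunciation_skill(model, text, wav_path, phoneme_to_id):
    phonemes = text_to_phonemes(text)
    feats = extract_acoustic_features(wav_path)
    phoneme_ids = torch.tensor([[phoneme_to_id.get(p, 0) for p in phonemes]])
    acoustics = torch.tensor(feats, dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        first, second = model(phoneme_ids, acoustics)
    # first: probability that the text should be spoken with a pronunciation skill
    # second: per-skill probabilities that the speaker actually used the skill
    return first.squeeze(0), second.squeeze(0)
```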

It should be noted that the electronic device 100 provided by the embodiments of the present application is based on the same concept as the pronunciation skill detection method in the above embodiments; its specific implementation process is detailed in the related embodiments above and is not repeated here.

The present application further provides a computer-readable storage medium storing a computer program. When the stored computer program is executed on the processor of the electronic device provided by the embodiments of the present application, the processor executes the steps in any of the above pronunciation skill detection methods suitable for the electronic device. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

The pronunciation skill detection method, apparatus, storage medium, and electronic device provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method of the present application and its core idea. Meanwhile, those skilled in the art may, following the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (20)

1. A pronunciation skill detection method, comprising:
acquiring a text to be detected, and converting the text to be detected into a corresponding phoneme sequence;
acquiring audio to be detected obtained by a speaker speaking the text to be detected, and extracting acoustic features of the audio to be detected;
inputting the phoneme sequence and the acoustic features into a trained pronunciation skill detection model to perform pronunciation skill detection processing to obtain a first detection result and a second detection result;
wherein the first detection result is used for representing whether the text to be detected needs to be spoken using a pronunciation skill, and the second detection result is used for representing whether the speaker spoke the text to be detected using a pronunciation skill.
2. The pronunciation skill detection method according to claim 1, wherein the pronunciation skill detection model comprises a phoneme feature extraction network, an acoustic feature enhancement network, a feature fusion network, a first pronunciation skill detection network and a second pronunciation skill detection network, and the inputting the phoneme sequence and the acoustic features into the trained pronunciation skill detection model for pronunciation skill detection processing to obtain a first detection result and a second detection result comprises:
inputting the phoneme sequence into the phoneme feature extraction network for feature extraction processing to obtain a phoneme feature matrix;
inputting the phoneme feature matrix into the first pronunciation skill detection network to perform pronunciation skill detection processing to obtain the first detection result;
inputting the acoustic features into the acoustic feature enhancement network for feature enhancement processing to obtain an enhanced acoustic feature matrix;
inputting the enhanced acoustic feature matrix and the phoneme feature matrix into the feature fusion network for feature fusion processing to obtain a fusion feature matrix;
and inputting the fusion feature matrix into the second pronunciation skill detection network to perform pronunciation skill detection processing to obtain the second detection result.
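For illustration only, a minimal PyTorch sketch of one way the five sub-networks of claim 2 could be wired together; every layer size is an assumption, sequence mean-pooling before the detection heads is an added detail, and the sub-network classes refer to the sketches given after the later claims:

```python
import torch.nn as nn

class PronunciationSkillDetector(nn.Module):
    def __init__(self, n_phonemes, in_dim=82, d_model=256, n_skills=4):
        super().__init__()
        self.phoneme_net = PhonemeFeatureExtractor(n_phonemes, d_model)
        self.encoder = FeatureEncoder(in_dim, d_model)    # feature coding (claim 8)
        self.enhancer = AcousticFeatureEnhancer(d_model)  # enhancement (claim 9)
        self.fusion_net = FeatureFusion(d_model)          # cross-attention fusion (claim 11)
        self.first_head = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())
        self.second_head = BranchDetectionHeads(d_model, n_skills)

    def forward(self, phonemes, acoustics):
        p = self.phoneme_net(phonemes)              # phoneme feature matrix
        first = self.first_head(p.mean(dim=1))      # first detection result
        a = self.enhancer(self.encoder(acoustics))  # enhanced acoustic feature matrix
        f = self.fusion_net(p, a)                   # fusion feature matrix
        second = self.second_head(f.mean(dim=1))    # second detection result
        return first, second
```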
3. The pronunciation skill detection method as claimed in claim 2, wherein the phoneme feature extraction network comprises a phoneme embedding module and a phoneme feature extraction module, and the inputting the phoneme sequence into the phoneme feature extraction network for feature extraction to obtain a phoneme feature matrix comprises:
inputting the phoneme sequence into the phoneme embedding module for embedding to obtain a phoneme vector matrix;
and inputting the phoneme vector matrix into the phoneme feature extraction module for feature extraction processing to obtain the phoneme feature matrix.
4. The pronunciation skill detection method as claimed in claim 3, wherein the phoneme feature extraction module comprises at least one phoneme feature extraction submodule, and the inputting the phoneme vector matrix into the phoneme feature extraction module for feature extraction to obtain the phoneme feature matrix comprises:
when the number of the phoneme feature extraction submodules is 1, inputting the phoneme vector matrix into the phoneme feature extraction submodule for feature extraction processing to obtain the phoneme feature matrix; or,
and when the number of the phoneme feature extraction submodules is N, inputting the phoneme vector matrix into the N phoneme feature extraction submodules to sequentially perform feature extraction processing to obtain the phoneme feature matrix, wherein N is an integer greater than 1.
5. The pronunciation skill detection method as claimed in claim 4, wherein the phoneme feature extraction submodule comprises a first matrix conversion layer, a first multi-head attention layer and a first matrix fusion layer, and the inputting the phoneme vector matrix into the phoneme feature extraction submodule for feature extraction to obtain the phoneme feature matrix comprises:
inputting the phoneme vector matrix into the first matrix conversion layer for matrix conversion processing to obtain a first query matrix, a first key matrix and a first value matrix;
inputting the first query matrix, the first key matrix and the first value matrix into the first multi-head attention layer for attention enhancement processing to obtain a first attention enhancement matrix;
and inputting the first attention enhancing matrix and the phoneme vector matrix into the first matrix fusion layer for matrix fusion processing to obtain the phoneme feature matrix.
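A possible reading of the submodule of claims 3 and 5, assuming the matrix conversion layer is the usual query/key/value projection (applied internally by nn.MultiheadAttention) and the matrix fusion layer is a residual connection with layer normalization; sizes are assumptions:

```python
import torch.nn as nn

class PhonemeFeatureExtractor(nn.Module):
    def __init__(self, n_phonemes, d_model=256, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, d_model)  # phoneme embedding module
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, phoneme_ids):
        x = self.embed(phoneme_ids)                     # phoneme vector matrix
        # Q, K and V all come from the phoneme vector matrix (self-attention);
        # the Q/K/V conversion projections live inside nn.MultiheadAttention.
        attn_out, _ = self.attn(x, x, x)                # first attention enhancement
        return self.norm(x + attn_out)                  # first matrix fusion (residual)
```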
6. The pronunciation skill detection method as claimed in claim 3, wherein the phoneme feature extraction network further comprises a first position coding module and a second matrix fusion layer, and before inputting the phoneme vector matrix into the phoneme feature extraction module for feature extraction to obtain the phoneme feature matrix, the method further comprises:
inputting the phoneme vector matrix into the first position coding module for position coding processing to obtain a first position coding matrix;
inputting the first position coding matrix and the phoneme vector matrix into the second matrix fusion layer for matrix fusion processing to obtain a phoneme position fusion matrix;
the inputting the phoneme vector matrix into the phoneme feature extraction module for feature extraction processing to obtain the phoneme feature matrix includes:
and inputting the phoneme position fusion matrix into the phoneme feature extraction module for feature extraction processing to obtain the phoneme feature matrix.
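One conventional choice for the position coding module of claim 6 is a fixed sinusoidal encoding, fused with the phoneme vector matrix by element-wise addition; the claim itself does not fix either choice, so both are assumptions here:

```python
import math
import torch

def positional_encoding(seq_len, d_model):
    # Standard sinusoidal position coding matrix, shape (seq_len, d_model)
    pos = torch.arange(seq_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# Matrix fusion by element-wise addition (one possible fusion):
# x = x + positional_encoding(x.size(1), x.size(2))
```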
7. The pronunciation skill detection method according to claim 2, wherein the acoustic feature enhancement network comprises a feature coding module and at least one acoustic feature enhancement module, and the inputting the acoustic features into the acoustic feature enhancement network for feature enhancement processing to obtain an enhanced acoustic feature matrix comprises:
inputting the acoustic features into the feature coding module for feature coding processing to obtain an acoustic feature matrix;
when the number of the acoustic feature enhancement modules is 1, inputting the acoustic feature matrix into the acoustic feature enhancement module for feature enhancement processing to obtain the enhanced acoustic feature matrix; or,
and when the number of the acoustic feature enhancement modules is M, inputting the acoustic feature matrix into the M acoustic feature enhancement modules to sequentially perform feature enhancement processing to obtain the enhanced acoustic feature matrix, wherein M is an integer greater than 1.
8. The pronunciation skill detection method as claimed in claim 7, wherein the feature encoding module comprises a first convolution layer, a first pooling layer, a second convolution layer and a second pooling layer, and the inputting the acoustic features into the feature encoding module for feature encoding to obtain an acoustic feature matrix comprises:
inputting the acoustic features into the first convolution layer to carry out convolution processing to obtain a first convolution result;
inputting the first convolution result into the first pooling layer for pooling processing to obtain a first pooling result;
inputting the first pooling result into the second convolution layer for convolution processing to obtain a second convolution result;
and inputting the second convolution result into the second pooling layer for pooling processing to obtain the acoustic feature matrix.
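A sketch of the feature coding module of claim 8 as two convolution-plus-pooling stages over the acoustic feature sequence; channel counts and kernel sizes are illustrative (the input width 82 matches the 80 filterbank + fundamental frequency + energy sketch given earlier):

```python
import torch.nn as nn

class FeatureEncoder(nn.Module):
    def __init__(self, in_dim=82, d_model=256):
        super().__init__()
        self.conv1 = nn.Conv1d(in_dim, d_model, kernel_size=3, padding=1)
        self.pool1 = nn.MaxPool1d(kernel_size=2)
        self.conv2 = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.pool2 = nn.MaxPool1d(kernel_size=2)

    def forward(self, acoustics):                 # (B, T, in_dim)
        x = acoustics.transpose(1, 2)             # convolve along the time axis
        x = self.pool1(self.conv1(x).relu())      # first convolution + pooling
        x = self.pool2(self.conv2(x).relu())      # second convolution + pooling
        return x.transpose(1, 2)                  # acoustic feature matrix (B, T/4, d_model)
```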
9. The pronunciation skill detection method according to claim 7, wherein the acoustic feature enhancement module comprises a second matrix conversion layer, a second multi-head attention layer, a third matrix fusion layer, a third convolution layer, a deconvolution layer and a fourth matrix fusion layer, and the inputting the acoustic feature matrix into the acoustic feature enhancement module for feature enhancement processing to obtain the enhanced acoustic feature matrix comprises:
inputting the acoustic feature matrix into the second matrix conversion layer for matrix conversion processing to obtain a second query matrix, a second key matrix and a second value matrix;
inputting the second query matrix, the second key matrix and the second value matrix into the second multi-head attention layer for attention enhancement processing to obtain a second attention enhancement matrix;
inputting the second attention enhancement matrix and the acoustic feature matrix into the third matrix fusion layer for matrix fusion processing to obtain an acoustic fusion matrix;
inputting the acoustic fusion matrix into the third convolution layer for convolution processing to obtain a third convolution result;
inputting the third convolution result into the deconvolution layer for deconvolution processing to obtain a deconvolution result;
and inputting the acoustic fusion matrix and the deconvolution result into the fourth matrix fusion layer for matrix fusion processing to obtain the enhanced acoustic feature matrix.
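A sketch of one acoustic feature enhancement module per claim 9: self-attention with a residual fusion, followed by a convolution/deconvolution pair whose output is fused back in; strides, sizes and the layer-norm fusions are assumptions:

```python
import torch.nn as nn

class AcousticFeatureEnhancer(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1)
        self.deconv = nn.ConvTranspose1d(d_model, d_model, kernel_size=3,
                                         stride=2, padding=1, output_padding=1)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                          # (B, T, d_model)
        attn_out, _ = self.attn(x, x, x)           # second attention enhancement
        fused = self.norm1(x + attn_out)           # third matrix fusion (residual)
        y = self.conv(fused.transpose(1, 2))       # third convolution (downsample)
        y = self.deconv(y).transpose(1, 2)         # deconvolution back to ~T frames
        y = y[:, :x.size(1)]                       # trim in case T was odd
        return self.norm2(fused + y)               # fourth matrix fusion
```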
10. The pronunciation skill detection method according to claim 7, wherein the acoustic feature enhancement network further comprises a second position coding module and a fifth matrix fusion layer, and before inputting the acoustic feature matrix into the acoustic feature enhancement module for feature enhancement processing to obtain the enhanced acoustic feature matrix, the method further comprises:
inputting the acoustic feature matrix into the second position coding module for position coding processing to obtain a second position coding matrix;
inputting the second position coding matrix and the acoustic feature matrix into the fifth matrix fusion layer for matrix fusion processing to obtain an acoustic position fusion matrix;
the inputting the acoustic feature matrix into the acoustic feature enhancement module for feature enhancement processing to obtain the enhanced acoustic feature matrix includes:
and inputting the acoustic position fusion matrix into the acoustic feature enhancement module for feature enhancement processing to obtain the enhanced acoustic feature matrix.
11. The pronunciation skill detection method according to claim 2, wherein the feature fusion network comprises a third matrix conversion layer, a fourth matrix conversion layer, a third multi-head attention layer, a sixth matrix fusion layer, a feedforward network layer and a seventh matrix fusion layer, and the inputting the enhanced acoustic feature matrix and the phoneme feature matrix into the feature fusion network for feature fusion processing to obtain a fusion feature matrix comprises:
inputting the enhanced acoustic feature matrix into the third matrix conversion layer for matrix conversion processing to obtain a third key matrix and a third value matrix;
inputting the phoneme feature matrix into the fourth matrix conversion layer for matrix conversion processing to obtain a third query matrix;
inputting the third query matrix, the third key matrix and the third value matrix into the third multi-head attention layer for attention enhancement processing to obtain a third attention enhancement matrix;
inputting the third attention enhancing matrix and the third query matrix into the sixth matrix fusion layer for matrix fusion processing to obtain an acoustic phoneme fusion matrix;
inputting the acoustic phoneme fusion matrix into the feedforward network layer to perform feedforward calculation processing to obtain a feedforward matrix;
and inputting the feedforward matrix and the acoustic phoneme fusion matrix into the seventh matrix fusion layer for matrix fusion processing to obtain the fusion feature matrix.
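A sketch of the feature fusion network of claim 11, with the phoneme feature matrix supplying the query and the enhanced acoustic feature matrix supplying the key and value, followed by a residual feedforward stage; treating each fusion layer as a residual add with layer normalization is an assumption:

```python
import torch.nn as nn

class FeatureFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, phoneme_feats, acoustic_feats):
        # query from the phonemes; key and value from the enhanced acoustics
        attn_out, _ = self.cross_attn(phoneme_feats, acoustic_feats, acoustic_feats)
        fused = self.norm1(phoneme_feats + attn_out)   # sixth matrix fusion
        return self.norm2(fused + self.ffn(fused))     # seventh matrix fusion
```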
12. The pronunciation skill detection method according to claim 2, wherein the first pronunciation skill detection network comprises a first fully connected layer and a first classification function layer, and the inputting the phoneme feature matrix into the first pronunciation skill detection network for pronunciation skill detection processing to obtain the first detection result comprises:
inputting the phoneme feature matrix into the first full-connection layer to perform full-connection processing to obtain a first full-connection result;
and inputting the first full-connection result into the first classification function layer for classification processing to obtain the first detection result.
13. The pronunciation skill detection method according to claim 2, wherein the second pronunciation skill detection network comprises L branch detection networks, each branch detection network corresponding to a different pronunciation skill, L being an integer greater than 1, and the inputting the fused feature matrix into the second pronunciation skill detection network for pronunciation skill detection processing to obtain the second detection result comprises:
inputting the fusion feature matrix into each branch detection network to perform pronunciation skill detection to obtain a branch pronunciation skill detection result of each branch detection network, wherein the branch pronunciation skill detection result of each branch detection network represents whether the speaker speaks the text to be detected using the pronunciation skill corresponding to that branch detection network;
and obtaining the second detection result according to the branch pronunciation skill detection result of each branch detection network.
14. The pronunciation skill detection method according to claim 13, wherein the branch detection network comprises a second fully connected layer and a second classification function layer, and the inputting the fused feature matrix into each branch detection network for pronunciation skill detection to obtain the branch pronunciation skill detection result of each branch detection network comprises:
inputting the fusion feature matrix into the second full-connection layer to perform full-connection processing to obtain a second full-connection result;
and inputting the second full-connection result into the second classification function layer for classification processing to obtain the branch pronunciation skill detection result.
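A sketch of the second pronunciation skill detection network of claims 13 and 14 as L independent branches (n_skills below), each a fully connected layer plus a sigmoid classification function; the pooled input vector is an assumption carried over from the earlier wiring sketch:

```python
import torch
import torch.nn as nn

class BranchDetectionHeads(nn.Module):
    def __init__(self, d_model=256, n_skills=4):
        super().__init__()
        # One branch per pronunciation skill: full connection + classification
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())
            for _ in range(n_skills))

    def forward(self, fused):                      # (B, d_model)
        # Each branch reports whether its pronunciation skill was used
        return torch.cat([branch(fused) for branch in self.branches], dim=1)
```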
15. The pronunciation skill detection method as claimed in claim 13, wherein before the acquiring the text to be detected and converting the text to be detected into the corresponding phoneme sequence, the method further comprises:
obtaining multiple types of first sample texts which are known to need to be spoken using different pronunciation skills, and converting each type of first sample text into a corresponding positive sample phoneme sequence;
acquiring first sample audio of each type of first sample text spoken by a sample user by adopting different pronunciation skills, and extracting positive sample acoustic features of the first sample audio of each type of first sample text;
obtaining a second sample text which is known not to need to be spoken using a pronunciation skill, and converting the second sample text into a corresponding negative sample phoneme sequence;
acquiring a second sample audio of the second sample text spoken by the sample user, and extracting negative sample acoustic features of the second sample audio;
and performing model training according to each type of the positive sample phoneme sequence, each type of the positive sample acoustic features, the negative sample phoneme sequence and the negative sample acoustic features to obtain the pronunciation skill detection model.
16. The pronunciation skill detection method according to any one of claims 1-15, wherein the extracting the acoustic features of the audio to be detected comprises:
extracting Filterbank features, fundamental frequency features and energy features of the audio to be detected;
and fusing the Filterbank feature, the fundamental frequency feature and the energy feature to obtain the acoustic feature.
17. The pronunciation skill detection method according to any one of claims 1-15, wherein the converting the text to be detected into a corresponding phoneme sequence comprises:
removing unvoiced text units in the text to be detected to obtain a new text to be detected;
and converting each text unit in the new text to be detected into a corresponding phoneme unit to obtain the phoneme sequence.
18. A pronunciation skill detection apparatus, comprising:
a first acquisition module, used for acquiring a text to be detected and converting the text to be detected into a corresponding phoneme sequence;
a second acquisition module, used for acquiring the audio to be detected obtained by a speaker speaking the text to be detected, and extracting acoustic features of the audio to be detected;
a detection module, used for inputting the phoneme sequence and the acoustic features into a trained pronunciation skill detection model to carry out pronunciation skill detection processing so as to obtain a first detection result and a second detection result;
wherein the first detection result is used for representing whether the text to be detected needs to be spoken using a pronunciation skill, and the second detection result is used for representing whether the speaker spoke the text to be detected using a pronunciation skill.
19. A storage medium having a computer program stored thereon, wherein the computer program, when loaded by a processor, performs the steps of the pronunciation skill detection method as claimed in any one of claims 1-17.
20. An electronic device comprising a processor and a memory, the memory storing a computer program, wherein the processor is configured to perform the steps of the pronunciation skill detection method as claimed in any one of claims 1-17 by loading the computer program.

Priority Applications (1)

Application Number: CN202111620731.2A
Priority Date: 2021-12-28
Filing Date: 2021-12-28
Title: Pronunciation skill detection method, device, storage medium and electronic device
Legal Status: Pending


Publications (1)

Publication Number: CN114170997A
Publication Date: 2022-03-11

Family ID: 80488185





Patent Citations (6)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
US20090171661A1 * | 2007-12-28 | 2009-07-02 | International Business Machines Corporation | Method for assessing pronunciation abilities
CN111785256A * | 2020-06-28 | 2020-10-16 | 北京三快在线科技有限公司 | Acoustic model training method and device, electronic equipment and storage medium
CN111968618A * | 2020-08-27 | 2020-11-20 | 腾讯科技(深圳)有限公司 | Speech synthesis method and device
CN112349300A * | 2020-11-06 | 2021-02-09 | 北京乐学帮网络技术有限公司 | Voice evaluation method and device
CN113066510A * | 2021-04-26 | 2021-07-02 | 中国科学院声学研究所 | Vowel weak reading detection method and device
CN113345467A * | 2021-05-19 | 2021-09-03 | 苏州奇梦者网络科技有限公司 | Method, device, medium and equipment for evaluating spoken language pronunciation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party

W. Ying: "English Pronunciation Recognition and Detection Based on HMM-DNN", 2019 11th International Conference on Measuring Technology and Mechatronics Automation (ICMTMA), 7 October 2019, pages 648-652 *
姜贝妮 (Jiang Beini): "Word stress detection in English teaching systems" (英语教学系统中的词重音检测), Journal of Tsinghua University (Science and Technology), vol. 48, no. 10, 15 October 2008, pages 1636-1639 *

Cited By (2)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN111418006A * | 2017-11-29 | 2020-07-14 | 雅马哈株式会社 | Speech synthesis method, speech synthesis device, and program
CN111418006B * | 2017-11-29 | 2023-09-12 | 雅马哈株式会社 | Speech synthesis method, speech synthesis device, and recording medium


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination