WO2019075828A1 - Voice evaluation method and apparatus - Google Patents

Voice evaluation method and apparatus

Info

Publication number
WO2019075828A1
WO2019075828A1, PCT/CN2017/111822, CN2017111822W
Authority
WO
WIPO (PCT)
Prior art keywords
speech
voice
sequence
temperament
user
Prior art date
Application number
PCT/CN2017/111822
Other languages
French (fr)
Chinese (zh)
Inventor
卢炀
宾晓皎
李明
蔡泽鑫
Original Assignee
深圳市鹰硕音频科技有限公司
Application filed by 深圳市鹰硕音频科技有限公司 filed Critical 深圳市鹰硕音频科技有限公司
Publication of WO2019075828A1 publication Critical patent/WO2019075828A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • GPHYSICS
    • G09EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09BEDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B5/00Electrically-operated educational appliances
    • G09B5/04Electrically-operated educational appliances with audible presentation of the material to be studied
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems

Definitions

  • the invention relates to the field of multimedia teaching technology, in particular to a voice evaluation method and device for multimedia teaching.
  • CN101197084A discloses an automated spoken-English evaluation and learning system whose spoken-pronunciation detection part comprises the following steps: (1) building a standard-speaker corpus: 1) finding standard English speakers; 2) designing a first recording script according to oral-English learning requirements and the principle of phoneme balance; 3) having the standard speakers record the script; (2) collecting an oral-evaluation corpus: in a simulated English-learning software environment, designing a second recording script according to the learning requirements, finding ordinary speakers, and recording their spoken pronunciation; (3) annotating the oral-evaluation corpus: experts mark in detail whether each phoneme in each word is pronounced correctly; (4) building a standard-speech acoustic model: training an acoustic model of standard speech on the recordings in the standard-speaker corpus and their associated texts; (5) calculating error-detection parameters of the speech.
  • CN101650886A discloses a method for automatically detecting a language learner's reading errors, comprising the following steps: 1) front-end processing: pre-processing the input speech and performing feature extraction, the extracted features being MFCC feature vectors; 2) building a reduced search space: taking the content the user is to read aloud as the reference answer, and building a compact search space from the reference answer, a pronunciation dictionary, a multi-pronunciation model, and an acoustic model; 3) building a reading language model: constructing from the reference answer a language model describing the context the user may utter while reading the reference sentence, together with its probability information; 4) searching: finding, in the search space, the path that best matches the input feature-vector stream according to the acoustic model, the reading language model, and the multi-pronunciation model, and taking it as the user's actual reading content to form a recognition-result sequence; 5) alignment: aligning the reference answer with the recognition result to obtain detection results for insertions, omissions, and misreadings.
  • In the prior art, a speech recognition system is used to acquire the speech segments corresponding to each basic speech unit in a speech signal; the acquired segments are fused to obtain a sequence of valid speech segments corresponding to the signal; evaluation features are extracted from that sequence; a score prediction model corresponding to the feature type of the evaluation features is loaded; the similarity of the evaluation features under the score prediction model is computed; and that similarity is used as the score of the speech signal.
  • However, when a pronunciation prediction model is used to evaluate the user's pronunciation, the predicted standard pronunciation often differs from the teaching speech example in certain aspects (such as pitch and rhythm).
  • The evaluation result is therefore a comparison between the user's speech and the predicted speech, and does not truly reflect a comparison between the user's speech and the teaching speech example.
  • The technical problem to be solved by the present invention is how, during language learning, to simultaneously provide the user with an evaluation result compared against the teaching example speech and an evaluation result compared against the standard speech predicted by the speech prediction model, so as to help the user fully understand his or her own learning situation.
  • the present invention provides a speech evaluation method for evaluating a user's language pronunciation in a language learning process, which is characterized by:
  • Step S101: acquiring the user's voice input through a recording device of the voice evaluation apparatus;
  • Step S102: dividing the recorded voice into basic speech units to obtain the speech-unit sequence of the recorded voice;
  • Step S103: performing feature extraction on the speech-unit sequence to obtain its temperament features;
  • Step S104: comparing the extracted temperament features respectively with the teaching example speech and with the standard speech predicted by the speech prediction model;
  • Step S105: marking the comparison results on the user's speech text.
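The five claimed steps can be sketched as a small pipeline of plain functions. This is an illustrative toy, not the patent's implementation: the token-level "segmentation" and length-based "features" are hypothetical stand-ins for real acoustic-model-based unit division and temperament feature extraction.

```python
# Hypothetical sketch of the S101-S105 pipeline; segmentation and feature
# extraction here are toy stand-ins for acoustic-model-based processing.

def divide_into_units(recording):            # S102: basic speech units
    return recording.split()                 # toy: one "unit" per token

def extract_features(units):                 # S103: temperament features
    return [{"unit": u, "duration": len(u)} for u in units]

def compare(features, reference_features):   # S104: per-unit comparison
    return [
        {"unit": f["unit"], "ok": f["duration"] == r["duration"]}
        for f, r in zip(features, reference_features)
    ]

def annotate(units, results):                # S105: mark results on the text
    return " ".join(u if r["ok"] else f"[{u}]" for u, r in zip(units, results))

user = divide_into_units("gud morning")      # S101: recorded user speech (toy)
teacher = divide_into_units("good morning")
results = compare(extract_features(user), extract_features(teacher))
print(annotate(user, results))               # → [gud] morning
```

In the patented method the same comparison would be run twice, once against the teaching example and once against the model-predicted standard speech.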
  • the basic speech unit may be a syllable, a phoneme or the like, and the basic speech unit and the speech unit sequence of the recorded speech are obtained by dividing the recorded speech.
  • The temperament features include prosody features and syllable features. The prosody features include the boundary features of each basic speech unit, its pronunciation duration, the pause duration between adjacent basic speech units, and the pronunciation duration of the entire speech-unit sequence; the syllable features include the pronunciation of each basic speech unit and the pronunciation of the entire speech-unit sequence.
  • the process of comparative analysis with the teaching example speech includes:
  • the temperament features of the user phonetic unit sequence are compared with the temperament features of the teaching phonetic unit sequence, and the corresponding evaluation results are given.
  • the process of speech evaluation using a speech prediction model includes:
  • the temperament characteristics of the user's voice are compared with the temperament characteristics of the standard pronunciation, and the corresponding evaluation results are obtained.
  • The evaluation result of the comparison with the teaching example speech and the evaluation result of the comparison with the standard speech predicted by the speech prediction model are each visually marked on the speech text and displayed to the user.
  • The present invention also provides a voice evaluation device, which includes a recording module, a storage module, a voice processing module, a feature extraction module, a voice analysis module, an evaluation module, an annotation module, and a display module, characterized in that:
  • the recording module acquires the user's voice input;
  • the voice processing module divides the recorded voice into basic speech units to obtain the speech-unit sequence of the recorded voice;
  • the feature extraction module performs feature extraction on the speech-unit sequence to acquire its temperament features;
  • the speech analysis module compares the extracted temperament features respectively with the teaching example speech and with the standard speech predicted by the speech prediction model;
  • the annotation module marks the speech evaluation results on the user's speech text.
  • the voice evaluation device further includes a display module for displaying the user voice text with the voice evaluation result annotation to the user.
  • The speech evaluation method and apparatus of the present invention provide the user both with the evaluation result of comparing the user's speech against the teaching example speech and with the evaluation result of comparing it against the standard speech predicted by the speech prediction model, so that the user fully understands his or her pronunciation status and improves pronunciation accuracy.
  • FIG. 1 is a flowchart of a voice evaluation method according to an embodiment of the present invention.
  • FIG. 2 is a structural diagram of a voice evaluation apparatus according to an embodiment of the present invention.
  • A "speech evaluation device", as used herein, is a computer device: an intelligent electronic device that performs predetermined processing such as numerical and/or logical computation by running predetermined programs or instructions. It may include a processor and a memory, the processor executing program instructions pre-stored in the memory to carry out the predetermined processing; alternatively, the processing may be carried out by hardware such as an ASIC, FPGA, or DSP, or by a combination of the two.
  • the computer device includes a user device and/or a network device.
  • the user equipment includes, but is not limited to, a computer, a smart phone, a PDA, etc.
  • The network device includes, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud of servers based on cloud computing, where cloud computing is a type of distributed computing: a virtual supercomputer composed of a group of loosely coupled computers.
  • the computer device can be operated separately to implement the present invention, and can also access the network and implement the present invention by interacting with other computer devices in the network.
  • the network in which the computer device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a VPN network, and the like.
  • The "speech evaluation device" described in the present invention may be only a user equipment, that is, the user equipment performs the corresponding operations; or it may be composed of a user equipment integrated with a network device or server, that is, the user equipment and the network device cooperate to perform the corresponding operations.
  • The user equipment, network equipment, and networks described above are merely examples; other existing or future computer devices or networks, where applicable to the present invention, also fall within its scope and are incorporated herein by reference.
  • the present invention can be applied to mobile terminals and non-mobile terminals.
  • The method or apparatus according to the present invention can be used to provide and present speech evaluation results.
  • Fig. 1 shows a flow chart of a speech evaluation method of the present invention.
  • In step S101, the user's voice input is captured through the recording device of the voice evaluation apparatus during the speaking/read-aloud step of language learning.
  • the recording device in the voice evaluation device is triggered to enter the recording state.
  • The recording device starts recording and saves the user's follow-along speech in the storage module of the voice evaluation device for further analysis and use.
  • In step S102, the user's follow-along speech recorded in the storage module is acquired and divided into basic speech units, obtaining the speech-unit sequence of the recorded follow-along speech.
  • the basic speech unit may be a syllable, a phoneme or the like, and the basic speech unit and the speech unit sequence of the recorded speech are obtained by dividing the recorded speech.
  • To perform the division, the speech signal may be decoded using acoustic models based on MFCC (Mel-Frequency Cepstral Coefficient) or PLP (Perceptual Linear Prediction) features, with different acoustic model types such as HMM-GMM (Hidden Markov Model with Gaussian Mixture Model) or neural-network acoustic models based on DBN (Dynamic Bayesian Network), and with different decoding methods such as Viterbi search or A* search.
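Viterbi search, one of the decoding options named above, can be sketched for a toy two-state HMM; all states and probabilities here are invented for illustration and are far simpler than a real acoustic model.

```python
# Minimal Viterbi decoding sketch: finds the most likely state sequence
# for an observation sequence under toy HMM parameters.
def viterbi(obs, states, start_p, trans_p, emit_p):
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        V.append({})
        for s in states:
            prob, path = max(
                (V[-2][p][0] * trans_p[p][s] * emit_p[s][o], V[-2][p][1] + [s])
                for p in states
            )
            V[-1][s] = (prob, path)
    return max(V[-1].values())[1]   # best path in the final column

states = ("sil", "speech")
start_p = {"sil": 0.7, "speech": 0.3}
trans_p = {"sil": {"sil": 0.6, "speech": 0.4},
           "speech": {"sil": 0.2, "speech": 0.8}}
emit_p = {"sil": {"low": 0.8, "high": 0.2},
          "speech": {"low": 0.3, "high": 0.7}}
print(viterbi(["low", "high", "high"], states, start_p, trans_p, emit_p))
# → ['sil', 'speech', 'speech']
```

A real decoder would work in log probabilities over phoneme-level HMM states, but the dynamic-programming recurrence is the same.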
  • In step S103, feature extraction is performed on the speech-unit sequence to obtain its temperament features.
  • The temperament features include prosody features and syllable features. The prosody features include the boundary features of each basic speech unit, its pronunciation duration, the pause duration between adjacent basic speech units, and the pronunciation duration of the entire speech-unit sequence; the syllable features include the pronunciation of each basic speech unit and the pronunciation of the entire speech-unit sequence.
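Assuming unit boundaries are available as (unit, start, end) times in seconds (an assumed alignment format; the patent does not specify one), the duration and pause components of the prosody features described above could be derived like this:

```python
# Derive per-unit durations, inter-unit pauses, and total duration from a
# hypothetical (unit, start_s, end_s) alignment.
def prosody_features(alignment):
    durations = [(u, round(end - start, 3)) for u, start, end in alignment]
    pauses = [
        round(alignment[i + 1][1] - alignment[i][2], 3)
        for i in range(len(alignment) - 1)
    ]
    total = round(alignment[-1][2] - alignment[0][1], 3)
    return {"durations": durations, "pauses": pauses, "total": total}

alignment = [("good", 0.10, 0.42), ("mor", 0.55, 0.80), ("ning", 0.80, 1.10)]
feats = prosody_features(alignment)
print(feats["durations"])  # → [('good', 0.32), ('mor', 0.25), ('ning', 0.3)]
print(feats["pauses"])     # → [0.13, 0.0]
print(feats["total"])      # → 1.0
```

The boundary and syllable-pronunciation features would come from the acoustic model itself and are not shown here.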
  • In step S104, the extracted temperament features are compared and analyzed respectively with the teaching example speech and with the standard speech predicted by the speech prediction model.
  • The comparison with the teaching example speech proceeds as follows: the teaching example speech saved in the system is acquired and divided into basic speech units, yielding the basic speech units and speech-unit sequence of the teaching example speech, from which the temperament features of the teaching speech-unit sequence, corresponding to those of the user's speech-unit sequence, are then extracted.
  • the temperament features of the user phonetic unit sequence are compared with the temperament features of the teaching phonetic unit sequence, and the corresponding evaluation results are given.
  • The speech evaluation using the speech prediction model may adopt existing speech evaluation technology: the recorded user speech is divided into basic speech units, the temperament features to be evaluated are extracted from the speech-unit sequence, and prediction models corresponding to the different temperament features are loaded.
  • The prediction models predict the corresponding standard pronunciation, and the temperament features of the user's speech are then compared with those of the standard pronunciation to obtain the corresponding evaluation results.
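One hypothetical way to realize the dual comparison of step S104 is to score the user's per-unit features once against the teaching example and once against the model-predicted standard pronunciation. The duration-only features, the tolerance, and the scoring rule are illustrative assumptions, not values from the patent:

```python
# Toy dual comparison: user temperament features vs. the teaching example
# and vs. the model-predicted standard pronunciation.
def score(user, reference, tol=0.05):
    per_unit = {
        u: abs(d - reference[u]) <= tol for u, d in user.items() if u in reference
    }
    overall = sum(per_unit.values()) / len(per_unit)
    return per_unit, overall

user_durations = {"good": 0.32, "mor": 0.40, "ning": 0.30}
teacher_durations = {"good": 0.30, "mor": 0.26, "ning": 0.31}
model_durations = {"good": 0.28, "mor": 0.25, "ning": 0.27}

vs_teacher, t_score = score(user_durations, teacher_durations)
vs_model, m_score = score(user_durations, model_durations)
print(t_score, m_score)  # two separate evaluation results for the same speech
```

The point of the patent is precisely that both results are produced and shown, since the teacher's example and the model's prediction can legitimately differ.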
  • In step S105, the comparison results are marked on the user's speech text and provided to the user.
  • The recorded user speech is further converted into speech text by the speech processing module.
  • The evaluation result of the comparison with the teaching example speech obtained in step S104 and the evaluation result of the comparison with the standard speech predicted by the speech prediction model are each visually marked on the speech text and displayed to the user. Through the displayed evaluation results, the user can see both how his or her pronunciation differs from the teaching example and how it differs from the model-predicted standard speech, and thus fully understand what problems exist in the pronunciation of the read text, helping the user further improve pronunciation accuracy.
  • the comparison result may include a pronunciation evaluation of the basic speech unit, a pronunciation duration evaluation of the basic speech unit, a full-text fluency evaluation, and the like.
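A minimal sketch of marking both evaluation results on the speech text, as in step S105. The per-unit boolean results and the '*'/'^' marker syntax are invented for illustration; a real device would likely use colors or tooltips on the displayed text:

```python
# Mark both evaluation results on the speech text: units flagged by the
# teacher comparison get '*', units flagged by the model comparison get '^'.
def annotate_text(units, vs_teacher, vs_model):
    out = []
    for u in units:
        mark = ""
        if not vs_teacher.get(u, True):
            mark += "*"
        if not vs_model.get(u, True):
            mark += "^"
        out.append(u + mark)
    return " ".join(out)

units = ["good", "mor", "ning"]
vs_teacher = {"good": True, "mor": False, "ning": True}
vs_model = {"good": True, "mor": False, "ning": False}
print(annotate_text(units, vs_teacher, vs_model))  # → good mor*^ ning^
```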
  • FIG. 2 shows a speech evaluation apparatus according to an embodiment of the present invention.
  • The voice evaluation device is used to implement the voice evaluation method of the present invention; after the user performs spoken follow-along reading, it simultaneously provides the user with the evaluation result of the comparison with the teaching example speech and the evaluation result of the comparison with the standard speech predicted by the speech prediction model.
  • the voice evaluation device includes a recording module 1, a storage module 2, a voice processing module 3, a feature extraction module 4, a voice analysis module 5, an annotation module 6, and a display module 7.
  • The recording module 1 of the voice evaluation device captures the user's voice input during the speaking/read-aloud part of language learning.
  • After learning the speech example in the courseware, the user enters the follow-along step and triggers the recording module 1 of the voice evaluation device into the recording state.
  • the recording module 1 starts recording the user voice, and saves the user's follow-up voice in the storage module 2 of the voice evaluation device for further analysis and use.
  • the voice processing module 3 acquires the user-followed voice recorded in the storage module 2, and performs basic voice unit division on the recorded voice.
  • the basic speech unit may be a syllable, a phoneme or the like, and the basic speech unit and the speech unit sequence of the recorded speech are obtained by dividing the recorded speech.
  • the feature extraction module 4 further performs feature extraction on the generated speech unit sequence to obtain the temperament feature of the speech unit sequence.
  • The temperament features include prosody features and syllable features. The prosody features include the boundary features of each basic speech unit, its pronunciation duration, the pause duration between adjacent basic speech units, and the pronunciation duration of the entire speech-unit sequence; the syllable features include the pronunciation of each basic speech unit and the pronunciation of the entire speech-unit sequence.
  • The speech analysis module 5 compares and analyzes the extracted temperament features respectively with the teaching example speech and with the standard speech predicted by the speech prediction model.
  • the process of comparing and analyzing with the teaching example voice is as follows.
  • The voice analysis module 5 obtains the teaching example speech saved in the storage module 2 and divides it into basic speech units, thereby obtaining the basic speech units and the speech-unit sequence of the teaching example speech, and further extracts the temperament features of the teaching speech-unit sequence, which correspond to the temperament features of the user's speech-unit sequence.
  • the temperament features of the user phonetic unit sequence are compared with the temperament features of the teaching phonetic unit sequence, and the corresponding evaluation results are given.
  • The speech evaluation using the speech prediction model may adopt existing speech evaluation technology: the recorded user speech is divided into basic speech units, the temperament features to be evaluated are extracted from the speech-unit sequence, and prediction models corresponding to the different temperament features are loaded.
  • The prediction models predict the corresponding standard pronunciation, and the temperament features of the user's speech are then compared with those of the standard pronunciation to obtain the corresponding evaluation results.
  • The annotation module 6 marks the speech comparison results on the user's speech text and provides them to the user through the display module 7.
  • the recorded user voice is further converted into a voice text by the voice processing module 3.
  • The evaluation result of the comparison with the teaching example speech and the evaluation result of the comparison with the standard speech predicted by the speech prediction model are each visually marked on the speech text and displayed to the user through the display module. Through the displayed evaluation results, the user can see both how his or her pronunciation differs from the teaching example and how it differs from the model-predicted standard speech, and thus fully understand what problems exist in the pronunciation of the read text, helping the user further improve pronunciation accuracy.
  • the comparison result may include a pronunciation evaluation of the basic speech unit, a pronunciation duration evaluation of the basic speech unit, a full-text fluency evaluation, and the like.
  • the computer readable storage medium may include a read only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, and the like.
  • The speech evaluation method and apparatus of the present invention provide the user both with the evaluation result of comparing the user's speech against the teaching example speech and with the evaluation result of comparing it against the standard speech predicted by the speech prediction model, so that the user fully understands his or her pronunciation status and improves pronunciation accuracy.

Abstract

A voice evaluation method for evaluating a user's language pronunciation during language learning comprises the following steps: step S101, acquiring a voice input of the user through a recording device of a voice evaluation apparatus; step S102, dividing the recorded voice into basic voice units to obtain a sequence of voice units of the recorded voice; step S103, performing feature extraction on the sequence of voice units to acquire its temperament characteristics; step S104, comparing the extracted temperament characteristics respectively with a teaching example voice and with a standard voice predicted by a voice prediction model; and step S105, marking the voice comparison results on the user's voice text.

Description

Voice evaluation method and device

Technical Field

The present invention relates to the field of multimedia teaching technology, and in particular to a voice evaluation method and device for multimedia teaching.

Background Art
As a communication tool, language occupies a very important place in life and work; whether during students' school years or during people's working lives, oral language learning is something people attach great importance to. With the continuing spread of online teaching, network-based instruction, unconstrained by time and place, has become popular, and many users now prefer to use their spare time to learn languages over the Internet.

In current online teaching, pronunciation practice is typically handled in one of three ways: after a video (or audio) clip plays a passage, the user is given free time to practice reading it aloud alone; or the user's reading is recorded and played back so that the user can self-assess whether the pronunciation is accurate; or a teacher teaches online and gives guidance and suggestions on the user's pronunciation. These existing approaches either cannot give targeted guidance on the learner's pronunciation, leading to poor learning results, or require online teaching by a teacher, demanding substantial human, material, and financial resources.
To solve the above problems, evaluating the learner's speech with a speech prediction model has been proposed. CN101197084A discloses an automated spoken-English evaluation and learning system whose spoken-pronunciation detection part comprises the following steps: (1) building a standard-speaker corpus: 1) finding standard English speakers; 2) designing a first recording script according to oral-English learning requirements and the principle of phoneme balance; 3) having the standard speakers record the script; (2) collecting an oral-evaluation corpus: in a simulated English-learning software environment, designing a second recording script according to the learning requirements, finding ordinary speakers, and recording their spoken pronunciation; (3) annotating the oral-evaluation corpus: experts mark in detail whether each phoneme in each word is pronounced correctly; (4) building a standard-speech acoustic model: training an acoustic model of standard speech on the recordings in the standard-speaker corpus and their associated texts; (5) calculating error-detection parameters of the speech: 1) extracting the Mel cepstral (MFCC) parameters of the speech; 2) based on the standard acoustic model and the phoneme sequences corresponding to the ordinary speakers' recordings and texts in the evaluation corpus, automatically segmenting the ordinary speakers' speech into phoneme-level segments and computing, under the standard model, a first likelihood of each segment being that phoneme; 3) recognizing each segment of the ordinary speakers' speech with the standard acoustic model and computing a second likelihood of the segment being the recognized phoneme; 4) dividing the first likelihood by the second likelihood to obtain the segment's likelihood ratio, used as the error-detection parameter of that speech segment; (6) building an error-detection mapping model from the error-detection parameters to the experts' pronunciation-error annotations: on a batch of evaluation speech, associating each segment's evaluation parameters and formant sequence with the experts' detailed annotations, obtaining the correspondence between these parameters and the annotations statistically, and saving these relations as the mapping model from error-detection parameters to expert pronunciation-error labels.
CN101650886A discloses a method for automatically detecting a language learner's reading errors, comprising the following steps: 1) front-end processing: pre-processing the input speech and performing feature extraction, the extracted features being MFCC feature vectors; 2) building a reduced search space: taking the content the user is to read aloud as the reference answer, and building a compact search space from the reference answer, a pronunciation dictionary, a multi-pronunciation model, and an acoustic model; 3) building a reading language model: constructing from the reference answer a language model describing the context the user may utter while reading the reference sentence, together with its probability information; 4) searching: finding, in the search space, the path that best matches the input feature-vector stream according to the acoustic model, the reading language model, and the multi-pronunciation model, and taking it as the user's actual reading content to form a recognition-result sequence; 5) alignment: aligning the reference answer with the recognition result to obtain detection results for insertions, omissions, and misreadings.
In the prior art, a speech recognition system acquires the speech segments corresponding to each basic speech unit in a speech signal, fuses the acquired segments into a sequence of valid speech segments, extracts evaluation features from that sequence, loads a score prediction model corresponding to the feature type, computes the similarity of the evaluation features under the model, and uses that similarity as the score of the speech signal. In actual language learning, however, users usually learn pronunciation from the teacher's speech examples in teaching videos (or audio), and because of individual characteristics the teacher's examples rarely coincide exactly with the standard pronunciation predicted by a speech prediction model. When a prediction model is used to evaluate the user's pronunciation, its predicted standard pronunciation therefore often differs from the teaching speech example in certain aspects (such as pitch and rhythm), so the resulting evaluation compares the user's speech with the predicted speech and does not truly reflect a comparison between the user's speech and the teaching speech example.

It is therefore necessary to provide a speech evaluation method that, in addition to the evaluation result produced by the speech prediction model, also gives an evaluation result compared against the teaching speech example, so that the user can fully understand his or her own learning situation.
Summary of the Invention
The technical problem to be solved by the present invention is therefore how, during language learning, to simultaneously provide the user with the evaluation result of the comparison against the teaching example speech and the evaluation result of the comparison against the standard speech predicted by a speech-prediction model, so as to help the user fully understand his or her own learning.

To this end, the present invention provides a speech evaluation method for evaluating a user's pronunciation during language learning, characterized in that it comprises:
Step S101: acquiring the user's speech input through the recording device of a speech evaluation apparatus;

Step S102: dividing the recorded speech into basic speech units to obtain the speech-unit sequence of the recorded speech;

Step S103: performing feature extraction on the speech-unit sequence to obtain the temperament features of the speech-unit sequence;

Step S104: comparing the extracted temperament features with the teaching example speech and with the standard speech predicted by a speech-prediction model, respectively;

Step S105: marking the speech comparison results on the text of the user's speech.
The basic speech units may be syllables, phonemes or the like; by dividing the recorded speech, the basic speech units and the speech-unit sequence of the recorded speech are obtained.

The temperament features comprise prosodic features and syllable features. The prosodic features comprise the boundary features of each basic speech unit, its pronunciation duration, the pause time between adjacent basic speech units, and the pronunciation duration of the whole speech-unit sequence; the syllable features comprise the pronunciation of each basic speech unit and the pronunciation of the whole speech-unit sequence.
The comparative analysis against the teaching example speech comprises:

acquiring the teaching example speech stored in the system;

dividing the teaching example speech into basic speech units to obtain its basic speech units and speech-unit sequence;

extracting the temperament features of the teaching speech-unit sequence, these features corresponding to the temperament features of the user's speech-unit sequence;

comparing the temperament features of the user's speech-unit sequence with those of the teaching speech-unit sequence and giving a corresponding evaluation result.
The speech evaluation using the speech-prediction model comprises:

dividing the recorded user speech into basic speech units and extracting the temperament features to be assessed from the speech-unit sequence;

loading, for each temperament feature, the corresponding prediction model and predicting the corresponding standard pronunciation;

comparing the temperament features of the user's speech with those of the standard pronunciation to obtain a corresponding evaluation result.
The marking of the speech comparison results specifically comprises:

converting the recorded user speech into speech text;

marking the obtained evaluation result of the comparison against the teaching example speech and the evaluation result of the comparison against the standard speech predicted by the speech-prediction model on the speech text in a visualized manner, respectively, and displaying them to the user.
The present invention further provides a speech evaluation apparatus comprising a recording module, a storage module, a speech processing module, a feature extraction module, a speech analysis module, an evaluation module, an annotation module and a display module, characterized in that:

the recording module is configured to acquire the user's speech input;

the speech processing module is configured to divide the recorded speech into basic speech units to obtain the speech-unit sequence of the recorded speech;

the feature extraction module performs feature extraction on the speech-unit sequence to obtain its temperament features;

the speech analysis module compares the extracted temperament features with the teaching example speech and with the standard speech predicted by the speech-prediction model, respectively;

the annotation module marks the speech evaluation results on the text of the user's speech.

The speech evaluation apparatus further comprises a display module configured to display to the user the speech text annotated with the evaluation results.

By simultaneously providing the user with the evaluation result against the teaching example speech and the evaluation result against the standard speech predicted by the speech-prediction model, the speech evaluation method and apparatus of the present invention let the user fully understand his or her own pronunciation and improve its accuracy.
附图说明DRAWINGS
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art may obtain other drawings from the content of the embodiments and these drawings without creative effort.

Fig. 1 is a flowchart of a speech evaluation method according to an embodiment of the present invention; and

Fig. 2 is a structural diagram of a speech evaluation apparatus according to an embodiment of the present invention.
Detailed Description of the Embodiments
Before discussing the exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart describes the operations as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be rearranged. A process may be terminated when its operations are completed, but may also have additional steps not included in the figure.

"Speech evaluation apparatus" in this context means a "computer device", i.e. an intelligent electronic device that can carry out predetermined processing such as numerical computation and/or logical computation by running predetermined programs or instructions. It may comprise a processor and a memory, the processor executing instructions pre-stored in the memory to perform the predetermined processing; or the predetermined processing may be performed by hardware such as an ASIC, FPGA or DSP, or by a combination of the two.

The computer device includes user equipment and/or network equipment. The user equipment includes, but is not limited to, computers, smartphones, PDAs and the like; the network equipment includes, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of computers or network servers based on cloud computing, where cloud computing is a form of distributed computing: a super virtual computer composed of a group of loosely coupled computers. The computer device may operate alone to implement the present invention, or may access a network and implement the present invention by interacting with other computer devices in the network. The network in which the computer device is located includes, but is not limited to, the Internet, wide area networks, metropolitan area networks, local area networks, VPN networks and the like.

Those skilled in the art should understand that the "speech evaluation apparatus" described in the present invention may be the user equipment alone, i.e. the corresponding operations are performed by the user equipment; or it may be composed of the user equipment integrated with a network device or server, i.e. the user equipment cooperates with the network device to perform the corresponding operations.

It should be noted that the user equipment, network equipment, networks and so on are merely examples; other existing or future computer devices or networks, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.

Those skilled in the art should also understand that the present invention is applicable to both mobile and non-mobile terminals; for example, whether the user uses a mobile phone or a PC, the method or apparatus of the present invention can be used for provision and presentation.

The specific structural and functional details disclosed herein are merely representative and are for the purpose of describing exemplary embodiments of the present invention. The present invention may, however, be embodied in many alternative forms and should not be construed as limited only to the embodiments set forth herein.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments. As used herein, the singular forms "a" and "an" are intended to include the plural as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprising" and/or "including", as used herein, specify the presence of the stated features, integers, steps, operations, units and/or components, without excluding the presence or addition of one or more other features, integers, steps, operations, units, components and/or combinations thereof.

It should also be noted that in some alternative implementations the functions/acts mentioned may occur in an order different from that indicated in the drawings. For example, two figures shown in succession may in fact be executed substantially concurrently, or may sometimes be executed in the reverse order, depending on the functions/acts involved.

The present invention is described in further detail below with reference to the accompanying drawings.

Fig. 1 shows a flowchart of the speech evaluation method of the present invention.
In step S101, during the spoken follow-up stage of language learning, the user's speech input is recorded by the recording device of the speech evaluation apparatus.

Specifically, after studying the speech example in the teaching courseware, the user enters the follow-up stage, which triggers the recording device of the speech evaluation apparatus into the recording state. When the user begins to repeat the speech example, the recording device starts recording the user's speech and saves the follow-up speech in the storage module of the speech evaluation apparatus for further analysis.

In step S102, the user's follow-up speech recorded in the storage module is retrieved and divided into basic speech units, yielding the speech-unit sequence of the recorded speech.

The basic speech units may be syllables, phonemes or the like; by dividing the recorded speech, the basic speech units and the speech-unit sequence of the recorded speech are obtained.

Different speech recognition systems decode the speech signal using different acoustic features, such as acoustic models based on MFCC (Mel-Frequency Cepstrum Coefficient) features or on PLP (Perceptual Linear Predictive) features; using different acoustic models, such as HMM-GMM (Hidden Markov Model-Gaussian Mixture Model) or neural-network acoustic models based on DBN (Dynamic Bayesian Network); or using different decoding methods, such as Viterbi search or A* search.
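By way of illustration only, and not as part of the claimed subject matter, the following toy sketch shows the kind of Viterbi search named above. The state count, probabilities and all names are invented for this example; a real recognizer searches over HMM states scored by MFCC-based acoustic models and a language model.

```python
import numpy as np

# Toy Viterbi decoder over log-probabilities (illustrative only).
def viterbi(log_init, log_trans, log_emit):
    """log_init: (S,) initial log-probs; log_trans: (S, S) transition
    log-probs (row = previous state); log_emit: (T, S) emission log-probs.
    Returns the most likely state path as a list of state indices."""
    T, S = log_emit.shape
    score = log_init + log_emit[0]          # best score ending in each state
    back = np.zeros((T, S), dtype=int)      # backpointers
    for t in range(1, T):
        cand = score[:, None] + log_trans   # cand[prev, cur]
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(S)] + log_emit[t]
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):           # follow backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]

path = viterbi(np.log([0.6, 0.4]),
               np.log([[0.7, 0.3], [0.4, 0.6]]),
               np.log([[0.9, 0.2], [0.1, 0.8], [0.2, 0.7]]))
print(path)
```

The same argmax recursion applies unchanged whichever acoustic feature or model supplies the emission scores.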
In step S103, feature extraction is performed on the speech-unit sequence to obtain the temperament features of the speech-unit sequence.

The temperament features comprise prosodic features and syllable features. The prosodic features comprise the boundary features of each basic speech unit, its pronunciation duration, the pause time between adjacent basic speech units, and the pronunciation duration of the whole speech-unit sequence; the syllable features comprise the pronunciation of each basic speech unit and the pronunciation of the whole speech-unit sequence.
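As a non-limiting illustration of the prosodic features just listed, the sketch below derives per-unit durations, inter-unit pauses and the whole-sequence duration from time-aligned units. The (label, start, end) representation and all names are assumptions of this example, not taken from the embodiment.

```python
# Illustrative sketch of the prosodic-feature computation described above.
# A "unit" is assumed to be a (label, start_s, end_s) triple produced by the
# segmentation step; the numbers below are invented example timings.

def prosodic_features(units):
    durations = [end - start for _, start, end in units]   # per-unit duration
    pauses = [units[i + 1][1] - units[i][2]                # pause to next unit
              for i in range(len(units) - 1)]
    total = units[-1][2] - units[0][1]                     # whole-sequence duration
    return {"durations": durations, "pauses": pauses, "total": total}

user_units = [("syl1", 0.0, 0.35), ("syl2", 0.4, 0.8), ("syl3", 1.0, 1.6)]
feats = prosodic_features(user_units)
print(feats)
```

Boundary features would likewise be read off the start/end times of each unit.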
In step S104, the extracted temperament features are compared with the teaching example speech and with the standard speech predicted by the speech-prediction model, respectively.

The comparative analysis against the teaching example speech proceeds as follows: acquire the teaching example speech stored in the system, divide it into basic speech units to obtain its basic speech units and speech-unit sequence, and further extract the temperament features of the teaching speech-unit sequence, these features corresponding to the temperament features of the user's speech-unit sequence. The temperament features of the user's speech-unit sequence are then compared with those of the teaching speech-unit sequence and a corresponding evaluation result is given.
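A minimal sketch of this user-versus-teacher comparison, under the assumption (not fixed by the embodiment) that corresponding per-unit durations are compared by relative deviation against a tolerance:

```python
# Hypothetical comparison of user vs. teacher temperament features; the
# relative-deviation metric and the 25% tolerance are invented for this
# example, not prescribed by the embodiment.

def compare_durations(user_durs, teacher_durs, tolerance=0.25):
    """Flag each unit as consistent if its duration deviates from the
    corresponding teacher unit by at most `tolerance` (relative)."""
    flags = [abs(u - t) / t <= tolerance
             for u, t in zip(user_durs, teacher_durs)]
    score = 100.0 * sum(flags) / len(flags)   # share of consistent units
    return flags, score

flags, score = compare_durations([0.35, 0.40, 0.90], [0.30, 0.42, 0.55])
print(flags, round(score, 1))
```

Other temperament features (pauses, per-unit pronunciation scores) could be compared unit-by-unit in the same corresponding fashion.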
The speech evaluation using the speech-prediction model may employ existing speech evaluation technology: divide the recorded user speech into basic speech units, extract the temperament features to be assessed from the speech-unit sequence, load the corresponding prediction model for each temperament feature to predict the corresponding standard pronunciation, and then compare the temperament features of the user's speech with those of the standard pronunciation to obtain a corresponding evaluation result.
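The step "load the corresponding prediction model for each temperament feature" can be sketched as follows. Each "model" here is merely a stand-in callable with invented values; a real system would load a trained prediction model per feature type.

```python
# Stand-in "prediction models" keyed by temperament feature (hypothetical;
# every label and value below is invented for illustration).
predictors = {
    "duration": lambda label: {"syl1": 0.30, "syl2": 0.42, "syl3": 0.55}[label],
}

def evaluate_feature(labels, user_values, feature, tolerance=0.25):
    """Compare the user's values against the predicted standard values for
    one temperament feature; flag units within `tolerance` (relative)."""
    predict = predictors[feature]
    report = []
    for label, got in zip(labels, user_values):
        std = predict(label)                       # predicted standard value
        report.append((label, got, std, abs(got - std) / std <= tolerance))
    return report

report = evaluate_feature(["syl1", "syl2", "syl3"], [0.35, 0.40, 0.90], "duration")
print(report)
```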
In step S105, the speech comparison results are marked on the text of the user's speech and provided to the user.

In this step, the speech processing module further converts the recorded user speech into speech text. The evaluation result of the comparison against the teaching example speech and the evaluation result of the comparison against the standard speech predicted by the speech-prediction model, both obtained in step S104, are marked on the speech text in a visualized manner, respectively, and displayed to the user. From the displayed results the user can see how his or her pronunciation differs from that of the teaching example and from the standard speech predicted by the speech-prediction model, gaining a complete picture of the pronunciation problems in the read text; this helps the user make his or her pronunciation more standard. The comparison results may include pronunciation evaluation of each basic speech unit, pronunciation-duration evaluation of each basic speech unit, full-text fluency evaluation and the like.
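The visualized marking can be sketched as below; the markup symbols are invented purely for illustration ("*" marks a unit that differs from the teaching example, "#" one that differs from the predicted standard), and a real apparatus would render this graphically.

```python
# Illustrative text annotation combining both evaluation outcomes.
def annotate(words, teacher_ok, model_ok):
    """words: transcript tokens; teacher_ok/model_ok: per-word consistency
    flags from the two comparisons. Returns the annotated transcript."""
    out = []
    for w, t, m in zip(words, teacher_ok, model_ok):
        out.append(w + ("" if t else "*") + ("" if m else "#"))
    return " ".join(out)

text = annotate(["good", "morning"], [True, False], [False, False])
print(text)
```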
Fig. 2 shows a speech evaluation apparatus according to an embodiment of the present invention. The apparatus is used to implement the speech evaluation method of the present invention: after the user performs the spoken follow-up, it simultaneously provides the user with the evaluation result against the teaching example speech and the evaluation result against the standard speech predicted by the speech-prediction model. The apparatus comprises a recording module 1, a storage module 2, a speech processing module 3, a feature extraction module 4, a speech analysis module 5, an annotation module 6 and a display module 7.

During the spoken follow-up stage of language learning, the user's speech input is recorded by the recording module 1 of the speech evaluation apparatus.

Specifically, after studying the speech example in the teaching courseware, the user enters the follow-up stage and triggers the recording module 1 of the speech evaluation apparatus into the recording state. When the user begins to repeat the speech example, the recording module 1 starts recording the user's speech and saves the follow-up speech in the storage module 2 of the speech evaluation apparatus for further analysis.

The speech processing module 3 retrieves the user's recorded follow-up speech from the storage module 2 and divides it into basic speech units.

The basic speech units may be syllables, phonemes or the like; by dividing the recorded speech, the basic speech units and the speech-unit sequence of the recorded speech are obtained.

After the speech processing module 3 has divided the recorded speech into basic speech units, the feature extraction module 4 performs feature extraction on the resulting speech-unit sequence to obtain its temperament features.

The temperament features comprise prosodic features and syllable features. The prosodic features comprise the boundary features of each basic speech unit, its pronunciation duration, the pause time between adjacent basic speech units, and the pronunciation duration of the whole speech-unit sequence; the syllable features comprise the pronunciation of each basic speech unit and the pronunciation of the whole speech-unit sequence.

The speech analysis module 5 compares the extracted temperament features with the teaching example speech and with the standard speech predicted by the speech-prediction model, respectively.

The comparative analysis against the teaching example speech proceeds as follows: the speech analysis module 5 acquires the teaching example speech stored in the storage module 2, divides it into basic speech units to obtain its basic speech units and speech-unit sequence, and further extracts the temperament features of the teaching speech-unit sequence, these features corresponding to the temperament features of the user's speech-unit sequence. The temperament features of the user's speech-unit sequence are then compared with those of the teaching speech-unit sequence and a corresponding evaluation result is given.

The speech evaluation using the speech-prediction model may employ existing speech evaluation technology: divide the recorded user speech into basic speech units, extract the temperament features to be assessed from the speech-unit sequence, load the corresponding prediction model for each temperament feature to predict the corresponding standard pronunciation, and then compare the temperament features of the user's speech with those of the standard pronunciation to obtain a corresponding evaluation result.

The annotation module 6 marks the comparison results on the user's speech and provides them to the user through the display module 7.

Specifically, the speech processing module 3 further converts the recorded user speech into speech text. The evaluation result of the comparison against the teaching example speech and the evaluation result of the comparison against the standard speech predicted by the speech-prediction model, both produced by the speech analysis module 5, are marked on the speech text in a visualized manner, respectively, and displayed to the user through the display module 7. From the displayed results the user can see how his or her pronunciation differs from that of the teaching example and from the standard speech predicted by the speech-prediction model, gaining a complete picture of the pronunciation problems in the read text; this helps the user make his or her pronunciation more standard. The comparison results may include pronunciation evaluation of each basic speech unit, pronunciation-duration evaluation of each basic speech unit, full-text fluency evaluation and the like.
Those of ordinary skill in the art will understand that all or some of the steps of the various methods of the above embodiments may be carried out by a computer program instructing the relevant hardware; the program may be stored in a computer-readable storage medium and executed by a processor. The computer-readable storage medium may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc and the like.

The preferred embodiments of the present invention have been described above in order to make the spirit of the present invention clearer and easier to understand, not to limit it; any modification, substitution or improvement made within the spirit and principles of the present invention shall fall within the scope of protection defined by the appended claims.

Industrial Applicability

By simultaneously providing the user with the evaluation result against the teaching example speech and the evaluation result against the standard speech predicted by the speech-prediction model, the speech evaluation method and apparatus of the present invention let the user fully understand his or her own pronunciation and improve its accuracy.

Claims (15)

  1. A speech evaluation method for evaluating a user's pronunciation during language learning, characterized in that it comprises:
    Step S101: acquiring the user's speech input through the recording device of a speech evaluation apparatus;
    Step S102: dividing the recorded speech into basic speech units to obtain the speech-unit sequence of the recorded speech;
    Step S103: performing feature extraction on the speech-unit sequence to obtain the temperament features of the speech-unit sequence;
    Step S104: comparing the extracted temperament features with the teaching example speech and with the standard speech predicted by a speech-prediction model, respectively;
    Step S105: marking the speech comparison results on the text of the user's speech.
  2. The speech evaluation method according to claim 1, characterized in that:
    the basic speech units may be syllables, phonemes or the like, the basic speech units and the speech-unit sequence of the recorded speech being obtained by dividing the recorded speech.
  3. The speech evaluation method according to claim 1, characterized in that:
    the temperament features comprise prosodic features and syllable features, the prosodic features comprising the boundary features of each basic speech unit, its pronunciation duration, the pause time between adjacent basic speech units, and the pronunciation duration of the whole speech-unit sequence;
    the syllable features comprise the pronunciation of each basic speech unit and the pronunciation of the whole speech-unit sequence.
  4. The speech evaluation method according to claim 1, characterized in that:
    the comparative analysis against the teaching example speech comprises:
    acquiring the teaching example speech stored in the system;
    dividing the teaching example speech into basic speech units to obtain its basic speech units and speech-unit sequence;
    extracting the temperament features of the teaching speech-unit sequence, these features corresponding to the temperament features of the user's speech-unit sequence;
    comparing the temperament features of the user's speech-unit sequence with those of the teaching speech-unit sequence and giving a corresponding evaluation result.
  5. The speech evaluation method according to claim 1, characterized in that:
    the speech evaluation using the speech-prediction model comprises:
    dividing the recorded user speech into basic speech units and extracting the temperament features to be assessed from the speech-unit sequence;
    loading, for each temperament feature, the corresponding prediction model and predicting the corresponding standard pronunciation;
    comparing the temperament features of the user's speech with those of the standard pronunciation to obtain a corresponding evaluation result.
  6. The speech evaluation method according to claim 1, characterized in that:
    the marking of the speech comparison results specifically comprises:
    converting the recorded user speech into speech text;
    marking the obtained evaluation result of the comparison against the teaching example speech and the evaluation result of the comparison against the standard speech predicted by the speech-prediction model on the speech text in a visualized manner, respectively, and displaying them to the user.
  7. A speech evaluation apparatus comprising a recording module, a storage module, a speech processing module, a feature extraction module, a speech analysis module and an annotation module, characterized in that:
    the recording module is configured to acquire the user's speech input;
    the speech processing module is configured to divide the recorded speech into basic speech units to obtain the speech-unit sequence of the recorded speech;
    the feature extraction module performs feature extraction on the speech-unit sequence to obtain its temperament features;
    the speech analysis module compares the extracted temperament features with the teaching example speech and with the standard speech predicted by the speech-prediction model, respectively;
    the annotation module marks the speech evaluation results on the text of the user's speech.
  8. The speech evaluation apparatus according to claim 7, wherein:
    the basic speech units may be syllables, phonemes, or the like, and dividing the recorded speech yields the basic speech units and the speech unit sequence of the recorded speech.
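Claim 8 leaves the segmentation granularity open (syllables, phonemes, or the like). Below is a toy energy-threshold segmenter over a frame-level amplitude envelope; the threshold value and the envelope representation are assumptions for illustration only, not the claimed segmentation method.

```python
# Hypothetical sketch: split an amplitude envelope into voiced segments,
# treating each voiced stretch as one "basic speech unit".

def split_basic_units(envelope, threshold=0.1):
    """Return (start_frame, end_frame) pairs for each voiced segment.

    The ordered list of pairs is the speech unit sequence of the recording.
    """
    units, start = [], None
    for i, amp in enumerate(envelope):
        if amp > threshold and start is None:
            start = i                      # a unit begins
        elif amp <= threshold and start is not None:
            units.append((start, i))       # the unit ends at silence
            start = None
    if start is not None:                  # close a unit running to the end
        units.append((start, len(envelope)))
    return units

seq = split_basic_units([0.0, 0.5, 0.6, 0.0, 0.0, 0.4, 0.3, 0.0])
```

Production systems typically use forced alignment against the expected text rather than raw energy thresholds, but the output shape (an ordered unit sequence with boundaries) is the same.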
  9. The speech evaluation apparatus according to claim 7, wherein:
    the temperament features comprise prosody features and syllable features; the prosody features include the boundary features and pronunciation duration of each basic speech unit, the pause time between adjacent basic speech units, and the pronunciation duration of the entire speech unit sequence; the syllable features include the pronunciation of each basic speech unit and the pronunciation of the entire speech unit sequence.
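The prosody features listed in claim 9 follow directly from the unit boundary timestamps. This sketch derives per-unit durations, inter-unit pauses, and total utterance duration from (start, end) times in seconds; the example timestamps are made up for illustration.

```python
# Hypothetical sketch: compute claim-9 prosody features from unit boundaries.

def prosody_features(boundaries):
    """boundaries: ordered list of (start_s, end_s) per basic speech unit."""
    durations = [end - start for start, end in boundaries]
    pauses = [boundaries[i + 1][0] - boundaries[i][1]
              for i in range(len(boundaries) - 1)]
    total = boundaries[-1][1] - boundaries[0][0]
    return {"durations": durations, "pauses": pauses, "total": total}

feats = prosody_features([(0.0, 0.3), (0.4, 0.8), (0.9, 1.2)])
```

The syllable features (the pronunciation of each unit) would come from the acoustic content of each segment rather than its timing, and are not shown here.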
  10. The speech evaluation apparatus according to claim 7, wherein:
    the process of comparative analysis against the teaching example speech comprises:
    obtaining the teaching example speech stored in the system;
    dividing the teaching example speech into basic speech units to obtain the basic speech units and the speech unit sequence of the teaching example speech;
    extracting the temperament features of the teaching speech unit sequence, the temperament features of the teaching speech unit sequence corresponding to the temperament features of the user's speech unit sequence; and
    comparing the temperament features of the user's speech unit sequence with the temperament features of the teaching speech unit sequence and giving a corresponding evaluation result.
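The comparison step above can be sketched as a per-feature relative-deviation check against the teaching example. The 15% tolerance, the feature names, and the two-band "good"/"retry" verdict are assumptions chosen for the example, not values from the patent.

```python
# Hypothetical sketch: grade user temperament features against the matching
# features extracted from the teaching example speech.

def grade(user_feats, teacher_feats, tolerance=0.15):
    """Return a per-feature verdict: 'good' within tolerance, else 'retry'."""
    report = {}
    for name, ref in teacher_feats.items():
        user = user_feats[name]
        # Relative deviation; fall back to absolute when the reference is 0.
        deviation = abs(user - ref) / ref if ref else abs(user)
        report[name] = "good" if deviation <= tolerance else "retry"
    return report

result = grade(
    {"unit_duration": 0.32, "pause": 0.20, "total": 1.4},
    {"unit_duration": 0.30, "pause": 0.10, "total": 1.3},
)
```

Because the teaching features were extracted with the same pipeline as the user's (claim 10's "corresponding" requirement), the two dictionaries share keys and the comparison is a simple aligned lookup.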
  11. The speech evaluation apparatus according to claim 7, wherein:
    the process of performing speech evaluation with the speech prediction model comprises:
    dividing the recorded user speech into basic speech units and extracting the temperament features to be evaluated from the speech unit sequence;
    loading, for each distinct temperament feature, the corresponding prediction model and predicting the corresponding standard pronunciation; and
    comparing the temperament features of the user's speech with the temperament features of the standard pronunciation to obtain a corresponding evaluation result.
  12. The speech evaluation apparatus according to claim 7, wherein:
    the process of annotating the speech comparison results specifically comprises:
    converting the recorded user speech into a speech transcript; and
    visually annotating, on the speech transcript, the evaluation result of the comparison against the teaching example speech and the evaluation result of the comparison against the standard speech predicted by the speech prediction model, and displaying the annotated transcript to the user.
  13. The speech evaluation apparatus according to claim 7, wherein:
    the speech evaluation apparatus further comprises a display module configured to display to the user the user speech transcript annotated with the speech evaluation results.
  14. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method steps of any one of claims 1-6.
  15. A computer storage medium storing a program executable by a computer, wherein the method steps of any one of claims 1-6 are implemented when the program is executed.
PCT/CN2017/111822 2017-10-20 2017-11-20 Voice evaluation method and apparatus WO2019075828A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710996819.1A CN109697988B (en) 2017-10-20 2017-10-20 Voice evaluation method and device
CN201710996819.1 2017-10-20

Publications (1)

Publication Number Publication Date
WO2019075828A1 (en) 2019-04-25

Family

ID=66172985

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/111822 WO2019075828A1 (en) 2017-10-20 2017-11-20 Voice evaluation method and apparatus

Country Status (2)

Country Link
CN (1) CN109697988B (en)
WO (1) WO2019075828A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111081080B (en) * 2019-05-29 2022-05-03 广东小天才科技有限公司 Voice detection method and learning device
CN110534100A (en) * 2019-08-27 2019-12-03 北京海天瑞声科技股份有限公司 A kind of Chinese speech proofreading method and device based on speech recognition
CN110910687A (en) * 2019-12-04 2020-03-24 深圳追一科技有限公司 Teaching method and device based on voice information, electronic equipment and storage medium
CN112767932A (en) * 2020-12-11 2021-05-07 北京百家科技集团有限公司 Voice evaluation system, method, device, equipment and computer readable storage medium
CN113192494A (en) * 2021-04-15 2021-07-30 辽宁石油化工大学 Intelligent English language identification and output system and method

Citations (6)

Publication number Priority date Publication date Assignee Title
US20040006468A1 (en) * 2002-07-03 2004-01-08 Lucent Technologies Inc. Automatic pronunciation scoring for language learning
CN1750121A (en) * 2004-09-16 2006-03-22 北京中科信利技术有限公司 A kind of pronunciation evaluating method based on speech recognition and speech analysis
CN101739870A (en) * 2009-12-03 2010-06-16 深圳先进技术研究院 Interactive language learning system and method
CN103928023A (en) * 2014-04-29 2014-07-16 广东外语外贸大学 Voice scoring method and system
CN104732977A (en) * 2015-03-09 2015-06-24 广东外语外贸大学 On-line spoken language pronunciation quality evaluation method and system
US20150287339A1 (en) * 2014-04-04 2015-10-08 Xerox Corporation Methods and systems for imparting training

Family Cites Families (8)

Publication number Priority date Publication date Assignee Title
US20060057545A1 (en) * 2004-09-14 2006-03-16 Sensory, Incorporated Pronunciation training method and apparatus
CN101246685B (en) * 2008-03-17 2011-03-30 清华大学 Pronunciation quality evaluation method of computer auxiliary language learning system
CN103514765A (en) * 2013-10-28 2014-01-15 苏州市思玛特电力科技有限公司 Language teaching assessment method
CN103559894B (en) * 2013-11-08 2016-04-20 科大讯飞股份有限公司 Oral evaluation method and system
CN203773766U (en) * 2014-04-10 2014-08-13 滕坊坪 Language learning machine
CN105825852A (en) * 2016-05-23 2016-08-03 渤海大学 Oral English reading test scoring method
CN106971647A (en) * 2017-02-07 2017-07-21 广东小天才科技有限公司 A kind of Oral Training method and system of combination body language
CN107067834A (en) * 2017-03-17 2017-08-18 麦片科技(深圳)有限公司 Point-of-reading system with oral evaluation function


Also Published As

Publication number Publication date
CN109697988B (en) 2021-05-14
CN109697988A (en) 2019-04-30


Legal Events

Code Description
121: EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 17929286; Country of ref document: EP; Kind code of ref document: A1)
NENP: Non-entry into the national phase (Ref country code: DE)
122: EP: PCT application non-entry in European phase (Ref document number: 17929286; Country of ref document: EP; Kind code of ref document: A1)