WO2019075828A1 - Voice evaluation method and apparatus - Google Patents

Voice evaluation method and apparatus

Info

Publication number
WO2019075828A1
WO2019075828A1, PCT/CN2017/111822, CN2017111822W
Authority
WO
WIPO (PCT)
Prior art keywords
speech
voice
sequence
temperament
user
Prior art date
Application number
PCT/CN2017/111822
Other languages
French (fr)
Chinese (zh)
Inventor
卢炀
宾晓皎
李明
蔡泽鑫
Original Assignee
深圳市鹰硕音频科技有限公司
Application filed by 深圳市鹰硕音频科技有限公司 filed Critical 深圳市鹰硕音频科技有限公司
Publication of WO2019075828A1 publication Critical patent/WO2019075828A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • GPHYSICS
    • G09EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09BEDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B5/00Electrically-operated educational appliances
    • G09B5/04Electrically-operated educational appliances with audible presentation of the material to be studied
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems

Definitions

  • the invention relates to the field of multimedia teaching technology, in particular to a voice evaluation method and device for multimedia teaching.
  • CN101197084A discloses an automated spoken-English evaluation and learning system whose spoken-pronunciation detection part comprises the following steps: (1) building a standard-speaker corpus: 1) finding standard English speakers; 2) designing a first recording script according to oral-English learning requirements and the principle of phoneme balance; 3) having the standard speakers record the script; (2) collecting an oral-evaluation corpus: in a simulated English-learning software environment, designing a second recording script according to the learning requirements, finding ordinary speakers, and recording their spoken pronunciation; (3) annotating the oral-evaluation corpus: experts mark in detail whether each phoneme in each word is pronounced correctly; (4) building a standard-speech acoustic model: training an acoustic model of standard speech on the recordings in the standard-speaker corpus and their associated texts; (5) calculating error-detection parameters of the speech.
  • CN101650886A discloses a method for automatically detecting a language learner's reading errors, comprising the following steps: 1) front-end processing: pre-processing the input speech and performing feature extraction, the extracted features being MFCC feature vectors; 2) building a reduced search space: taking the content the user is to read aloud as the reference answer, and building a compact search space from the reference answer, a pronunciation dictionary, a multi-pronunciation model, and an acoustic model; 3) building a reading language model: constructing from the reference answer a language model describing the context the user may utter while reading the reference sentence, together with its probability information; 4) searching: finding, in the search space, the path that best matches the input feature-vector stream according to the acoustic model, the reading language model, and the multi-pronunciation model, and taking it as the user's actual reading content to form a recognition-result sequence; 5) alignment: aligning the reference answer with the recognition result to obtain detection results for insertions, omissions, and misreadings.
  • In the prior art, a speech recognition system is used to acquire the speech segments corresponding to each basic speech unit in a speech signal; the acquired segments are fused to obtain a sequence of valid speech segments corresponding to the signal; evaluation features are extracted from that sequence; a score prediction model corresponding to the feature type of the evaluation features is loaded; the similarity of the evaluation features under the score prediction model is computed; and that similarity is used as the score of the speech signal.
  • However, when a pronunciation prediction model is used to evaluate the user's pronunciation, the predicted standard pronunciation often differs from the teaching speech example in certain aspects (such as pitch and rhythm).
  • The evaluation result is therefore a comparison between the user's speech and the predicted speech, and does not truly reflect a comparison between the user's speech and the teaching speech example.
  • The technical problem to be solved by the present invention is how, during language learning, to simultaneously provide the user with an evaluation result compared against the teaching example speech and an evaluation result compared against the standard speech predicted by the speech prediction model, so as to help the user fully understand his or her own learning situation.
  • the present invention provides a speech evaluation method for evaluating a user's language pronunciation in a language learning process, which is characterized by:
  • Step S101: acquiring the user's voice input through a recording device of the voice evaluation apparatus;
  • Step S102: dividing the recorded voice into basic speech units to obtain the speech-unit sequence of the recorded voice;
  • Step S103: performing feature extraction on the speech-unit sequence to obtain its temperament features;
  • Step S104: comparing the extracted temperament features respectively with the teaching example speech and with the standard speech predicted by the speech prediction model;
  • Step S105: marking the comparison results on the user's speech text.
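The five claimed steps can be sketched as a small pipeline of plain functions. This is an illustrative toy, not the patent's implementation: the token-level "segmentation" and length-based "features" are hypothetical stand-ins for real acoustic-model-based unit division and temperament feature extraction.

```python
# Hypothetical sketch of the S101-S105 pipeline; segmentation and feature
# extraction here are toy stand-ins for acoustic-model-based processing.

def divide_into_units(recording):            # S102: basic speech units
    return recording.split()                 # toy: one "unit" per token

def extract_features(units):                 # S103: temperament features
    return [{"unit": u, "duration": len(u)} for u in units]

def compare(features, reference_features):   # S104: per-unit comparison
    return [
        {"unit": f["unit"], "ok": f["duration"] == r["duration"]}
        for f, r in zip(features, reference_features)
    ]

def annotate(units, results):                # S105: mark results on the text
    return " ".join(u if r["ok"] else f"[{u}]" for u, r in zip(units, results))

user = divide_into_units("gud morning")      # S101: recorded user speech (toy)
teacher = divide_into_units("good morning")
results = compare(extract_features(user), extract_features(teacher))
print(annotate(user, results))               # → [gud] morning
```

In the patented method the same comparison would be run twice, once against the teaching example and once against the model-predicted standard speech.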
  • the basic speech unit may be a syllable, a phoneme or the like, and the basic speech unit and the speech unit sequence of the recorded speech are obtained by dividing the recorded speech.
  • The temperament features include prosody features and syllable features. The prosody features include the boundary features of each basic speech unit, its pronunciation duration, the pause duration between adjacent basic speech units, and the pronunciation duration of the entire speech-unit sequence; the syllable features include the pronunciation of each basic speech unit and the pronunciation of the entire speech-unit sequence.
  • the process of comparative analysis with the teaching example speech includes:
  • the temperament features of the user phonetic unit sequence are compared with the temperament features of the teaching phonetic unit sequence, and the corresponding evaluation results are given.
  • the process of speech evaluation using a speech prediction model includes:
  • the temperament characteristics of the user's voice are compared with the temperament characteristics of the standard pronunciation, and the corresponding evaluation results are obtained.
  • The evaluation result of the comparison with the teaching example speech and the evaluation result of the comparison with the standard speech predicted by the speech prediction model are each visually marked on the speech text and displayed to the user.
  • The present invention also provides a voice evaluation device, which includes a recording module, a storage module, a voice processing module, a feature extraction module, a voice analysis module, an evaluation module, an annotation module, and a display module, characterized in that:
  • the recording module acquires the user's voice input;
  • the voice processing module divides the recorded voice into basic speech units to obtain the speech-unit sequence of the recorded voice;
  • the feature extraction module performs feature extraction on the speech-unit sequence to acquire its temperament features;
  • the speech analysis module compares the extracted temperament features respectively with the teaching example speech and with the standard speech predicted by the speech prediction model;
  • the annotation module marks the speech evaluation results on the user's speech text.
  • the voice evaluation device further includes a display module for displaying the user voice text with the voice evaluation result annotation to the user.
  • The speech evaluation method and apparatus of the present invention provide the user both with the evaluation result of comparing the user's speech against the teaching example speech and with the evaluation result of comparing it against the standard speech predicted by the speech prediction model, so that the user fully understands his or her pronunciation status and improves pronunciation accuracy.
  • FIG. 1 is a flowchart of a voice evaluation method according to an embodiment of the present invention.
  • FIG. 2 is a structural diagram of a voice evaluation apparatus according to an embodiment of the present invention.
  • A "speech evaluation device", as used herein, is a computer device: an intelligent electronic device that performs predetermined processing such as numerical and/or logical computation by running predetermined programs or instructions. It may include a processor and a memory, the processor executing program instructions pre-stored in the memory to carry out the predetermined processing; alternatively, the processing may be carried out by hardware such as an ASIC, FPGA, or DSP, or by a combination of the two.
  • the computer device includes a user device and/or a network device.
  • the user equipment includes, but is not limited to, a computer, a smart phone, a PDA, etc.
  • The network device includes, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud of servers based on cloud computing, where cloud computing is a type of distributed computing: a virtual supercomputer composed of a group of loosely coupled computers.
  • the computer device can be operated separately to implement the present invention, and can also access the network and implement the present invention by interacting with other computer devices in the network.
  • the network in which the computer device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a VPN network, and the like.
  • The "speech evaluation device" described in the present invention may be only a user equipment, that is, the user equipment performs the corresponding operations; or it may be composed of a user equipment integrated with a network device or server, that is, the user equipment and the network device cooperate to perform the corresponding operations.
  • The user equipment, network equipment, and networks described above are merely examples; other existing or future computer devices or networks, where applicable to the present invention, also fall within its scope and are incorporated herein by reference.
  • the present invention can be applied to mobile terminals and non-mobile terminals.
  • The method or apparatus according to the present invention can be used to provide and present speech evaluation results.
  • Fig. 1 shows a flow chart of a speech evaluation method of the present invention.
  • In step S101, the user's voice input is captured through the recording device of the voice evaluation apparatus during the speaking/read-aloud step of language learning.
  • the recording device in the voice evaluation device is triggered to enter the recording state.
  • The recording device starts recording and saves the user's follow-along speech in the storage module of the voice evaluation device for further analysis and use.
  • In step S102, the user's follow-along speech recorded in the storage module is acquired and divided into basic speech units, obtaining the speech-unit sequence of the recorded follow-along speech.
  • the basic speech unit may be a syllable, a phoneme or the like, and the basic speech unit and the speech unit sequence of the recorded speech are obtained by dividing the recorded speech.
  • To perform the division, the speech signal may be decoded using acoustic models based on MFCC (Mel-Frequency Cepstral Coefficient) or PLP (Perceptual Linear Prediction) features, with different acoustic model types such as HMM-GMM (Hidden Markov Model with Gaussian Mixture Model) or neural-network acoustic models based on DBN (Dynamic Bayesian Network), and with different decoding methods such as Viterbi search or A* search.
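Viterbi search, one of the decoding options named above, can be sketched for a toy two-state HMM; all states and probabilities here are invented for illustration and are far simpler than a real acoustic model.

```python
# Minimal Viterbi decoding sketch: finds the most likely state sequence
# for an observation sequence under toy HMM parameters.
def viterbi(obs, states, start_p, trans_p, emit_p):
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        V.append({})
        for s in states:
            prob, path = max(
                (V[-2][p][0] * trans_p[p][s] * emit_p[s][o], V[-2][p][1] + [s])
                for p in states
            )
            V[-1][s] = (prob, path)
    return max(V[-1].values())[1]   # best path in the final column

states = ("sil", "speech")
start_p = {"sil": 0.7, "speech": 0.3}
trans_p = {"sil": {"sil": 0.6, "speech": 0.4},
           "speech": {"sil": 0.2, "speech": 0.8}}
emit_p = {"sil": {"low": 0.8, "high": 0.2},
          "speech": {"low": 0.3, "high": 0.7}}
print(viterbi(["low", "high", "high"], states, start_p, trans_p, emit_p))
# → ['sil', 'speech', 'speech']
```

A real decoder would work in log probabilities over phoneme-level HMM states, but the dynamic-programming recurrence is the same.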
  • In step S103, feature extraction is performed on the speech-unit sequence to obtain its temperament features.
  • The temperament features include prosody features and syllable features. The prosody features include the boundary features of each basic speech unit, its pronunciation duration, the pause duration between adjacent basic speech units, and the pronunciation duration of the entire speech-unit sequence; the syllable features include the pronunciation of each basic speech unit and the pronunciation of the entire speech-unit sequence.
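Assuming unit boundaries are available as (unit, start, end) times in seconds (an assumed alignment format; the patent does not specify one), the duration and pause components of the prosody features described above could be derived like this:

```python
# Derive per-unit durations, inter-unit pauses, and total duration from a
# hypothetical (unit, start_s, end_s) alignment.
def prosody_features(alignment):
    durations = [(u, round(end - start, 3)) for u, start, end in alignment]
    pauses = [
        round(alignment[i + 1][1] - alignment[i][2], 3)
        for i in range(len(alignment) - 1)
    ]
    total = round(alignment[-1][2] - alignment[0][1], 3)
    return {"durations": durations, "pauses": pauses, "total": total}

alignment = [("good", 0.10, 0.42), ("mor", 0.55, 0.80), ("ning", 0.80, 1.10)]
feats = prosody_features(alignment)
print(feats["durations"])  # → [('good', 0.32), ('mor', 0.25), ('ning', 0.3)]
print(feats["pauses"])     # → [0.13, 0.0]
print(feats["total"])      # → 1.0
```

The boundary and syllable-pronunciation features would come from the acoustic model itself and are not shown here.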
  • In step S104, the extracted temperament features are compared and analyzed respectively with the teaching example speech and with the standard speech predicted by the speech prediction model.
  • The comparison with the teaching example speech proceeds as follows: the teaching example speech saved in the system is acquired and divided into basic speech units, yielding the basic speech units and speech-unit sequence of the teaching example speech, from which the temperament features of the teaching speech-unit sequence, corresponding to those of the user's speech-unit sequence, are then extracted.
  • the temperament features of the user phonetic unit sequence are compared with the temperament features of the teaching phonetic unit sequence, and the corresponding evaluation results are given.
  • The speech evaluation using the speech prediction model may adopt existing speech evaluation technology: the recorded user speech is divided into basic speech units, the temperament features to be evaluated are extracted from the speech-unit sequence, and prediction models corresponding to the different temperament features are loaded.
  • The prediction models predict the corresponding standard pronunciation, and the temperament features of the user's speech are then compared with those of the standard pronunciation to obtain the corresponding evaluation results.
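One hypothetical way to realize the dual comparison of step S104 is to score the user's per-unit features once against the teaching example and once against the model-predicted standard pronunciation. The duration-only features, the tolerance, and the scoring rule are illustrative assumptions, not values from the patent:

```python
# Toy dual comparison: user temperament features vs. the teaching example
# and vs. the model-predicted standard pronunciation.
def score(user, reference, tol=0.05):
    per_unit = {
        u: abs(d - reference[u]) <= tol for u, d in user.items() if u in reference
    }
    overall = sum(per_unit.values()) / len(per_unit)
    return per_unit, overall

user_durations = {"good": 0.32, "mor": 0.40, "ning": 0.30}
teacher_durations = {"good": 0.30, "mor": 0.26, "ning": 0.31}
model_durations = {"good": 0.28, "mor": 0.25, "ning": 0.27}

vs_teacher, t_score = score(user_durations, teacher_durations)
vs_model, m_score = score(user_durations, model_durations)
print(t_score, m_score)  # two separate evaluation results for the same speech
```

The point of the patent is precisely that both results are produced and shown, since the teacher's example and the model's prediction can legitimately differ.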
  • In step S105, the comparison results are marked on the user's speech text and provided to the user.
  • The recorded user speech is further converted into speech text by the speech processing module.
  • The evaluation result of the comparison with the teaching example speech obtained in step S104 and the evaluation result of the comparison with the standard speech predicted by the speech prediction model are each visually marked on the speech text and displayed to the user. Through the displayed evaluation results, the user can see both how his or her pronunciation differs from the teaching example and how it differs from the model-predicted standard speech, and thus fully understand what problems exist in the pronunciation of the read text, helping the user further improve pronunciation accuracy.
  • the comparison result may include a pronunciation evaluation of the basic speech unit, a pronunciation duration evaluation of the basic speech unit, a full-text fluency evaluation, and the like.
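A minimal sketch of marking both evaluation results on the speech text, as in step S105. The per-unit boolean results and the '*'/'^' marker syntax are invented for illustration; a real device would likely use colors or tooltips on the displayed text:

```python
# Mark both evaluation results on the speech text: units flagged by the
# teacher comparison get '*', units flagged by the model comparison get '^'.
def annotate_text(units, vs_teacher, vs_model):
    out = []
    for u in units:
        mark = ""
        if not vs_teacher.get(u, True):
            mark += "*"
        if not vs_model.get(u, True):
            mark += "^"
        out.append(u + mark)
    return " ".join(out)

units = ["good", "mor", "ning"]
vs_teacher = {"good": True, "mor": False, "ning": True}
vs_model = {"good": True, "mor": False, "ning": False}
print(annotate_text(units, vs_teacher, vs_model))  # → good mor*^ ning^
```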
  • FIG. 2 shows a speech evaluation apparatus according to an embodiment of the present invention.
  • The voice evaluation device is used to implement the voice evaluation method of the present invention; after the user performs spoken follow-along reading, it simultaneously provides the user with the evaluation result of the comparison with the teaching example speech and the evaluation result of the comparison with the standard speech predicted by the speech prediction model.
  • the voice evaluation device includes a recording module 1, a storage module 2, a voice processing module 3, a feature extraction module 4, a voice analysis module 5, an annotation module 6, and a display module 7.
  • The recording module 1 of the voice evaluation device captures the user's voice input during the speaking/read-aloud part of language learning.
  • After learning the speech example in the courseware, the user enters the follow-along step and triggers the recording module 1 of the voice evaluation device into the recording state.
  • the recording module 1 starts recording the user voice, and saves the user's follow-up voice in the storage module 2 of the voice evaluation device for further analysis and use.
  • the voice processing module 3 acquires the user-followed voice recorded in the storage module 2, and performs basic voice unit division on the recorded voice.
  • the basic speech unit may be a syllable, a phoneme or the like, and the basic speech unit and the speech unit sequence of the recorded speech are obtained by dividing the recorded speech.
  • the feature extraction module 4 further performs feature extraction on the generated speech unit sequence to obtain the temperament feature of the speech unit sequence.
  • The temperament features include prosody features and syllable features. The prosody features include the boundary features of each basic speech unit, its pronunciation duration, the pause duration between adjacent basic speech units, and the pronunciation duration of the entire speech-unit sequence; the syllable features include the pronunciation of each basic speech unit and the pronunciation of the entire speech-unit sequence.
  • The speech analysis module 5 compares and analyzes the extracted temperament features respectively with the teaching example speech and with the standard speech predicted by the speech prediction model.
  • the process of comparing and analyzing with the teaching example voice is as follows.
  • The voice analysis module 5 obtains the teaching example speech saved in the storage module 2 and divides it into basic speech units, thereby obtaining the basic speech units and the speech-unit sequence of the teaching example speech, and further extracts the temperament features of the teaching speech-unit sequence, which correspond to the temperament features of the user's speech-unit sequence.
  • the temperament features of the user phonetic unit sequence are compared with the temperament features of the teaching phonetic unit sequence, and the corresponding evaluation results are given.
  • The speech evaluation using the speech prediction model may adopt existing speech evaluation technology: the recorded user speech is divided into basic speech units, the temperament features to be evaluated are extracted from the speech-unit sequence, and prediction models corresponding to the different temperament features are loaded.
  • The prediction models predict the corresponding standard pronunciation, and the temperament features of the user's speech are then compared with those of the standard pronunciation to obtain the corresponding evaluation results.
  • The annotation module 6 marks the speech comparison results on the user's speech text and provides them to the user through the display module 7.
  • the recorded user voice is further converted into a voice text by the voice processing module 3.
  • The evaluation result of the comparison with the teaching example speech and the evaluation result of the comparison with the standard speech predicted by the speech prediction model are each visually marked on the speech text and displayed to the user through the display module. Through the displayed evaluation results, the user can see both how his or her pronunciation differs from the teaching example and how it differs from the model-predicted standard speech, and thus fully understand what problems exist in the pronunciation of the read text, helping the user further improve pronunciation accuracy.
  • the comparison result may include a pronunciation evaluation of the basic speech unit, a pronunciation duration evaluation of the basic speech unit, a full-text fluency evaluation, and the like.
  • the computer readable storage medium may include a read only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, and the like.
  • The speech evaluation method and apparatus of the present invention provide the user both with the evaluation result of comparing the user's speech against the teaching example speech and with the evaluation result of comparing it against the standard speech predicted by the speech prediction model, so that the user fully understands his or her pronunciation status and improves pronunciation accuracy.

Abstract

A voice evaluation method for evaluating a user's language pronunciation during language learning comprises the following steps: step S101, acquiring a voice input of the user through a recording device of a voice evaluation apparatus; step S102, dividing the recorded voice into basic voice units to obtain a sequence of voice units of the recorded voice; step S103, performing feature extraction on the sequence of voice units to acquire its temperament characteristics; step S104, comparing the extracted temperament characteristics respectively with a teaching example voice and with a standard voice predicted by a voice prediction model; and step S105, marking the voice comparison results on the user's voice text.

Description

Voice evaluation method and device

Technical Field

The present invention relates to the field of multimedia teaching technology, and in particular to a voice evaluation method and device for multimedia teaching.

Background Art
As a communication tool, language occupies a very important place in life and work; whether during students' school years or during people's working lives, oral language learning is something people attach great importance to. With the continuing spread of online teaching, network-based instruction, unconstrained by time and place, has become popular, and many users now prefer to use their spare time to learn languages over the Internet.

In current online teaching, pronunciation practice is typically handled in one of three ways: after a video (or audio) clip plays a passage, the user is given free time to practice reading it aloud alone; or the user's reading is recorded and played back so that the user can self-assess whether the pronunciation is accurate; or a teacher teaches online and gives guidance and suggestions on the user's pronunciation. These existing approaches either cannot give targeted guidance on the learner's pronunciation, leading to poor learning results, or require online teaching by a teacher, demanding substantial human, material, and financial resources.
To solve the above problems, evaluating the learner's speech with a speech prediction model has been proposed. CN101197084A discloses an automated spoken-English evaluation and learning system whose spoken-pronunciation detection part comprises the following steps: (1) building a standard-speaker corpus: 1) finding standard English speakers; 2) designing a first recording script according to oral-English learning requirements and the principle of phoneme balance; 3) having the standard speakers record the script; (2) collecting an oral-evaluation corpus: in a simulated English-learning software environment, designing a second recording script according to the learning requirements, finding ordinary speakers, and recording their spoken pronunciation; (3) annotating the oral-evaluation corpus: experts mark in detail whether each phoneme in each word is pronounced correctly; (4) building a standard-speech acoustic model: training an acoustic model of standard speech on the recordings in the standard-speaker corpus and their associated texts; (5) calculating error-detection parameters of the speech: 1) extracting the Mel cepstral (MFCC) parameters of the speech; 2) based on the standard acoustic model and the phoneme sequences corresponding to the ordinary speakers' recordings and texts in the evaluation corpus, automatically segmenting the ordinary speakers' speech into phoneme-level segments and computing, under the standard model, a first likelihood of each segment being that phoneme; 3) recognizing each segment of the ordinary speakers' speech with the standard acoustic model and computing a second likelihood of the segment being the recognized phoneme; 4) dividing the first likelihood by the second likelihood to obtain the segment's likelihood ratio, used as the error-detection parameter of that speech segment; (6) building an error-detection mapping model from the error-detection parameters to the experts' pronunciation-error annotations: on a batch of evaluation speech, associating each segment's evaluation parameters and formant sequence with the experts' detailed annotations, obtaining the correspondence between these parameters and the annotations statistically, and saving these relations as the mapping model from error-detection parameters to expert pronunciation-error labels.
CN101650886A discloses a method for automatically detecting a language learner's reading errors, comprising the following steps: 1) front-end processing: pre-processing the input speech and performing feature extraction, the extracted features being MFCC feature vectors; 2) building a reduced search space: taking the content the user is to read aloud as the reference answer, and building a compact search space from the reference answer, a pronunciation dictionary, a multi-pronunciation model, and an acoustic model; 3) building a reading language model: constructing from the reference answer a language model describing the context the user may utter while reading the reference sentence, together with its probability information; 4) searching: finding, in the search space, the path that best matches the input feature-vector stream according to the acoustic model, the reading language model, and the multi-pronunciation model, and taking it as the user's actual reading content to form a recognition-result sequence; 5) alignment: aligning the reference answer with the recognition result to obtain detection results for insertions, omissions, and misreadings.
In the prior art, a speech recognition system acquires the speech segments corresponding to each basic speech unit in a speech signal, fuses the acquired segments into a sequence of valid speech segments, extracts evaluation features from that sequence, loads a score prediction model corresponding to the feature type, computes the similarity of the evaluation features under the model, and uses that similarity as the score of the speech signal. In actual language learning, however, users usually learn pronunciation from the teacher's speech examples in teaching videos (or audio), and because of individual characteristics the teacher's examples rarely coincide exactly with the standard pronunciation predicted by a speech prediction model. When a prediction model is used to evaluate the user's pronunciation, its predicted standard pronunciation therefore often differs from the teaching speech example in certain aspects (such as pitch and rhythm), so the resulting evaluation compares the user's speech with the predicted speech and does not truly reflect a comparison between the user's speech and the teaching speech example.

It is therefore necessary to provide a speech evaluation method that, in addition to the evaluation result produced by the speech prediction model, also gives an evaluation result compared against the teaching speech example, so that the user can fully understand his or her own learning situation.
Summary of the Invention
The technical problem to be solved by the present invention is therefore how, during language learning, to simultaneously provide the user with the evaluation result of the comparison against the teaching example speech and the evaluation result of the comparison against the standard speech predicted by a speech-prediction model, so as to help the user fully understand his or her own learning.

To this end, the present invention provides a speech evaluation method for evaluating a user's pronunciation during language learning, characterized in that it comprises:
Step S101: acquiring the user's speech input through the recording device of a speech evaluation apparatus;

Step S102: dividing the recorded speech into basic speech units to obtain the speech-unit sequence of the recorded speech;

Step S103: performing feature extraction on the speech-unit sequence to obtain the temperament features of the speech-unit sequence;

Step S104: comparing the extracted temperament features with the teaching example speech and with the standard speech predicted by a speech-prediction model, respectively;

Step S105: marking the speech comparison results on the text of the user's speech.
The basic speech units may be syllables, phonemes or the like; by dividing the recorded speech, the basic speech units and the speech-unit sequence of the recorded speech are obtained.

The temperament features comprise prosodic features and syllable features. The prosodic features comprise the boundary features of each basic speech unit, its pronunciation duration, the pause time between adjacent basic speech units, and the pronunciation duration of the whole speech-unit sequence; the syllable features comprise the pronunciation of each basic speech unit and the pronunciation of the whole speech-unit sequence.
The comparative analysis against the teaching example speech comprises:

acquiring the teaching example speech stored in the system;

dividing the teaching example speech into basic speech units to obtain its basic speech units and speech-unit sequence;

extracting the temperament features of the teaching speech-unit sequence, these features corresponding to the temperament features of the user's speech-unit sequence;

comparing the temperament features of the user's speech-unit sequence with those of the teaching speech-unit sequence and giving a corresponding evaluation result.
The speech evaluation using the speech-prediction model comprises:

dividing the recorded user speech into basic speech units and extracting the temperament features to be assessed from the speech-unit sequence;

loading, for each temperament feature, the corresponding prediction model and predicting the corresponding standard pronunciation;

comparing the temperament features of the user's speech with those of the standard pronunciation to obtain a corresponding evaluation result.
The marking of the speech comparison results specifically comprises:

converting the recorded user speech into speech text;

marking the obtained evaluation result of the comparison against the teaching example speech and the evaluation result of the comparison against the standard speech predicted by the speech-prediction model on the speech text in a visualized manner, respectively, and displaying them to the user.
The present invention further provides a speech evaluation apparatus comprising a recording module, a storage module, a speech processing module, a feature extraction module, a speech analysis module, an evaluation module, an annotation module and a display module, characterized in that:

the recording module is configured to acquire the user's speech input;

the speech processing module is configured to divide the recorded speech into basic speech units to obtain the speech-unit sequence of the recorded speech;

the feature extraction module performs feature extraction on the speech-unit sequence to obtain its temperament features;

the speech analysis module compares the extracted temperament features with the teaching example speech and with the standard speech predicted by the speech-prediction model, respectively;

the annotation module marks the speech evaluation results on the text of the user's speech.

The speech evaluation apparatus further comprises a display module configured to display to the user the speech text annotated with the evaluation results.

By simultaneously providing the user with the evaluation result against the teaching example speech and the evaluation result against the standard speech predicted by the speech-prediction model, the speech evaluation method and apparatus of the present invention let the user fully understand his or her own pronunciation and improve its accuracy.
附图说明DRAWINGS
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art may obtain other drawings from the content of the embodiments and these drawings without creative effort.

Fig. 1 is a flowchart of a speech evaluation method according to an embodiment of the present invention; and

Fig. 2 is a structural diagram of a speech evaluation apparatus according to an embodiment of the present invention.
Detailed Description of the Embodiments
Before discussing the exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart describes the operations as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be rearranged. A process may be terminated when its operations are completed, but may also have additional steps not included in the figure.

"Speech evaluation apparatus" in this context means a "computer device", i.e. an intelligent electronic device that can carry out predetermined processing such as numerical computation and/or logical computation by running predetermined programs or instructions. It may comprise a processor and a memory, the processor executing instructions pre-stored in the memory to perform the predetermined processing; or the predetermined processing may be performed by hardware such as an ASIC, FPGA or DSP, or by a combination of the two.

The computer device includes user equipment and/or network equipment. The user equipment includes, but is not limited to, computers, smartphones, PDAs and the like; the network equipment includes, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of computers or network servers based on cloud computing, where cloud computing is a form of distributed computing: a super virtual computer composed of a group of loosely coupled computers. The computer device may operate alone to implement the present invention, or may access a network and implement the present invention by interacting with other computer devices in the network. The network in which the computer device is located includes, but is not limited to, the Internet, wide area networks, metropolitan area networks, local area networks, VPN networks and the like.

Those skilled in the art should understand that the "speech evaluation apparatus" described in the present invention may be the user equipment alone, i.e. the corresponding operations are performed by the user equipment; or it may be composed of the user equipment integrated with a network device or server, i.e. the user equipment cooperates with the network device to perform the corresponding operations.

It should be noted that the user equipment, network equipment, networks and so on are merely examples; other existing or future computer devices or networks, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.

Those skilled in the art should also understand that the present invention is applicable to both mobile and non-mobile terminals; for example, whether the user uses a mobile phone or a PC, the method or apparatus of the present invention can be used for provision and presentation.

The specific structural and functional details disclosed herein are merely representative and are for the purpose of describing exemplary embodiments of the present invention. The present invention may, however, be embodied in many alternative forms and should not be construed as limited only to the embodiments set forth herein.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments. As used herein, the singular forms "a" and "an" are intended to include the plural as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprising" and/or "including", as used herein, specify the presence of the stated features, integers, steps, operations, units and/or components, without excluding the presence or addition of one or more other features, integers, steps, operations, units, components and/or combinations thereof.

It should also be noted that in some alternative implementations the functions/acts mentioned may occur in an order different from that indicated in the drawings. For example, two figures shown in succession may in fact be executed substantially concurrently, or may sometimes be executed in the reverse order, depending on the functions/acts involved.

The present invention is described in further detail below with reference to the accompanying drawings.

Fig. 1 shows a flowchart of the speech evaluation method of the present invention.
In step S101, during the spoken follow-up stage of language learning, the user's speech input is recorded by the recording device of the speech evaluation apparatus.

Specifically, after studying the speech example in the teaching courseware, the user enters the follow-up stage, which triggers the recording device of the speech evaluation apparatus into the recording state. When the user begins to repeat the speech example, the recording device starts recording the user's speech and saves the follow-up speech in the storage module of the speech evaluation apparatus for further analysis.

In step S102, the user's follow-up speech recorded in the storage module is retrieved and divided into basic speech units, yielding the speech-unit sequence of the recorded speech.

The basic speech units may be syllables, phonemes or the like; by dividing the recorded speech, the basic speech units and the speech-unit sequence of the recorded speech are obtained.

Different speech recognition systems decode the speech signal using different acoustic features, such as acoustic models based on MFCC (Mel-Frequency Cepstrum Coefficient) features or on PLP (Perceptual Linear Predictive) features; using different acoustic models, such as HMM-GMM (Hidden Markov Model-Gaussian Mixture Model) or neural-network acoustic models based on DBN (Dynamic Bayesian Network); or using different decoding methods, such as Viterbi search or A* search.
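By way of illustration only, and not as part of the claimed subject matter, the following toy sketch shows the kind of Viterbi search named above. The state count, probabilities and all names are invented for this example; a real recognizer searches over HMM states scored by MFCC-based acoustic models and a language model.

```python
import numpy as np

# Toy Viterbi decoder over log-probabilities (illustrative only).
def viterbi(log_init, log_trans, log_emit):
    """log_init: (S,) initial log-probs; log_trans: (S, S) transition
    log-probs (row = previous state); log_emit: (T, S) emission log-probs.
    Returns the most likely state path as a list of state indices."""
    T, S = log_emit.shape
    score = log_init + log_emit[0]          # best score ending in each state
    back = np.zeros((T, S), dtype=int)      # backpointers
    for t in range(1, T):
        cand = score[:, None] + log_trans   # cand[prev, cur]
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(S)] + log_emit[t]
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):           # follow backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]

path = viterbi(np.log([0.6, 0.4]),
               np.log([[0.7, 0.3], [0.4, 0.6]]),
               np.log([[0.9, 0.2], [0.1, 0.8], [0.2, 0.7]]))
print(path)
```

The same argmax recursion applies unchanged whichever acoustic feature or model supplies the emission scores.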
In step S103, feature extraction is performed on the speech-unit sequence to obtain the temperament features of the speech-unit sequence.

The temperament features comprise prosodic features and syllable features. The prosodic features comprise the boundary features of each basic speech unit, its pronunciation duration, the pause time between adjacent basic speech units, and the pronunciation duration of the whole speech-unit sequence; the syllable features comprise the pronunciation of each basic speech unit and the pronunciation of the whole speech-unit sequence.
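As a non-limiting illustration of the prosodic features just listed, the sketch below derives per-unit durations, inter-unit pauses and the whole-sequence duration from time-aligned units. The (label, start, end) representation and all names are assumptions of this example, not taken from the embodiment.

```python
# Illustrative sketch of the prosodic-feature computation described above.
# A "unit" is assumed to be a (label, start_s, end_s) triple produced by the
# segmentation step; the numbers below are invented example timings.

def prosodic_features(units):
    durations = [end - start for _, start, end in units]   # per-unit duration
    pauses = [units[i + 1][1] - units[i][2]                # pause to next unit
              for i in range(len(units) - 1)]
    total = units[-1][2] - units[0][1]                     # whole-sequence duration
    return {"durations": durations, "pauses": pauses, "total": total}

user_units = [("syl1", 0.0, 0.35), ("syl2", 0.4, 0.8), ("syl3", 1.0, 1.6)]
feats = prosodic_features(user_units)
print(feats)
```

Boundary features would likewise be read off the start/end times of each unit.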
In step S104, the extracted temperament features are compared with the teaching example speech and with the standard speech predicted by the speech-prediction model, respectively.

The comparative analysis against the teaching example speech proceeds as follows: acquire the teaching example speech stored in the system, divide it into basic speech units to obtain its basic speech units and speech-unit sequence, and further extract the temperament features of the teaching speech-unit sequence, these features corresponding to the temperament features of the user's speech-unit sequence. The temperament features of the user's speech-unit sequence are then compared with those of the teaching speech-unit sequence and a corresponding evaluation result is given.
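A minimal sketch of this user-versus-teacher comparison, under the assumption (not fixed by the embodiment) that corresponding per-unit durations are compared by relative deviation against a tolerance:

```python
# Hypothetical comparison of user vs. teacher temperament features; the
# relative-deviation metric and the 25% tolerance are invented for this
# example, not prescribed by the embodiment.

def compare_durations(user_durs, teacher_durs, tolerance=0.25):
    """Flag each unit as consistent if its duration deviates from the
    corresponding teacher unit by at most `tolerance` (relative)."""
    flags = [abs(u - t) / t <= tolerance
             for u, t in zip(user_durs, teacher_durs)]
    score = 100.0 * sum(flags) / len(flags)   # share of consistent units
    return flags, score

flags, score = compare_durations([0.35, 0.40, 0.90], [0.30, 0.42, 0.55])
print(flags, round(score, 1))
```

Other temperament features (pauses, per-unit pronunciation scores) could be compared unit-by-unit in the same corresponding fashion.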
The speech evaluation using the speech-prediction model may employ existing speech evaluation technology: divide the recorded user speech into basic speech units, extract the temperament features to be assessed from the speech-unit sequence, load the corresponding prediction model for each temperament feature to predict the corresponding standard pronunciation, and then compare the temperament features of the user's speech with those of the standard pronunciation to obtain a corresponding evaluation result.
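The step "load the corresponding prediction model for each temperament feature" can be sketched as follows. Each "model" here is merely a stand-in callable with invented values; a real system would load a trained prediction model per feature type.

```python
# Stand-in "prediction models" keyed by temperament feature (hypothetical;
# every label and value below is invented for illustration).
predictors = {
    "duration": lambda label: {"syl1": 0.30, "syl2": 0.42, "syl3": 0.55}[label],
}

def evaluate_feature(labels, user_values, feature, tolerance=0.25):
    """Compare the user's values against the predicted standard values for
    one temperament feature; flag units within `tolerance` (relative)."""
    predict = predictors[feature]
    report = []
    for label, got in zip(labels, user_values):
        std = predict(label)                       # predicted standard value
        report.append((label, got, std, abs(got - std) / std <= tolerance))
    return report

report = evaluate_feature(["syl1", "syl2", "syl3"], [0.35, 0.40, 0.90], "duration")
print(report)
```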
In step S105, the speech comparison results are marked on the text of the user's speech and provided to the user.

In this step, the speech processing module further converts the recorded user speech into speech text. The evaluation result of the comparison against the teaching example speech and the evaluation result of the comparison against the standard speech predicted by the speech-prediction model, both obtained in step S104, are marked on the speech text in a visualized manner, respectively, and displayed to the user. From the displayed results the user can see how his or her pronunciation differs from that of the teaching example and from the standard speech predicted by the speech-prediction model, gaining a complete picture of the pronunciation problems in the read text; this helps the user make his or her pronunciation more standard. The comparison results may include pronunciation evaluation of each basic speech unit, pronunciation-duration evaluation of each basic speech unit, full-text fluency evaluation and the like.
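The visualized marking can be sketched as below; the markup symbols are invented purely for illustration ("*" marks a unit that differs from the teaching example, "#" one that differs from the predicted standard), and a real apparatus would render this graphically.

```python
# Illustrative text annotation combining both evaluation outcomes.
def annotate(words, teacher_ok, model_ok):
    """words: transcript tokens; teacher_ok/model_ok: per-word consistency
    flags from the two comparisons. Returns the annotated transcript."""
    out = []
    for w, t, m in zip(words, teacher_ok, model_ok):
        out.append(w + ("" if t else "*") + ("" if m else "#"))
    return " ".join(out)

text = annotate(["good", "morning"], [True, False], [False, False])
print(text)
```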
Fig. 2 shows a speech evaluation apparatus according to an embodiment of the present invention. The apparatus is used to implement the speech evaluation method of the present invention: after the user performs the spoken follow-up, it simultaneously provides the user with the evaluation result against the teaching example speech and the evaluation result against the standard speech predicted by the speech-prediction model. The apparatus comprises a recording module 1, a storage module 2, a speech processing module 3, a feature extraction module 4, a speech analysis module 5, an annotation module 6 and a display module 7.

During the spoken follow-up stage of language learning, the user's speech input is recorded by the recording module 1 of the speech evaluation apparatus.

Specifically, after studying the speech example in the teaching courseware, the user enters the follow-up stage and triggers the recording module 1 of the speech evaluation apparatus into the recording state. When the user begins to repeat the speech example, the recording module 1 starts recording the user's speech and saves the follow-up speech in the storage module 2 of the speech evaluation apparatus for further analysis.

The speech processing module 3 retrieves the user's recorded follow-up speech from the storage module 2 and divides it into basic speech units.

The basic speech units may be syllables, phonemes or the like; by dividing the recorded speech, the basic speech units and the speech-unit sequence of the recorded speech are obtained.

After the speech processing module 3 has divided the recorded speech into basic speech units, the feature extraction module 4 performs feature extraction on the resulting speech-unit sequence to obtain its temperament features.

The temperament features comprise prosodic features and syllable features. The prosodic features comprise the boundary features of each basic speech unit, its pronunciation duration, the pause time between adjacent basic speech units, and the pronunciation duration of the whole speech-unit sequence; the syllable features comprise the pronunciation of each basic speech unit and the pronunciation of the whole speech-unit sequence.

The speech analysis module 5 compares the extracted temperament features with the teaching example speech and with the standard speech predicted by the speech-prediction model, respectively.

The comparative analysis against the teaching example speech proceeds as follows: the speech analysis module 5 acquires the teaching example speech stored in the storage module 2, divides it into basic speech units to obtain its basic speech units and speech-unit sequence, and further extracts the temperament features of the teaching speech-unit sequence, these features corresponding to the temperament features of the user's speech-unit sequence. The temperament features of the user's speech-unit sequence are then compared with those of the teaching speech-unit sequence and a corresponding evaluation result is given.

The speech evaluation using the speech-prediction model may employ existing speech evaluation technology: divide the recorded user speech into basic speech units, extract the temperament features to be assessed from the speech-unit sequence, load the corresponding prediction model for each temperament feature to predict the corresponding standard pronunciation, and then compare the temperament features of the user's speech with those of the standard pronunciation to obtain a corresponding evaluation result.

The annotation module 6 marks the comparison results on the user's speech and provides them to the user through the display module 7.

Specifically, the speech processing module 3 further converts the recorded user speech into speech text. The evaluation result of the comparison against the teaching example speech and the evaluation result of the comparison against the standard speech predicted by the speech-prediction model, both produced by the speech analysis module 5, are marked on the speech text in a visualized manner, respectively, and displayed to the user through the display module 7. From the displayed results the user can see how his or her pronunciation differs from that of the teaching example and from the standard speech predicted by the speech-prediction model, gaining a complete picture of the pronunciation problems in the read text; this helps the user make his or her pronunciation more standard. The comparison results may include pronunciation evaluation of each basic speech unit, pronunciation-duration evaluation of each basic speech unit, full-text fluency evaluation and the like.
Those of ordinary skill in the art will understand that all or some of the steps of the various methods of the above embodiments may be carried out by a computer program instructing the relevant hardware; the program may be stored in a computer-readable storage medium and executed by a processor. The computer-readable storage medium may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc and the like.

The preferred embodiments of the present invention have been described above in order to make the spirit of the present invention clearer and easier to understand, not to limit it; any modification, substitution or improvement made within the spirit and principles of the present invention shall fall within the scope of protection defined by the appended claims.

Industrial Applicability

By simultaneously providing the user with the evaluation result against the teaching example speech and the evaluation result against the standard speech predicted by the speech-prediction model, the speech evaluation method and apparatus of the present invention let the user fully understand his or her own pronunciation and improve its accuracy.

Claims (15)

  1. A speech evaluation method for evaluating a user's pronunciation during language learning, characterized in that it comprises:
    Step S101: acquiring the user's speech input through the recording device of a speech evaluation apparatus;
    Step S102: dividing the recorded speech into basic speech units to obtain the speech-unit sequence of the recorded speech;
    Step S103: performing feature extraction on the speech-unit sequence to obtain the temperament features of the speech-unit sequence;
    Step S104: comparing the extracted temperament features with the teaching example speech and with the standard speech predicted by a speech-prediction model, respectively;
    Step S105: marking the speech comparison results on the text of the user's speech.
  2. The speech evaluation method according to claim 1, characterized in that:
    the basic speech units may be syllables, phonemes or the like, the basic speech units and the speech-unit sequence of the recorded speech being obtained by dividing the recorded speech.
  3. The speech evaluation method according to claim 1, characterized in that:
    the temperament features comprise prosodic features and syllable features, the prosodic features comprising the boundary features of each basic speech unit, its pronunciation duration, the pause time between adjacent basic speech units, and the pronunciation duration of the whole speech-unit sequence;
    the syllable features comprise the pronunciation of each basic speech unit and the pronunciation of the whole speech-unit sequence.
  4. The speech evaluation method according to claim 1, characterized in that:
    the comparative analysis against the teaching example speech comprises:
    acquiring the teaching example speech stored in the system;
    dividing the teaching example speech into basic speech units to obtain its basic speech units and speech-unit sequence;
    extracting the temperament features of the teaching speech-unit sequence, these features corresponding to the temperament features of the user's speech-unit sequence;
    comparing the temperament features of the user's speech-unit sequence with those of the teaching speech-unit sequence and giving a corresponding evaluation result.
  5. The speech evaluation method according to claim 1, characterized in that:
    the speech evaluation using the speech-prediction model comprises:
    dividing the recorded user speech into basic speech units and extracting the temperament features to be assessed from the speech-unit sequence;
    loading, for each temperament feature, the corresponding prediction model and predicting the corresponding standard pronunciation;
    comparing the temperament features of the user's speech with those of the standard pronunciation to obtain a corresponding evaluation result.
  6. The speech evaluation method according to claim 1, characterized in that:
    the marking of the speech comparison results specifically comprises:
    converting the recorded user speech into speech text;
    marking the obtained evaluation result of the comparison against the teaching example speech and the evaluation result of the comparison against the standard speech predicted by the speech-prediction model on the speech text in a visualized manner, respectively, and displaying them to the user.
  7. A speech evaluation apparatus comprising a recording module, a storage module, a speech processing module, a feature extraction module, a speech analysis module and an annotation module, characterized in that:
    the recording module is configured to acquire the user's speech input;
    the speech processing module is configured to divide the recorded speech into basic speech units to obtain the speech-unit sequence of the recorded speech;
    the feature extraction module performs feature extraction on the speech-unit sequence to obtain its temperament features;
    the speech analysis module compares the extracted temperament features with the teaching example speech and with the standard speech predicted by the speech-prediction model, respectively;
    the annotation module marks the speech evaluation results on the text of the user's speech.
  8. The speech evaluation apparatus according to claim 7, wherein:
    the basic speech units may be syllables, phonemes, or the like, and dividing the recorded speech yields the basic speech units and the speech unit sequence of the recorded speech.
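Claim 8 leaves the segmentation granularity open (syllables, phonemes, or the like). Below is a toy energy-threshold segmenter over a frame-level amplitude envelope; the threshold value and the envelope representation are assumptions for illustration only, not the claimed segmentation method.

```python
# Hypothetical sketch: split an amplitude envelope into voiced segments,
# treating each voiced stretch as one "basic speech unit".

def split_basic_units(envelope, threshold=0.1):
    """Return (start_frame, end_frame) pairs for each voiced segment.

    The ordered list of pairs is the speech unit sequence of the recording.
    """
    units, start = [], None
    for i, amp in enumerate(envelope):
        if amp > threshold and start is None:
            start = i                      # a unit begins
        elif amp <= threshold and start is not None:
            units.append((start, i))       # the unit ends at silence
            start = None
    if start is not None:                  # close a unit running to the end
        units.append((start, len(envelope)))
    return units

seq = split_basic_units([0.0, 0.5, 0.6, 0.0, 0.0, 0.4, 0.3, 0.0])
```

Production systems typically use forced alignment against the expected text rather than raw energy thresholds, but the output shape (an ordered unit sequence with boundaries) is the same.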
  9. The speech evaluation apparatus according to claim 7, wherein:
    the temperament features comprise prosody features and syllable features; the prosody features include the boundary features and pronunciation duration of each basic speech unit, the pause time between adjacent basic speech units, and the pronunciation duration of the entire speech unit sequence; the syllable features include the pronunciation of each basic speech unit and the pronunciation of the entire speech unit sequence.
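The prosody features listed in claim 9 follow directly from the unit boundary timestamps. This sketch derives per-unit durations, inter-unit pauses, and total utterance duration from (start, end) times in seconds; the example timestamps are made up for illustration.

```python
# Hypothetical sketch: compute claim-9 prosody features from unit boundaries.

def prosody_features(boundaries):
    """boundaries: ordered list of (start_s, end_s) per basic speech unit."""
    durations = [end - start for start, end in boundaries]
    pauses = [boundaries[i + 1][0] - boundaries[i][1]
              for i in range(len(boundaries) - 1)]
    total = boundaries[-1][1] - boundaries[0][0]
    return {"durations": durations, "pauses": pauses, "total": total}

feats = prosody_features([(0.0, 0.3), (0.4, 0.8), (0.9, 1.2)])
```

The syllable features (the pronunciation of each unit) would come from the acoustic content of each segment rather than its timing, and are not shown here.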
  10. The speech evaluation apparatus according to claim 7, wherein:
    the process of comparative analysis against the teaching example speech comprises:
    obtaining the teaching example speech stored in the system;
    dividing the teaching example speech into basic speech units to obtain the basic speech units and the speech unit sequence of the teaching example speech;
    extracting the temperament features of the teaching speech unit sequence, the temperament features of the teaching speech unit sequence corresponding to the temperament features of the user's speech unit sequence; and
    comparing the temperament features of the user's speech unit sequence with the temperament features of the teaching speech unit sequence and giving a corresponding evaluation result.
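The comparison step above can be sketched as a per-feature relative-deviation check against the teaching example. The 15% tolerance, the feature names, and the two-band "good"/"retry" verdict are assumptions chosen for the example, not values from the patent.

```python
# Hypothetical sketch: grade user temperament features against the matching
# features extracted from the teaching example speech.

def grade(user_feats, teacher_feats, tolerance=0.15):
    """Return a per-feature verdict: 'good' within tolerance, else 'retry'."""
    report = {}
    for name, ref in teacher_feats.items():
        user = user_feats[name]
        # Relative deviation; fall back to absolute when the reference is 0.
        deviation = abs(user - ref) / ref if ref else abs(user)
        report[name] = "good" if deviation <= tolerance else "retry"
    return report

result = grade(
    {"unit_duration": 0.32, "pause": 0.20, "total": 1.4},
    {"unit_duration": 0.30, "pause": 0.10, "total": 1.3},
)
```

Because the teaching features were extracted with the same pipeline as the user's (claim 10's "corresponding" requirement), the two dictionaries share keys and the comparison is a simple aligned lookup.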
  11. The speech evaluation apparatus according to claim 7, wherein:
    the process of performing speech evaluation with the speech prediction model comprises:
    dividing the recorded user speech into basic speech units and extracting the temperament features to be evaluated from the speech unit sequence;
    loading, for each distinct temperament feature, the corresponding prediction model and predicting the corresponding standard pronunciation; and
    comparing the temperament features of the user's speech with the temperament features of the standard pronunciation to obtain a corresponding evaluation result.
  12. The speech evaluation apparatus according to claim 7, wherein:
    the process of annotating the speech comparison results specifically comprises:
    converting the recorded user speech into a speech transcript; and
    visually annotating, on the speech transcript, the evaluation result of the comparison against the teaching example speech and the evaluation result of the comparison against the standard speech predicted by the speech prediction model, and displaying the annotated transcript to the user.
  13. The speech evaluation apparatus according to claim 7, wherein:
    the speech evaluation apparatus further comprises a display module configured to display to the user the user speech transcript annotated with the speech evaluation results.
  14. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method steps of any one of claims 1-6.
  15. A computer storage medium storing a program executable by a computer, wherein the method steps of any one of claims 1-6 are implemented when the program is executed.
PCT/CN2017/111822 2017-10-20 2017-11-20 Voice evaluation method and apparatus WO2019075828A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710996819.1A CN109697988B (en) 2017-10-20 2017-10-20 Voice evaluation method and device
CN201710996819.1 2017-10-20

Publications (1)

Publication Number Publication Date
WO2019075828A1 (en) 2019-04-25

Family

ID=66172985

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/111822 WO2019075828A1 (en) 2017-10-20 2017-11-20 Voice evaluation method and apparatus

Country Status (2)

Country Link
CN (1) CN109697988B (en)
WO (1) WO2019075828A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111081080B (en) * 2019-05-29 2022-05-03 广东小天才科技有限公司 Voice detection method and learning device
CN110534100A (en) * 2019-08-27 2019-12-03 北京海天瑞声科技股份有限公司 A kind of Chinese speech proofreading method and device based on speech recognition
CN110910687A (en) * 2019-12-04 2020-03-24 深圳追一科技有限公司 Teaching method and device based on voice information, electronic equipment and storage medium
CN112767932A (en) * 2020-12-11 2021-05-07 北京百家科技集团有限公司 Voice evaluation system, method, device, equipment and computer readable storage medium
CN113192494A (en) * 2021-04-15 2021-07-30 辽宁石油化工大学 Intelligent English language identification and output system and method

Citations (6)

Publication number Priority date Publication date Assignee Title
US20040006468A1 (en) * 2002-07-03 2004-01-08 Lucent Technologies Inc. Automatic pronunciation scoring for language learning
CN1750121A (en) * 2004-09-16 2006-03-22 北京中科信利技术有限公司 A kind of pronunciation evaluating method based on speech recognition and speech analysis
CN101739870A (en) * 2009-12-03 2010-06-16 深圳先进技术研究院 Interactive language learning system and method
CN103928023A (en) * 2014-04-29 2014-07-16 广东外语外贸大学 Voice scoring method and system
CN104732977A (en) * 2015-03-09 2015-06-24 广东外语外贸大学 On-line spoken language pronunciation quality evaluation method and system
US20150287339A1 (en) * 2014-04-04 2015-10-08 Xerox Corporation Methods and systems for imparting training

Family Cites Families (8)

Publication number Priority date Publication date Assignee Title
US20060057545A1 (en) * 2004-09-14 2006-03-16 Sensory, Incorporated Pronunciation training method and apparatus
CN101246685B (en) * 2008-03-17 2011-03-30 清华大学 Pronunciation quality evaluation method of computer auxiliary language learning system
CN103514765A (en) * 2013-10-28 2014-01-15 苏州市思玛特电力科技有限公司 Language teaching assessment method
CN103559894B (en) * 2013-11-08 2016-04-20 科大讯飞股份有限公司 Oral evaluation method and system
CN203773766U (en) * 2014-04-10 2014-08-13 滕坊坪 Language learning machine
CN105825852A (en) * 2016-05-23 2016-08-03 渤海大学 Oral English reading test scoring method
CN106971647A (en) * 2017-02-07 2017-07-21 广东小天才科技有限公司 A kind of Oral Training method and system of combination body language
CN107067834A (en) * 2017-03-17 2017-08-18 麦片科技(深圳)有限公司 Point-of-reading system with oral evaluation function


Also Published As

Publication number Publication date
CN109697988B (en) 2021-05-14
CN109697988A (en) 2019-04-30


Legal Events

Code Description
121: EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 17929286; Country of ref document: EP; Kind code of ref document: A1)
NENP: Non-entry into the national phase (Ref country code: DE)
122: EP: PCT application non-entry in European phase (Ref document number: 17929286; Country of ref document: EP; Kind code of ref document: A1)