Background
Language is an essential communication tool in life and work, and spoken-language learning matters to people both during their school years and in their working lives. Because online teaching is not restricted by time or place, it has become popular with a wide range of users as it has spread, and many users therefore prefer to use their spare time to learn languages online.
In current online teaching, pronunciation practice typically takes one of three forms: the video (or audio) leaves a pause after a voice sample is played and the user practices reading along; or the user's reading is recorded and played back so that the user can judge for himself or herself whether the pronunciation is accurate; or a teacher teaches online and gives guidance and suggestions on the student's pronunciation. The first two modes cannot give targeted guidance based on the student's actual pronunciation, so the learning effect is poor, while online teaching by a teacher requires a large amount of manpower, material, and financial resources.
To solve the above problem, it has been proposed to evaluate a learner's speech with a voice prediction model. CN101197084A discloses an automatic spoken-English evaluation and learning system whose spoken-pronunciation detection part comprises the following steps. (1) Establishing a standard-pronunciation corpus: 1) finding standard English speakers; 2) designing a first recording text according to the requirements of spoken-English learning and the principle of phoneme balance; 3) recording the standard speakers against the recording text. (2) Collecting a spoken-evaluation corpus: in a simulated English-learning software environment, designing a second recording text according to the English-learning requirements, finding ordinary speakers, and recording their spoken pronunciation. (3) Labeling the spoken-evaluation corpus: experts annotate in detail whether the pronunciation of each phoneme in each word is correct. (4) Establishing a standard acoustic model: training an acoustic model of standard speech on the recordings in the standard-pronunciation corpus and their associated texts. (5) Calculating error-detection parameters of the speech: 1) extracting Mel-cepstral coefficient parameters of the speech; 2) based on the standard acoustic model and on the ordinary-speaker recordings in the evaluation corpus together with the phoneme sequences of their texts, automatically cutting the ordinary-speaker speech into phoneme-sized segments, and computing from the standard model a first likelihood value of each segment being that phoneme; 3) recognizing each segment of the ordinary-speaker speech with the standard acoustic model and computing a second likelihood value of the segment being the recognized phoneme; 4) dividing the first likelihood value by the second to obtain the likelihood ratio of the segment, which serves as its error-detection parameter. (6) Establishing an error-detection mapping model from the error-detection parameters to the experts' pronunciation-error labels: on a batch of evaluation speech, correlating the evaluation parameters and formant sequence of each segment with the experts' detailed labels, obtaining the correspondence between the parameters and the labels by statistical methods, and storing it as the error-detection mapping model.
CN101650886A discloses a method for automatically detecting the reading errors of language learners, comprising the following steps: 1) front-end processing: preprocessing the input speech and extracting features, the extracted features being MFCC feature vectors; 2) constructing a simplified search space: taking the content the user is to read as the reference answer and constructing a simplified search space from the reference answer, a pronunciation dictionary, a multi-pronunciation model, and an acoustic model; 3) constructing a reading language model: building the user's reading language model from the reference answer, the language model describing the context and probability information of what the user may actually say when reading the reference sentence; 4) searching: in the search space, searching with the acoustic model, the reading language model, and the multi-pronunciation model for the path that best matches the input feature-vector stream, and taking that path as the content the user actually read, i.e., the recognition result sequence; 5) alignment: aligning the reference answer with the recognition result to detect the user's extra, missing, and misread words.
In the prior art, a speech recognition system obtains the voice segment corresponding to each basic voice unit in a voice signal, fuses the obtained segments into an effective voice-segment sequence, extracts evaluation features from that sequence, and loads a scoring prediction model corresponding to the feature type of each evaluation feature; the similarity between the evaluation features and the scoring prediction model is then computed and taken as the score of the voice signal. In actual language learning, however, a user usually learns pronunciation from the teacher's voice example in a teaching video (or audio), and because of individual variation the teacher's example cannot be fully consistent with the standard pronunciation predicted by a voice prediction model. When the user's pronunciation is evaluated by the prediction model alone, the predicted standard pronunciation often differs from the teaching example in aspects such as tone and rhythm, so the evaluation result reflects only the comparison between the user's voice and the predicted voice and cannot truly reflect the comparison between the user's voice and the teaching voice example.
Therefore, it is necessary to provide a voice evaluation method that gives both an evaluation result produced by a voice prediction model and an evaluation result obtained by comparison with the teaching voice example, so that the user can understand his or her learning situation comprehensively.
Disclosure of Invention
The technical problem to be solved by the invention is therefore how, during language learning, to simultaneously provide the user with an evaluation result compared against the teaching example voice and an evaluation result compared against the standard voice predicted by a voice prediction model, so as to help the user understand his or her own learning situation comprehensively.
To this end, the invention provides a voice evaluation method for evaluating a user's pronunciation during language learning, characterized by the following steps:
step S101, acquiring the user's voice input through a recording device of the voice evaluation apparatus;
step S102, dividing the recorded voice into basic voice units to obtain the voice unit sequence of the recorded voice;
step S103, performing feature extraction on the voice unit sequence to obtain its prosodic features;
step S104, comparing the extracted prosodic features with the teaching example voice and with the standard voice predicted by a voice prediction model, respectively;
and step S105, marking the voice comparison results on the user's voice text.
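The flow of steps S101 to S105 can be sketched as a minimal pipeline. Every function name and return value below is a hypothetical stand-in, since the invention does not prescribe a concrete API; the sketch only shows how the steps hand data to one another.

```python
# Hypothetical sketch of steps S101-S105; none of these names come from
# the invention itself, they only illustrate the data flow.

def record_user_speech():                 # S101: acquire the user's voice input
    return "raw-audio"                    # stand-in for recorded audio data

def split_into_units(audio):              # S102: basic voice-unit division
    return ["unit1", "unit2"]             # stand-in voice unit sequence

def extract_prosodic_features(units):     # S103: prosodic feature extraction
    return {u: {"duration": 0.2} for u in units}

def compare(features, reference):         # S104: compare against a reference
    return {u: "ok" for u in features}    # stand-in per-unit evaluation

def annotate(text, teacher_result, model_result):  # S105: mark on the text
    return {"text": text, "teacher": teacher_result, "model": model_result}

audio = record_user_speech()
feats = extract_prosodic_features(split_into_units(audio))
report = annotate("hello world",
                  compare(feats, "teaching-example-voice"),
                  compare(feats, "model-predicted-voice"))
```

The two `compare` calls mirror the invention's two parallel comparisons: one against the teaching example voice and one against the model-predicted standard pronunciation.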
The basic voice unit may be a syllable, a phoneme, or the like; dividing the recorded voice yields its basic voice units and voice unit sequence.
The prosodic features include rhythm features and syllable features. The rhythm features include the boundary feature and pronunciation duration of each basic voice unit, the pause duration between adjacent basic voice units, and the pronunciation duration of the whole voice unit sequence; the syllable features include the pronunciation of each basic voice unit and the pronunciation of the whole voice unit sequence.
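One way to hold these per-unit and whole-sequence features in code is a small record per basic voice unit plus a derived sequence duration; the field names below are illustrative choices, not terms from the invention.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class UnitProsody:
    label: str          # the basic voice unit, e.g. a syllable or phoneme
    duration: float     # pronunciation duration of this unit, in seconds
    pause_after: float  # pause before the next unit, in seconds
    boundary: bool      # boundary feature of this unit

@dataclass
class SequenceProsody:
    units: List[UnitProsody] = field(default_factory=list)

    @property
    def total_duration(self) -> float:
        # pronunciation duration of the whole voice unit sequence,
        # counting the pauses between adjacent units
        return sum(u.duration + u.pause_after for u in self.units)
```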
The process of performing comparative analysis with the teaching example speech includes:
acquiring the teaching example voice stored in the system;
dividing the teaching example voice into basic voice units to obtain its basic voice units and voice unit sequence;
extracting the prosodic features of the teaching voice unit sequence, which correspond to the prosodic features of the user voice unit sequence;
and comparing the prosodic features of the user voice unit sequence with those of the teaching voice unit sequence and giving a corresponding evaluation result.
The process of evaluating the voice by using the voice prediction model comprises the following steps:
dividing the recorded user voice into basic voice units and extracting the prosodic features to be evaluated from the voice unit sequence;
loading a corresponding prediction model for each type of prosodic feature and predicting the corresponding standard pronunciation;
and comparing the prosodic features of the user voice with those of the standard pronunciation to obtain a corresponding evaluation result.
The process for labeling the voice comparison result specifically comprises the following steps:
converting the recorded user voice into a voice text;
and marking both the evaluation result of the comparison with the teaching example voice and the evaluation result of the comparison with the standard voice predicted by the voice prediction model on the voice text in a visual manner, and displaying them to the user.
The invention also provides a voice evaluation device, which comprises a recording module, a storage module, a voice processing module, a feature extraction module, a voice analysis module, an evaluation module, a labeling module and a display module, and is characterized in that:
the recording module is used for acquiring voice input of a user;
the voice processing module is used for dividing the recorded voice into basic voice units to obtain a voice unit sequence of the recorded voice;
the feature extraction module is used for performing feature extraction on the voice unit sequence to obtain its prosodic features;
the voice analysis module is used for comparing the extracted prosodic features with the teaching example voice and with the standard voice predicted by the voice prediction model, respectively;
and the marking module is used for marking the voice evaluation result on the voice text of the user.
The voice evaluation device also comprises a display module which is used for displaying the user voice text with the voice evaluation result label to the user.
The voice evaluation method and device provided by the invention simultaneously give the user the evaluation result of comparing the user's voice with the teaching example voice and the evaluation result of comparing it with the standard voice predicted by the voice prediction model, so that the user can fully understand his or her pronunciation and improve its accuracy.
Detailed Description
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel, concurrently, or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure.
In this context, the "voice evaluation device" is a "computer device": an intelligent electronic device that executes a predetermined processing procedure, such as numerical and/or logical computation, by running predetermined programs or instructions. It may include a processor and a memory, the processor executing instructions pre-stored in the memory to carry out the predetermined processing, or carrying it out by hardware such as an ASIC, FPGA, or DSP, or a combination thereof.
The computer device comprises user equipment and/or network equipment. The user equipment includes, but is not limited to, computers, smartphones, PDAs, and the like; the network equipment includes, but is not limited to, a single network server, a server group consisting of multiple network servers, or a cloud of numerous computers or network servers based on cloud computing, cloud computing being a form of distributed computing in which a loosely coupled set of computers forms a virtual supercomputer. The computer device may operate alone to implement the invention, or may access a network and implement the invention through interoperation with other computer devices in the network. The network in which the computer device is located includes, but is not limited to, the Internet, wide area networks, metropolitan area networks, local area networks, and VPNs.
Those skilled in the art should understand that the "voice evaluation device" described in the present invention may be only a user equipment, that is, the user equipment performs corresponding operations; or the user equipment and the network equipment or the server are integrated to form the system, namely the user equipment and the network equipment are matched to execute corresponding operations.
It should be noted that the user equipment, network equipment, and networks above are only examples, and that other existing or future computer devices or networks, if applicable to the present invention, should also fall within its scope of protection.
Those skilled in the art should understand that the present invention applies to both mobile and non-mobile terminals; for example, the method or device of the invention can be provided and presented whether the user uses a mobile phone or a PC.
Specific structural and functional details disclosed herein are merely representative and are provided for purposes of describing example embodiments of the present invention. The present invention may, however, be embodied in many alternate forms and should not be construed as limited to only the embodiments set forth herein.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be noted that, in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, two figures shown in succession may, in fact, be executed substantially concurrently, or the figures may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
The present invention is described in further detail below with reference to the attached drawing figures.
FIG. 1 shows a flow chart of a speech assessment method of the present invention.
In step S101, during the spoken follow-up reading stage of language learning, the user's voice input is recorded through the recording device of the voice evaluation apparatus.
Specifically, after studying the voice example in the teaching courseware, the user enters the follow-up reading stage, at which point the recording equipment in the voice evaluation device is triggered into a recording state. When the user starts to read after the voice example, the recording equipment records the user's voice, and the follow-up reading voice is stored in the storage module of the voice evaluation device for further analysis.
In step S102, the user follow-up reading voice recorded in the storage module is acquired, and basic voice unit division is performed on the recorded voice to obtain a voice unit sequence of the recorded user follow-up reading voice.
The basic voice unit may be a syllable, a phoneme, or the like; dividing the recorded voice yields its basic voice units and voice unit sequence.
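A production system would derive the unit boundaries from a recognizer (e.g. forced alignment against the text), but as a rough illustration the division can be approximated by cutting at low-energy pauses; the threshold and frame sizes below are arbitrary assumptions, not values from the invention.

```python
import numpy as np

def split_units_by_energy(signal, sr, frame_ms=25, hop_ms=10, thresh=0.01):
    """Cut speech into candidate units at low-energy (pause) frames.
    Crude stand-in for recognizer-driven basic voice-unit division."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n = 1 + max(0, (len(signal) - frame) // hop)
    energy = np.array([np.mean(signal[i * hop:i * hop + frame] ** 2)
                       for i in range(n)])
    voiced = energy > thresh
    units, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:            # a voiced run begins
            start = i * hop
        elif not v and start is not None:  # the run ends at a pause
            units.append((start, i * hop + frame))
            start = None
    if start is not None:
        units.append((start, len(signal)))
    return units                           # list of (begin, end) sample indices
```

On a signal containing two tones separated by silence, the function returns two (begin, end) spans, one per voiced stretch.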
Different speech recognition systems decode the speech signal based on different acoustic features, such as acoustic models based on MFCC (Mel-Frequency Cepstral Coefficient) features or on PLP (Perceptual Linear Prediction) features; on different acoustic models, such as the HMM-GMM (Hidden Markov Model-Gaussian Mixture Model) or the DBN (Dynamic Bayesian Network); or on different decoding methods, such as Viterbi search or A* search.
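As an illustration of the MFCC features mentioned above, the classic computation (pre-emphasis, windowed frames, power spectrum, mel filterbank, log, DCT) can be sketched in a few lines; real front ends such as HTK or Kaldi add liftering, energy, and delta features, and the parameter values here are common defaults rather than anything prescribed by the invention.

```python
import numpy as np

def mfcc(signal, sr, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Minimal MFCC sketch: one row of cepstral coefficients per frame."""
    # pre-emphasis to boost high frequencies
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # overlapping Hamming-windowed frames and their power spectra
    frames = np.lib.stride_tricks.sliding_window_view(sig, n_fft)[::hop]
    power = np.abs(np.fft.rfft(frames * np.hamming(n_fft), n_fft)) ** 2 / n_fft
    # triangular mel filterbank between 0 and the Nyquist frequency
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        fb[i, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fb[i, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    logmel = np.log(power @ fb.T + 1e-10)
    # DCT-II decorrelates the log-mel energies into cepstral coefficients
    k = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * k + 1) / (2 * n_mels)))
    return logmel @ dct.T
```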
In step S103, feature extraction is performed on the voice unit sequence to obtain its prosodic features.
The prosodic features include rhythm features and syllable features. The rhythm features include the boundary feature and pronunciation duration of each basic voice unit, the pause duration between adjacent basic voice units, and the pronunciation duration of the whole voice unit sequence; the syllable features include the pronunciation of each basic voice unit and the pronunciation of the whole voice unit sequence.
In step S104, the extracted prosodic features are compared with the teaching example voice and with the standard voice predicted by the voice prediction model, respectively.
The process of comparison with the teaching example voice is as follows: the teaching example voice stored in the system is acquired and divided into basic voice units to obtain the basic voice units and voice unit sequence of the teaching example voice; the prosodic features of the teaching voice unit sequence, which correspond to those of the user voice unit sequence, are then extracted. The prosodic features of the user voice unit sequence are compared with those of the teaching voice unit sequence, and a corresponding evaluation result is given.
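For the duration part of such a comparison, a minimal per-unit scoring might look like the following; the alignment between user units and teaching units is assumed given, and the tolerance and labels are illustrative choices only.

```python
def duration_scores(user_durations, teacher_durations, tol=0.25):
    """Compare each user unit's pronunciation duration with the teaching
    example's, assuming the two unit sequences are already aligned 1:1.
    A unit passes if its relative deviation is within `tol` (25% here)."""
    results = []
    for u, t in zip(user_durations, teacher_durations):
        rel = abs(u - t) / t
        results.append("good" if rel <= tol
                       else "too long" if u > t else "too short")
    return results
```

For example, `duration_scores([0.2, 0.5, 0.1], [0.2, 0.3, 0.2])` flags the second unit as dragged out and the third as clipped.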
Evaluation with the voice prediction model may adopt existing voice evaluation technology: the recorded user voice is divided into basic voice units, the prosodic features to be evaluated are extracted from the voice unit sequence, a corresponding prediction model is loaded for each type of prosodic feature to predict the corresponding standard pronunciation, and the prosodic features of the user voice are then compared with those of the standard pronunciation to obtain a corresponding evaluation result.
In step S105, the voice comparison results are marked on the user's voice text and provided to the user.
In this step, the recorded user voice is first converted into a voice text by the voice processing module. The evaluation result of the comparison with the teaching example voice and the evaluation result of the comparison with the standard voice predicted by the voice prediction model, both obtained in step S104, are then marked on the voice text in a visual manner and displayed to the user. From the displayed results the user can see how his or her pronunciation differs both from the teaching example and from the predicted standard voice, and thus gain a comprehensive picture of the problems in the pronunciation of the read text and make the pronunciation more standard. The comparison results may include the pronunciation evaluation of each basic voice unit, the pronunciation duration evaluation of each basic voice unit, a full-text fluency evaluation, and the like.
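A minimal text-only rendering of such an annotation could attach both results to each word of the transcript; a real display module would use colors or underlines instead, and the bracket format here is purely illustrative.

```python
def annotate_text(words, teacher_eval, model_eval):
    """Attach the teaching-example result and the prediction-model result
    to each word of the user's voice text (assumed aligned word-by-word)."""
    lines = []
    for w, t, m in zip(words, teacher_eval, model_eval):
        mark = "" if t == m == "good" else f"  [teacher: {t} | model: {m}]"
        lines.append(w + mark)
    return "\n".join(lines)
```

Words that pass both comparisons are left unmarked, so the user's attention goes straight to the units where either reference disagrees.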
Fig. 2 shows a voice evaluation device according to an embodiment of the invention. The device implements the above voice evaluation method and, after the user's spoken follow-up reading, simultaneously provides the user with the evaluation result of comparison with the teaching example voice and the evaluation result of comparison with the standard voice predicted by the voice prediction model. The voice evaluation device comprises a recording module 1, a storage module 2, a voice processing module 3, a feature extraction module 4, a voice analysis module 5, a labeling module 6, and a display module 7.
In the spoken follow-up reading stage of language learning, the user's voice input is recorded through the recording module 1 of the voice evaluation device.
Specifically, after studying the voice example in the teaching courseware, the user enters the follow-up reading stage, triggering the recording module 1 of the voice evaluation device into a recording state. When the user starts to read after the voice example, the recording module 1 records the user's voice and stores the follow-up reading voice in the storage module 2 of the voice evaluation device for further analysis.
The voice processing module 3 obtains the user follow-up reading voice recorded in the storage module 2, and performs basic voice unit division on the recorded voice.
The basic voice unit may be a syllable, a phoneme, or the like; dividing the recorded voice yields its basic voice units and voice unit sequence.
After the voice processing module 3 divides the recorded voice into basic voice units, the feature extraction module 4 performs feature extraction on the resulting voice unit sequence to obtain its prosodic features.
The prosodic features include rhythm features and syllable features. The rhythm features include the boundary feature and pronunciation duration of each basic voice unit, the pause duration between adjacent basic voice units, and the pronunciation duration of the whole voice unit sequence; the syllable features include the pronunciation of each basic voice unit and the pronunciation of the whole voice unit sequence.
The voice analysis module 5 compares the extracted prosodic features with the teaching example voice and with the standard voice predicted by the voice prediction model, respectively.
The process of comparison with the teaching example voice is as follows: the voice analysis module 5 acquires the teaching example voice stored in the storage module 2 and divides it into basic voice units, obtaining the basic voice units and voice unit sequence of the teaching example voice, and then extracts the prosodic features of the teaching voice unit sequence, which correspond to those of the user voice unit sequence. The prosodic features of the user voice unit sequence are compared with those of the teaching voice unit sequence, and a corresponding evaluation result is given.
Evaluation with the voice prediction model may adopt existing voice evaluation technology: the recorded user voice is divided into basic voice units, the prosodic features to be evaluated are extracted from the voice unit sequence, a corresponding prediction model is loaded for each type of prosodic feature to predict the corresponding standard pronunciation, and the prosodic features of the user voice are then compared with those of the standard pronunciation to obtain a corresponding evaluation result.
The labeling module 6 marks the voice comparison results on the user's voice text and provides them to the user through the display module 7.
Specifically, the recorded user voice is first converted into a voice text by the voice processing module 3. The evaluation result of the comparison with the teaching example voice, obtained by the voice analysis module 5, and the evaluation result of the comparison with the standard voice predicted by the voice prediction model are then marked on the voice text in a visual manner and displayed to the user through the display module 7. From the displayed results the user can see how his or her pronunciation differs both from the teaching example and from the predicted standard voice, and thus gain a comprehensive picture of the problems in the pronunciation of the read text and make the pronunciation more standard. The comparison results may include the pronunciation evaluation of each basic voice unit, the pronunciation duration evaluation of each basic voice unit, a full-text fluency evaluation, and the like.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware as instructed by a computer program, which may be stored on a computer readable storage medium and executed by a processor. The computer-readable storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
The foregoing describes preferred embodiments of the present invention. They are intended to illustrate the invention rather than limit it; all modifications, substitutions, and alterations falling within the spirit and scope of the invention as defined by the appended claims are included.