CN103928023B - A speech assessment method and system - Google Patents

A speech assessment method and system

Info

Publication number
CN103928023B
CN103928023B (application CN201410178813.XA; published as CN103928023A, granted as CN103928023B)
Authority
CN
China
Prior art keywords
voice
examination paper
scoring
scored
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201410178813.XA
Other languages
Chinese (zh)
Other versions
CN103928023A (en)
Inventor
李心广
李苏梅
何智明
陈泽群
李婷婷
陈广豪
马晓纯
王晓杰
陈嘉华
徐集优
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Foreign Studies
Original Assignee
Guangdong University of Foreign Studies
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Foreign Studies
Priority to CN201410178813.XA
Publication of CN103928023A
Application granted
Publication of CN103928023B
Legal status: Expired - Fee Related
Anticipated expiration

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a speech assessment method comprising the steps of: S1, recording an examinee's answer speech; S2, pre-processing the examinee's answer speech to obtain an answer speech corpus; S3, extracting the characteristic parameters of the answer speech corpus; S4, using a speech recognition method based on a hybrid HMM/ANN model to match the characteristic parameters of the answer speech corpus against a standard pronunciation template, thereby recognizing the content of the answer speech and giving a preliminary score; S5, if the preliminary score is below a threshold, taking the preliminary score as the final score, and otherwise scoring the sub-indices of accuracy, fluency, speech rate, rhythm, stress and intonation; S6, combining the sub-index scores to calculate the final score of the answer speech. The invention also discloses a speech assessment system. By using a speech recognition method based on a hybrid model, the present invention recognizes speech more accurately, and can objectively score, by category of evaluation criterion, spoken-examination answers stored as files after recording.

Description

A speech assessment method and system
Technical field
The present invention relates to speech recognition and assessment technology, and in particular to a speech assessment method and system.
Background technology
From the application standpoint, speech recognition technology is generally divided into two classes: speaker-dependent recognition and speaker-independent recognition. Speaker-dependent recognition is tailored to one specific person; briefly, it recognizes only that person's voice and is unsuitable for a wider population. Speaker-independent recognition, by contrast, can meet the recognition needs of different people and is suitable for large-scale application.
The IBM speech research group currently holds a leading position in large-vocabulary speech recognition. Bell Laboratories of AT&T has also begun a series of experiments on speaker-independent recognition, whose results established methods for constructing standard templates for speaker-independent speech recognition.
The major advances of this period include:
(1) the maturation and continual refinement of hidden Markov model (Hidden Markov Models, HMM) technology, which became the mainstream approach to speech recognition;
(2) in continuous speech recognition, beyond recognizing acoustic information, the use of various kinds of linguistic knowledge — word formation, syntax, semantics, dialogue context and so on — to help recognize and understand the speech; at the same time, language models based on statistical probability emerged in the speech recognition field;
(3) the rise of applied research on artificial neural networks in speech recognition. Most of this work uses multilayer perceptron networks trained with the back-propagation (BP) algorithm; there are also feed-forward networks, which are simple in structure, easy to implement and free of feedback signals, and feedback networks, whose stability and associative-memory capability are closely tied to the feedback connections between neurons. Artificial neural networks can learn complex classification boundaries, which clearly makes them effective for pattern classification.
In addition, continuous-speech dictation technology for personal use has gradually matured. The most representative systems here are IBM's ViaVoice and Dragon Systems' Dragon Dictate. These systems are speaker-adaptive: a new user need not train the whole vocabulary, and recognition accuracy improves continually with use.
In China, speech recognition has been developed by research institutions and universities in Beijing, such as the Institute of Acoustics and the Institute of Automation of the Chinese Academy of Sciences, Tsinghua University and Beijing Jiaotong University, and by Harbin Institute of Technology, the University of Science and Technology of China, Sichuan University and others. Many domestic speech recognition systems have now been developed successfully, each with its own strengths. In isolated-word large-vocabulary recognition, the most representative is the THED-919 speaker-dependent real-time speech recognition and understanding system jointly developed by the Department of Electronic Engineering of Tsinghua University and China Electronics Devices Corporation. In continuous speech recognition, the computer centre of Sichuan University implemented, on a microcomputer, a topic-constrained speaker-dependent continuous English-Chinese speech translation demonstration system. In speaker-independent recognition, the voice-controlled telephone directory enquiry system developed by the Department of Computer Science and Technology of Tsinghua University has been put into practical use.
In addition, iFLYTEK, China's largest provider of intelligent speech technology, released the world's first mobile-Internet intelligent speech interaction platform, the "iFLYTEK Speech Cloud", in 2010, announcing the arrival of the era of mobile-Internet speech dictation.
iFLYTEK has a long research record in intelligent speech technology and holds internationally leading results in Chinese speech synthesis, speech recognition, speech evaluation and related areas. Speech synthesis and speech recognition are the two key technologies needed to realize human-machine speech communication and to build speech systems that can both listen and speak. Automatic speech recognition (Auto Speech Recognize, ASR) aims to let the computer "understand" human speech by extracting the textual information it contains. Speech evaluation is a new frontier of intelligent speech processing, also called computer-assisted language learning (Computer Assisted Language Learning) technology: it automatically scores pronunciation, detects errors and provides corrective feedback. Voiceprint recognition, also known as speaker recognition (Speaker Recognition), extracts from the speech signal features that characterize the speaker's identity — such as the fundamental-frequency features reflecting glottal vibration and the spectral features reflecting oral-cavity size and vocal-tract length — and uses them to identify the speaker. Natural language has for thousands of years been an indispensable element of people's life, work and study, while the computer is one of the greatest inventions of the 20th century; how to use computers to process, and even understand, the natural languages that humans command, so that computers acquire the human abilities of listening, speaking, reading and writing, has always been a research topic actively pursued by institutions at home and abroad.
The content of the invention
The technical problem to be solved by the present invention is to provide a speech assessment method and system that can mark papers quickly and accurately and score examinees by objective marking criteria. The present invention combines the advantages of existing objective speech-quality evaluation models to obtain a better-performing speech recognition model and speech training model and a more accurate spoken-language scoring scheme, and can objectively score spoken answers stored as files through a system of multiple assessment indices. The invention is more stable and more efficient, lays a foundation for putting the research results into practice, and helps realize the goal of automatic marking of large-scale spoken-English examinations.
To solve the above technical problem, the invention provides a speech assessment method comprising the steps of:
S1, recording the examinee's answer speech;
S2, pre-processing the examinee's answer speech to obtain an answer speech corpus;
S3, extracting the characteristic parameters of the answer speech corpus;
S4, using a speech recognition method based on a hybrid HMM/ANN model to match the characteristic parameters of the answer speech corpus against a standard pronunciation template, thereby recognizing the content of the answer speech and giving a preliminary score;
S5, if the preliminary score is below a preset threshold, taking the preliminary score as the final score of the answer speech and marking the paper as a problem paper; if the preliminary score is above the preset threshold, scoring the answer speech on the sub-indices of accuracy, fluency, speech rate, rhythm, stress and intonation;
S6, weighting the sub-index scores to obtain the final score of the answer speech.
Further, step S0 precedes step S1 and specifically comprises the steps of:
S01, recording experts' standard pronunciation;
S02, pre-processing the standard pronunciation to obtain a standard pronunciation corpus;
S03, extracting the characteristic parameters of the standard pronunciation corpus;
S04, performing model training on the characteristic parameters of the standard pronunciation corpus to obtain the standard pronunciation template.
Further, the speech recognition method based on the hybrid HMM/ANN model in step S4 specifically comprises the steps of:
S41, building an HMM of the characteristic parameters of the answer speech corpus and obtaining the cumulative probabilities of all states in the HMM;
S42, feeding all the state cumulative probabilities to an ANN classifier as input features, which outputs the recognition result;
S43, matching the recognition result against the standard pronunciation template, thereby recognizing the content of the answer speech.
Further, the pre-processing in step S2 specifically comprises pre-emphasis, framing, windowing, noise reduction, endpoint detection and word segmentation, wherein the noise reduction specifically uses the blank segment of the speech as the noise baseline for de-noising the subsequent speech.
Further, the word segmentation specifically comprises the steps of:
S21, extracting the MFCC parameters of each phoneme in the speech and building an HMM for the corresponding phoneme;
S22, coarsely cutting the speech to obtain the effective speech segments;
S23, recognizing the words of the speech segments according to the phoneme HMMs, thereby recognizing the speech as a set of words.
Further, the characteristic-parameter extraction in step S3 specifically extracts MFCC characteristic parameters: the corpus obtained after pre-processing is passed through a fast Fourier transform, triangular-window filtering, a logarithm and a discrete cosine transform to obtain the MFCC characteristic parameters.
Further, the accuracy scoring in step S5 specifically comprises:
normalizing the speech sentence to be scored, by interpolation and decimation, to a length close to that of the standard pronunciation sentence; extracting the intensity curves of the sentence to be scored and of the standard sentence, using short-time energy as the feature; and scoring by comparing the degree of fit between the two intensity curves.
Further, the fluency scoring in step S5 specifically comprises:
cutting the speech to be scored into a front half and a back half, segmenting each half into words to obtain the effective speech segments, dividing the length of the effective speech segments in each half by the total length of the speech to be scored, and comparing the resulting values with the corresponding thresholds: if both exceed their thresholds, the speech is judged fluent; otherwise it is judged not fluent.
The speech-rate scoring specifically comprises: calculating the proportion of the voiced part of the speech to be scored within the total duration, and scoring the speech rate according to that proportion.
The rhythm scoring specifically comprises: calculating the rhythm of the speech to be scored with an improved dPVI parameter formula.
The stress scoring specifically comprises: on the basis of the normalized intensity curve, dividing stress units using a double threshold formed by a stress threshold and a non-stress threshold, together with stressed-vowel duration, as features, and pattern-matching the speech sentence to be scored against the standard pronunciation sentence with the DTW algorithm to score the stress.
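The DTW pattern matching used for the stress comparison can be sketched as follows; this is the textbook dynamic-time-warping recurrence applied to one-dimensional feature sequences (the function name and the absolute-difference local cost are illustrative assumptions, not taken from the patent):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences.

    D[i, j] holds the minimal accumulated cost of aligning the first i
    elements of `a` with the first j elements of `b`; the patent applies
    DTW to the stress-unit features of the test and standard sentences.
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three admissible predecessors
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

A small distance indicates that the stress pattern of the sentence to be scored aligns well with the standard sentence even when the two differ in tempo.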
The intonation scoring specifically comprises: extracting the formants of the speech to be scored and of the standard pronunciation, and scoring the intonation according to the degree of fit between the formant trend of the speech to be scored and that of the standard pronunciation.
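A minimal sketch of the trend-fit idea: resample the test formant trajectory onto the standard one's time grid and measure how well the two trends agree. The patent only speaks of a "degree of fit"; using Pearson correlation and mapping it linearly to a 0-100 score are assumed choices for illustration.

```python
import numpy as np

def intonation_score(f_test, f_std):
    """Score intonation by the fit of two formant trend curves.

    `f_test` and `f_std` are per-frame formant (or pitch) values; the
    test curve is linearly resampled to the standard curve's length.
    """
    x_std = np.linspace(0.0, 1.0, len(f_std))
    x_tst = np.linspace(0.0, 1.0, len(f_test))
    warped = np.interp(x_std, x_tst, f_test)   # time-normalised test curve
    r = np.corrcoef(warped, f_std)[0, 1]       # trend similarity, -1..1
    return 50.0 * (r + 1.0)                    # map to 0..100
```

An identical rising contour scores near 100; an inverted contour scores near 0.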
The invention also provides a speech assessment system, comprising:
a speech recording module for recording the examinee's answer speech;
a pre-processing module for pre-processing the examinee's answer speech to obtain an answer speech corpus;
a characteristic-parameter extraction module for extracting the characteristic parameters of the answer speech corpus;
a speech recognition module for matching, with a speech recognition method based on a hybrid HMM/ANN model, the characteristic parameters of the answer speech corpus against a standard pronunciation template, recognizing the content of the answer speech and giving a preliminary score;
a speech assessment module for performing accuracy, fluency, speech-rate, rhythm, stress and intonation scoring on answer speech whose preliminary score exceeds the set threshold;
a comprehensive scoring module for combining the accuracy, fluency, speech-rate, rhythm, stress and intonation scores to compute the final score of answer speech whose preliminary score exceeds the set threshold.
Implementing the present invention yields the following beneficial effects:
1. practical noise-reduction and word-segmentation methods are added to the pre-processing module, yielding a higher-quality speech corpus;
2. the speech recognition method based on the hybrid HMM/ANN model performs better and recognizes more accurately;
3. the multi-index analysis of speech rate, rhythm, stress and intonation makes the scoring of read-aloud questions more diverse than the original single index and the results more objective;
4. through the joint analysis of accuracy and fluency, the invention extends scoring from read-aloud questions only to objective scoring of non-read-aloud questions such as translation, question-and-answer and repetition questions, establishing a reasonable and complete speech scoring method and system that marks quickly and accurately and scores examinees by objective criteria;
5. the present invention is more stable, more efficient, practical and widely applicable; applied to the marking process of spoken-English tests, it markedly shortens marking time, raises processing efficiency and improves marking objectivity.
Description of the drawings
To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the speech assessment method provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of the specific steps of step S0;
Fig. 3 is a schematic flowchart of the specific steps of the pre-processing in Fig. 1;
Fig. 4 is a schematic flowchart of the specific steps of the word segmentation in Fig. 3;
Fig. 5 is a schematic flowchart of the specific steps of MFCC characteristic-parameter extraction;
Fig. 6 is a schematic flowchart of the specific steps of the speech recognition method based on the hybrid HMM/ANN model;
Fig. 7 is a structural diagram of the speech assessment system provided by an embodiment of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art, without creative effort, on the basis of the embodiments of the present invention fall within the scope of protection of the invention.
An embodiment of the present invention provides a speech assessment method which, as shown in Fig. 1, comprises the steps of:
S1, recording the examinee's answer speech;
S2, pre-processing the examinee's answer speech to obtain an answer speech corpus;
S3, extracting the characteristic parameters of the answer speech corpus;
S4, using a speech recognition method based on a hybrid model of hidden Markov models (Hidden Markov Models, HMM) and artificial neural networks (Artificial Neural Networks, ANN) to match the characteristic parameters of the answer speech corpus against a standard pronunciation template, recognizing the content of the answer speech and giving a preliminary score;
S5, if the preliminary score is below a preset threshold, taking the preliminary score as the final score of the answer speech and marking the paper as a problem paper; if the preliminary score is above the preset threshold, scoring the answer speech on the sub-indices of accuracy, fluency, speech rate, rhythm, stress and intonation;
S6, weighting the sub-index scores to obtain the final score of the answer speech.
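The gating of step S5 and the weighted combination of step S6 can be sketched as follows. The patent states only that a below-threshold preliminary score becomes the final score and that sub-index scores are weighted; the equal weights, the threshold value and the function names are assumptions for illustration:

```python
SUB_INDICES = ["accuracy", "fluency", "rate", "rhythm", "stress", "intonation"]

def final_score(sub_scores, weights=None):
    """Weighted combination of the six sub-index scores (step S6)."""
    if weights is None:
        weights = {k: 1.0 / len(SUB_INDICES) for k in SUB_INDICES}  # equal weights assumed
    return sum(weights[k] * sub_scores[k] for k in SUB_INDICES)

def score_paper(preliminary, sub_scores, threshold=30.0):
    """Step S5 gating: a below-threshold paper keeps its preliminary
    score and is flagged as a problem paper; the threshold is assumed.

    Returns (final score, problem-paper flag).
    """
    if preliminary < threshold:
        return preliminary, True
    return final_score(sub_scores), False
```

With equal weights, uniform sub-scores of 80 yield a final score of 80, while a preliminary score of 10 is returned unchanged and flagged.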
Further, step S0 precedes step S1 and, as shown in Fig. 2, specifically comprises the steps of:
S01, recording experts' standard pronunciation;
here the standard pronunciation is recorded under controlled conditions by highly professional speakers, and its content corresponds to the content of the spoken-English examination;
S02, pre-processing the standard pronunciation to obtain a standard pronunciation corpus;
S03, extracting the characteristic parameters of the standard pronunciation corpus;
S04, performing model training on the characteristic parameters of the standard pronunciation corpus to obtain the standard pronunciation template.
Here, model training of the standard pronunciation means obtaining, from a large number of known patterns and according to a certain criterion, the model parameters that characterize the essential features of those patterns, i.e. the standard pronunciation template. Concretely, the training iteratively adjusts the system's template parameters from the initial construction data (including the state-transition matrix probabilities and the variances, means and weights of the Gaussian mixture models) so that the performance of the speech recognition system keeps approaching its optimal state. Since the standard pronunciation of professionals differs to some extent from examinees' speech, and the present invention scores natural speakers, the corpus will be extended from specific professionals to ordinary people and from a specific environment to everyday environments, covering speakers of different genders, ages and accents.
Each step is described in detail below.
1. Pre-processing
As shown in Fig. 3, the pre-processing in step S2 specifically comprises noise reduction, pre-emphasis, framing, windowing, endpoint detection and word segmentation. Its purpose is to eliminate the effects on signal quality of the speaker's vocal organs and of the recording equipment, providing high-quality parameters for speech feature extraction and thereby improving the quality of subsequent speech processing.
In the noise reduction, the blank segment of the speech is used as the noise baseline for de-noising the subsequent speech. Research shows that an examinee usually does not speak during a short interval at the start of the recording, and that this stretch is not blank but a recorded segment containing noise. By taking the audio of this segment as the noise baseline, the later recording can be de-noised and the noise interference of unvoiced segments eliminated.
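A minimal sketch of this idea in the style of spectral subtraction: the first few hundred milliseconds (assumed silent) supply the noise baseline, whose average magnitude spectrum is subtracted frame by frame from the rest of the recording. The leading-segment length, frame size and subtraction scheme are assumptions, not values from the patent:

```python
import numpy as np

def denoise(signal, sr, lead_ms=300, frame=256):
    """De-noise speech using its leading blank segment as the noise baseline."""
    n_lead = int(sr * lead_ms / 1000)
    noise = signal[:n_lead]
    # average magnitude spectrum of the noise-only frames
    noise_mag = np.mean(
        [np.abs(np.fft.rfft(noise[i:i + frame], frame))
         for i in range(0, len(noise) - frame, frame)], axis=0)
    out = np.copy(signal).astype(float)
    for i in range(0, len(signal) - frame, frame):
        spec = np.fft.rfft(out[i:i + frame], frame)
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)  # subtract, floor at zero
        out[i:i + frame] = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), frame)
    return out
```

Because the subtracted baseline matches the recording's own noise, stationary background noise and the hiss of unvoiced segments are strongly attenuated.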
Word segmentation cuts a sentence into individual words or phrases, so that the computer can "understand" the examinee's utterance by recognizing them one by one, preparing for the later analysis of bonus and penalty factors and the final automatic scoring. As shown in Fig. 4, the word segmentation specifically comprises the steps of:
S21, extracting the Mel-frequency cepstral coefficient (Mel Frequency Cepstrum Coefficient, MFCC) parameters of each phoneme in the speech, and building an HMM for the corresponding phoneme;
S22, coarsely cutting the speech to obtain the effective speech segments;
The coarse cutting has two purposes: first, to reduce the amount of computation and hence the segmentation time; second, to increase the segmentation accuracy. It uses the double-threshold method to cut away the obviously blank stretches, but with relatively low thresholds, the aim being to retain the effective speech segments;
S23, recognizing the words of the speech segments according to the phoneme HMMs, thereby recognizing the speech as a set of words.
This word-segmentation method offers high recognition rate, high accuracy and small error: 1) the number of recognition templates is fixed, so the HMMs achieve very high accuracy, and no output-probability threshold needs to be set, which greatly improves the recognition rate; 2) segmentation yields the pronunciation of each word, which aids keyword matching and reduces the error introduced by matching whole words.
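The double-threshold coarse cut of step S22 can be sketched with short-time energy alone (a full implementation would typically add zero-crossing rate): a segment opens when the normalized frame energy crosses a high threshold and closes when it falls below a low one. The frame size and threshold values are assumptions:

```python
import numpy as np

def coarse_cut(signal, frame=200, hi=0.10, lo=0.02):
    """Double-threshold endpoint detection for the coarse cut.

    Returns (start, end) sample ranges of the effective speech segments.
    """
    frames = [signal[i:i + frame] for i in range(0, len(signal) - frame + 1, frame)]
    energy = np.array([np.mean(f ** 2) for f in frames])
    energy = energy / (energy.max() + 1e-12)          # normalise to [0, 1]
    segments, start = [], None
    for k, e in enumerate(energy):
        if start is None and e >= hi:
            start = k                                  # segment begins above `hi`
        elif start is not None and e < lo:
            segments.append((start * frame, k * frame))  # ends below `lo`
            start = None
    if start is not None:                              # speech runs to the end
        segments.append((start * frame, len(frames) * frame))
    return segments
```

Keeping `lo` and `hi` deliberately low, as the patent notes, errs on the side of keeping speech rather than cutting it.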
2. Characteristic-parameter extraction
The characteristic-parameter extraction in step S3 specifically extracts MFCC characteristic parameters. As shown in Fig. 5, the pre-processed corpus is passed through a fast Fourier transform, triangular-window filtering, a logarithm and a discrete cosine transform to obtain the MFCC characteristic parameters. MFCC parameters are used because they take the auditory properties of the human ear into account: the spectrum is converted into a non-linear spectrum on the Mel frequency scale and then transformed into the cepstral domain. Without any prior assumptions, they model the ear's hearing mathematically, using a bank of triangular filters placed densely in the low-frequency region to capture the spectral information of speech. In addition, MFCC parameters are robust against noise and spectral distortion, which improves the recognition performance of the system.
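The chain of Fig. 5 — FFT, triangular Mel filterbank, logarithm, DCT — can be sketched for a single frame as follows. The filter count and number of cepstral coefficients are common defaults, not values stated in the patent:

```python
import numpy as np

def mfcc(frame, sr, n_filt=26, n_ceps=13):
    """FFT -> triangular Mel filterbank -> log -> DCT for one speech frame."""
    mag = np.abs(np.fft.rfft(frame)) ** 2                 # power spectrum
    n_fft = len(frame)
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)        # Hz -> Mel
    imel = lambda m: 700 * (10 ** (m / 2595.0) - 1)       # Mel -> Hz
    pts = imel(np.linspace(0, mel(sr / 2), n_filt + 2))   # filter edges in Hz
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_filt, len(mag)))
    for j in range(n_filt):                               # triangular filters
        l, c, r = bins[j], bins[j + 1], bins[j + 2]
        for k in range(l, c):
            fbank[j, k] = (k - l) / max(c - l, 1)         # rising edge
        for k in range(c, r):
            fbank[j, k] = (r - k) / max(r - c, 1)         # falling edge
    logE = np.log(fbank @ mag + 1e-10)                    # log filterbank energies
    n = np.arange(n_filt)                                 # DCT-II basis
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (n + 0.5)) / n_filt)
    return dct @ logE
```

Because the Mel points are spaced uniformly on the Mel scale, the triangular filters crowd the low frequencies, mirroring the ear's resolution as the text describes.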
3. Speech-content recognition
Step S4 uses a speech recognition method based on a hybrid HMM/ANN model. The HMM approach requires prior statistical knowledge, has weak classification and decision ability for the speech signal, is structurally complex, and needs large numbers of training samples and much computation. ANNs, although strong in decision-making, still describe dynamic time signals unsatisfactorily, and neural-network speech recognition suffers from long training and recognition times. To overcome these respective weaknesses, the present invention organically combines HMMs, with their strong temporal modelling ability, and ANNs, with their strong classification ability, further improving the robustness and accuracy of speech recognition. This hybrid not only overcomes the overlap between pattern classes that HMMs alone cannot resolve, improving the discrimination of easily confused words, but also overcomes the ANN's restriction to fixed-length input patterns, eliminating complex time-alignment computation. Specifically, as shown in Fig. 6, the speech recognition method based on the hybrid HMM/ANN model in step S4 comprises the steps of:
S41, building an HMM of the characteristic parameters of the answer speech corpus and obtaining the cumulative probabilities of all states in the HMM;
S42, feeding all the state cumulative probabilities to an ANN classifier (specifically a self-organizing neural network) as input features, which outputs the recognition result;
S43, matching the recognition result against the standard pronunciation template, thereby recognizing the content of the answer speech.
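The state cumulative probabilities of step S41 are what the forward algorithm computes; a minimal log-space sketch follows. The interface is an assumption, and the ANN stage of S42 (which would consume these per-state scores as its input features) is not shown:

```python
import numpy as np

def state_scores(obs_loglik, logA, logpi):
    """Forward algorithm in log space.

    obs_loglik[t, i] is log p(o_t | state i); logA is the log transition
    matrix and logpi the log initial distribution. Returns the final
    cumulative log-probability of each HMM state for the sequence --
    the features the patent feeds to the ANN classifier.
    """
    alpha = logpi + obs_loglik[0]
    for t in range(1, len(obs_loglik)):
        m = alpha.max()                                   # log-sum-exp stabiliser
        alpha = obs_loglik[t] + m + np.log(np.exp(alpha - m) @ np.exp(logA))
    return alpha
```

Summing the exponentiated scores gives the sequence likelihood under the HMM, which is at most 1 (log at most 0) for proper probabilities.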
4. Speech assessment
In practice, some examinees cannot complete the spoken test well within the allotted time; their answer speech then contains large blanks or is unrecognizable, and we label such recordings as problem papers. Problem papers include blank recordings and unrecognizable recordings of all kinds, such as recordings in a language other than English or recordings with excessive noise. The purpose of step S4 is therefore not only to recognize what the examinee has read, but also to detect problem papers and, given the circumstances, assign them an appropriately low score; for the speech of problem papers there is no need to score accuracy, fluency, speech rate, rhythm, stress and intonation. Further speech assessment is carried out only when the preliminary score exceeds the preset threshold.
(1) The accuracy scoring in step S5 specifically comprises: normalizing the speech sentence to be scored, by interpolation and decimation, to a length close to that of the standard pronunciation sentence; extracting the intensity curves of the sentence to be scored and of the standard sentence, using short-time energy as the feature; and scoring by comparing the degree of fit between the two intensity curves.
The intensity of a sentence reflects how the speech signal changes over time. The loudness of stressed syllables shows up as energy intensity in the time domain, i.e. as high speech energy on those syllables. But different people take different times and use unequal vocal intensity for the same sentence, so directly template-matching the intensity curves of the sentence to be scored and the standard sentence would compromise the objectivity of the evaluation. The present invention therefore adapts the original technique into an intensity-curve extraction method based on the standard sentence: when the sentence to be scored is shorter than the standard sentence, its duration is supplemented by interpolation; when it is longer, its duration is adjusted by decimation; finally, the intensity curve of the sentence to be scored is normalized in intensity using the maximum-intensity point of the standard sentence's curve.
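The steps above can be sketched as follows: a short-time-energy intensity curve, time normalization of the test curve onto the standard grid (interpolation when shorter, decimation when longer), amplitude normalization against the standard curve's peak, and a fit score. Mapping the mean absolute difference to a 0-100 score is an assumed choice, not the patent's formula:

```python
import numpy as np

def intensity_curve(signal, frame=200):
    """Short-time energy per frame -- the 'intensity curve'."""
    return np.array([np.mean(signal[i:i + frame] ** 2)
                     for i in range(0, len(signal) - frame + 1, frame)])

def accuracy_score(test_curve, std_curve):
    """Fit of the normalised test intensity curve to the standard curve."""
    # time normalisation: resample the test curve onto the standard grid
    x_std = np.linspace(0.0, 1.0, len(std_curve))
    x_tst = np.linspace(0.0, 1.0, len(test_curve))
    warped = np.interp(x_std, x_tst, test_curve)
    # amplitude normalisation against the standard curve's peak
    warped = warped * (std_curve.max() / (warped.max() + 1e-12))
    diff = np.mean(np.abs(warped - std_curve)) / (std_curve.max() + 1e-12)
    return 100.0 * max(0.0, 1.0 - diff)
```

A curve identical to the standard scores 100; a decimated copy of the same curve still fits closely and scores high, which is exactly the invariance the length normalization is meant to buy.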
(2) The fluency scoring comprises the following steps: cut the speech to be scored into a front half and a back half, and apply word segmentation to each half to obtain the effective speech segments; divide the length of the effective speech segments in each half by the total length of the speech to be scored, and compare the resulting values with their corresponding thresholds. If both exceed their thresholds, the speech is judged fluent; otherwise it is judged not fluent.
For sentence-level fluency, the aim is to compute the smoothness of the sentence's delivery and, using the standard speech, to compute a prosody score for the pronunciation; fusing the two yields a sentence fluency diagnostic model. This sentence-level fluency scoring method can also be applied to passage-level fluency scoring. The method takes into account the speaker's smoothness while uttering the sentence and has a higher degree of correlation than traditional methods, so it is suitable for use in a speech assessment system.
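The front-half/back-half fluency decision of step (2) can be sketched as follows. This is illustrative Python; the 0.3 thresholds and the boolean per-frame voicing representation are assumptions, since the patent leaves the threshold values unspecified.

```python
def fluency_check(frames_voiced, front_thresh=0.3, back_thresh=0.3):
    """frames_voiced: list of booleans, True where the frame belongs to
    an effective speech segment.  The recording is split into front and
    back halves; each half's effective length, divided by the TOTAL
    length, must exceed its threshold for a 'fluent' verdict."""
    n = len(frames_voiced)
    front, back = frames_voiced[:n // 2], frames_voiced[n // 2:]
    front_ratio = sum(front) / n
    back_ratio = sum(back) / n
    return front_ratio > front_thresh and back_ratio > back_thresh
```

Dividing both halves by the total length (rather than each half's own length) means a long silent tail or head pulls its half's ratio below the threshold, which is exactly the disfluency the check is meant to catch.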
(3) The speech-rate scoring comprises the following steps: compute the ratio of the duration of the pronounced part of the speech to be scored to the total duration of the speech to be scored, and score the speech rate according to this ratio.
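The speech-rate ratio of step (3) might be mapped to a score as in the following sketch. The target band and the linear fall-off are assumed parameterizations, since the patent does not specify how the ratio is converted into a score.

```python
def speech_rate_score(voiced_duration, total_duration,
                      low=0.35, high=0.75):
    """Ratio of pronounced (voiced) time to total recording time,
    mapped to a 0-100 score.  The target band [low, high] and the
    linear penalty slope are assumptions, not taken from the patent."""
    ratio = voiced_duration / total_duration
    if low <= ratio <= high:
        return 100.0
    # linear fall-off outside the acceptable band
    gap = (low - ratio) if ratio < low else (ratio - high)
    return max(0.0, 100.0 - gap * 400.0)
```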
(4) The rhythm scoring comprises the following steps: compute the rhythm of the speech to be scored using an improved distinct Pairwise Variability Index (dPVI) formula. Exploiting the variability of speech-unit durations, the dPVI compares the syllable-unit segment durations of the standard speech sentence with those of the sentence to be scored, and the resulting parameter is used as a basis for objective evaluation and feedback guidance.
Here d denotes the durations of the speech-unit segments into which the sentence is divided (e.g., d_k is the duration of the k-th speech unit), m = min(number of units in the standard sentence, number of units in the sentence to be scored), and Len_Std is the duration of the standard speech sentence. Because the duration of the sentence to be scored has already been warped to match the standard sentence before the PVI computation, only Len_Std need be used as the normalization unit in the calculation.
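Using the symbols just defined, a formula consistent with the text is dPVI = (100 / Len_Std) · Σ_{k=1}^{m} |d_k^std − d_k^test|. The following sketch implements this reconstruction; the factor of 100 and the exact form are assumptions, as the patent's own formula is not reproduced in this text.

```python
def dpvi(std_durations, test_durations, len_std):
    """Improved dPVI: pairwise comparison of syllable-unit durations
    between the standard sentence and the sentence to be scored,
    normalized by the standard sentence duration Len_Std.  The exact
    formula is a reconstruction consistent with the symbols in the
    surrounding text, not a quotation of the patent."""
    m = min(len(std_durations), len(test_durations))
    diff = sum(abs(std_durations[k] - test_durations[k]) for k in range(m))
    return 100.0 * diff / len_std
```

A value of 0 means the unit durations match the standard perfectly; larger values indicate greater rhythmic deviation.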
(5) The stress scoring comprises the following steps: on the basis of the duration-warped intensity curves, delimit stress units using a double threshold, consisting of a stress threshold and a non-stress threshold, together with stressed-vowel duration as features; then apply a Dynamic Time Warping (DTW) algorithm to pattern-match the speech sentence to be scored against the standard speech sentence, realizing the stress score.
Stress refers to the syllables read with emphasis in a word, phrase or sentence. The basic principle of the DTW algorithm is dynamic time warping: the originally mismatched time spans of the test template and the reference template are aligned. Similarity is computed with the conventional Euclidean distance; with reference template R and test template T, the smaller the distance D[T, R], the higher the similarity. The drawback of the traditional DTW algorithm is that during template matching all frames carry the same weight and all templates must be matched, so the computational load is fairly heavy; in particular, when the number of templates grows quickly, the computation grows especially fast. The present invention therefore uses an improved DTW algorithm for the pattern matching between the speech sentence to be scored and the standard speech sentence: it remedies the drawback of the traditional algorithm by weighting each frame according to its importance, substantially reducing the computation and making the result more accurate.
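The frame-weighted DTW idea can be sketched as follows. This is illustrative Python; the per-frame weight vector is an assumed mechanism for weighting frames by importance, and uniform weights recover classical DTW with Euclidean frame distance.

```python
import numpy as np

def weighted_dtw(test, ref, weights=None):
    """DTW distance between two feature sequences (frames x dims).
    `weights` gives a per-frame importance for the reference template;
    with uniform weights this is classical DTW over Euclidean
    frame distances."""
    n, m = len(test), len(ref)
    if weights is None:
        weights = np.ones(m)
    INF = float("inf")
    D = np.full((n + 1, m + 1), INF)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = weights[j - 1] * np.linalg.norm(test[i - 1] - ref[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Setting a frame's weight near zero effectively excludes it from the match, which is one simple way to "give priority" to the informative (e.g. stressed) frames.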
(6) The intonation scoring comprises the following steps: extract the formants of the speech to be scored and of the standard speech, and score intonation according to the degree of fit between the variation trend of the formants of the speech to be scored and the variation trend of the formants of the standard speech.
Intonation is an important indicator of expressive ability in spoken English communication; it reflects the speaker's overall command of the language and is heard as the rise, fall and modulation of the voice.
In digital speech signal processing research, the formants of a speech signal are highly important parameters. A formant here refers to a region of relatively concentrated energy in the sound's spectrum; formants are not only a determining factor of timbre but also reflect the physical characteristics of the vocal tract (its resonant cavities). As sound passes through the resonant cavities it is filtered, so the energy at different frequencies is redistributed in the frequency domain: part is reinforced by the resonance of the cavities while another part is attenuated, and the reinforced frequencies appear as dense dark bands on a time-frequency spectrogram. Because the energy is unevenly distributed, the strong parts stand out like mountain peaks, hence the name formant. Formants are key features reflecting the resonance characteristics of the vocal tract; they represent the most direct source of pronunciation information, and listeners exploit formant information in speech perception, so formants are very important characteristic parameters in speech signal processing. Formants are the set of resonant frequencies produced when the quasi-periodic pulse excitation passes through the vocal tract. Formant parameters include formant frequency and bandwidth, and they are important parameters for distinguishing different vowels. Since formant information is contained in the spectral envelope, the key to formant extraction is estimating the natural speech spectral envelope; the maxima of the spectral envelope are generally taken to be the formants.
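LPC-based formant estimation is one common way to locate the maxima of the spectral envelope described above. The following is an illustrative Python sketch using a standard Levinson-Durbin recursion and root-finding; the patent does not state that this is its specific extraction method, and the model order and thresholds are assumptions.

```python
import numpy as np

def lpc_coeffs(frame, order=10):
    """Autocorrelation-method LPC coefficients via Levinson-Durbin."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][:order + 1]
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / e
        new_a = a.copy()
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        a = new_a
        e *= (1.0 - k * k)
    return a

def formants(frame, fs=8000, order=10):
    """Formant candidates: angles of the LPC polynomial roots in the
    upper half-plane, converted to Hz; near-real and near-DC roots
    are discarded."""
    a = lpc_coeffs(frame * np.hamming(len(frame)), order)
    roots = [z for z in np.roots(a) if np.imag(z) > 0.01]
    freqs = sorted(np.angle(z) * fs / (2.0 * np.pi) for z in roots)
    return [f for f in freqs if f > 90.0]
```

Tracking these candidate frequencies frame by frame yields the formant trajectories whose variation trends are compared between the speech to be scored and the standard speech.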
The present invention also provides a speech assessment system, as shown in Fig. 7, comprising:
a voice recording module 101, for recording the examination paper voice of the examinee;
a preprocessing module 102, for preprocessing the examination paper voice of the examinee to obtain the examination paper voice corpus;
a characteristic parameter extraction module 103, for extracting the characteristic parameters of the examination paper voice corpus;
a speech recognition module 104, for performing feature matching between the characteristic parameters of the examination paper voice corpus and the standard speech template using the speech recognition method based on the HMM and ANN hybrid model, recognizing the content of the examination paper voice, and giving an initial score;
a speech assessment module 105, for performing accuracy scoring, fluency scoring, speech-rate scoring, rhythm scoring, stress scoring and intonation scoring on examination paper voices whose initial score is higher than the set threshold; and
a comprehensive scoring module 106, for combining the accuracy, fluency, speech-rate, rhythm, stress and intonation scores to obtain the final score of each examination paper voice whose initial score is higher than the set threshold.
The speech assessment system corresponds to the speech assessment method described above, so for the specific processing steps of each module refer to the steps of the speech assessment method; they are not repeated here.
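The threshold gate of step S5 and the weighted combination of step S6 can be sketched together as follows. This is illustrative Python; the equal default weights and the threshold value are assumptions, since the patent states only that the sub-index scores are weighted.

```python
SUB_INDICES = ["accuracy", "fluency", "rate", "rhythm", "stress", "intonation"]

def final_score(sub_scores, weights=None):
    """Weighted combination of the six sub-scores.  The equal default
    weights are an assumption; the patent only states that a weighted
    sum is used."""
    if weights is None:
        weights = {k: 1.0 / len(SUB_INDICES) for k in SUB_INDICES}
    return sum(weights[k] * sub_scores[k] for k in SUB_INDICES)

def score_paper(initial_score, sub_scores, threshold=60.0, weights=None):
    """Step S5/S6 gate: below the threshold the initial score is final
    (problem paper, no further assessment); otherwise the weighted
    sub-index combination gives the final score.  The threshold value
    60.0 is an assumed placeholder."""
    if initial_score < threshold:
        return initial_score  # problem paper
    return final_score(sub_scores, weights)
```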
Implementing the present invention has the following beneficial effects:
(1) the present invention adds practical noise-reduction and word-segmentation methods to the preprocessing module, yielding a higher-quality voice corpus;
(2) the speech recognition method based on the HMM and ANN hybrid model performs better and recognizes more accurately;
(3) through multi-target analysis of speech rate, rhythm, stress and intonation, the scoring indices are more diversified than the original read-aloud indices, and the results are more objective;
(4) through the dual analysis of accuracy and fluency, on top of the original capability of scoring only read-aloud questions, objective scoring of non-read-aloud questions such as translation questions, question-and-answer questions and repetition questions is realized, establishing a reasonable and complete speech scoring method and system that can mark papers quickly and accurately and score examinees by objective scoring criteria;
(5) the present invention is more stable, more efficient, practical and widely applicable; it can be applied to the marking process of spoken-English tests, markedly shortening marking time, improving the efficiency of the system's processing and also improving the objectivity of marking.
The above discloses only a preferred embodiment of the present invention, which certainly cannot be used to limit the scope of its claims; equivalent variations made according to the claims of the present invention therefore still fall within the scope covered by the present invention.

Claims (6)

1. A speech assessment method, characterized by comprising the steps of:
S1, recording the examination paper voice of an examinee;
S2, preprocessing the examination paper voice of the examinee to obtain an examination paper voice corpus;
S3, extracting characteristic parameters of the examination paper voice corpus;
S4, performing feature matching between the characteristic parameters of the examination paper voice corpus and a standard speech template using a speech recognition method based on an HMM and ANN hybrid model, identifying the content of the examination paper voice, and giving an initial score;
S5, if the initial score is lower than a preset threshold, taking the initial score as the final score of the examination paper voice and labelling the examination paper voice as a problem paper; if the initial score is higher than the preset threshold, scoring the examination paper voice on the sub-indices of accuracy, fluency, speech rate, rhythm, stress and intonation;
S6, weighting the sub-index scores to obtain the final score of the examination paper voice;
wherein the accuracy scoring in step S5 comprises the steps of:
warping the speech sentence to be scored to a duration close to that of the standard speech sentence using an interpolation/decimation method; extracting the intensity curves of the speech sentence to be scored and of the standard speech sentence using short-time energy as the feature; and scoring according to the degree of fit between the intensity curves of the speech sentence to be scored and of the standard speech sentence;
the fluency scoring in step S5 comprises the steps of:
cutting the speech to be scored into a front half and a back half, and applying word segmentation to each half to obtain effective speech segments; dividing the length of the effective speech segments of each half by the total length of the speech to be scored, and comparing the resulting values with corresponding thresholds: if both exceed the corresponding thresholds, the speech is judged fluent; otherwise it is judged not fluent;
the examination paper voice is a recording of the examinee answering a translation question, a question-and-answer question or a repetition question;
the preprocessing in step S2 specifically comprises noise reduction, pre-emphasis, framing, windowing, endpoint detection and word segmentation, wherein the noise reduction comprises using the blank speech segment of the voice as the noise baseline to denoise the subsequent speech;
the word segmentation specifically comprises the steps of:
S21, extracting the MFCC parameters of each phoneme in the speech and building the HMM model of the corresponding phoneme;
S22, coarsely segmenting the speech to obtain effective speech segments;
S23, identifying the words of the speech segments according to the phoneme HMM models, thereby recognizing the speech as a set of words.
2. The speech assessment method of claim 1, characterized by further comprising, before step S1, a step S0 that specifically comprises the steps of:
S01, recording the standard speech of an expert;
S02, preprocessing the standard speech to obtain a standard speech corpus;
S03, extracting characteristic parameters of the standard speech corpus;
S04, performing model training on the characteristic parameters of the standard speech corpus to obtain the standard speech template.
3. The speech assessment method of claim 1, characterized in that the speech recognition method based on the HMM and ANN hybrid model in step S4 comprises the steps of:
S41, building HMM models of the characteristic parameters of the examination paper voice corpus and obtaining the cumulative probabilities of all states in the HMM models;
S42, processing all the state cumulative probabilities as the input features of an ANN classifier, thereby outputting the recognition result;
S43, performing feature matching between the recognition result and the standard speech template, thereby identifying the content of the examination paper voice.
4. The speech assessment method of claim 1, characterized in that the characteristic parameter extraction in step S3 specifically extracts MFCC characteristic parameters: the corpus obtained after preprocessing undergoes a fast Fourier transform, triangular-window filtering, logarithm taking and a discrete cosine transform to obtain the MFCC characteristic parameters.
5. The speech assessment method of claim 1, characterized in that the speech-rate scoring in step S5 comprises: computing the ratio of the duration of the pronounced part of the speech to be scored to the total duration of the speech to be scored, and scoring the speech rate according to the ratio;
the rhythm scoring comprises: computing the rhythm of the speech to be scored using an improved dPVI formula;
the stress scoring comprises: on the basis of the duration-warped intensity curves, delimiting stress units using a double threshold of a stress threshold and a non-stress threshold together with stressed-vowel duration as features, and applying a DTW algorithm to pattern-match the speech sentence to be scored against the standard speech sentence, realizing the stress score;
the intonation scoring comprises: extracting the formants of the speech to be scored and of the standard speech, and scoring intonation according to the degree of fit between the variation trend of the formants of the speech to be scored and the variation trend of the formants of the standard speech.
6. A speech assessment system, characterized by comprising:
a voice recording module, for recording the examination paper voice of an examinee;
a preprocessing module, for preprocessing the examination paper voice of the examinee to obtain an examination paper voice corpus;
a characteristic parameter extraction module, for extracting characteristic parameters of the examination paper voice corpus;
a speech recognition module, for performing feature matching between the characteristic parameters of the examination paper voice corpus and a standard speech template using a speech recognition method based on an HMM and ANN hybrid model, identifying the content of the examination paper voice, giving an initial score, and marking whether the voice is a problem paper;
a speech assessment module, for performing accuracy scoring, fluency scoring, speech-rate scoring, rhythm scoring, stress scoring and intonation scoring on non-problem examination paper voices whose initial score is higher than the preset threshold; and
a comprehensive scoring module, for combining the accuracy, fluency, speech-rate, rhythm, stress and intonation scores to obtain the final score of each examination paper voice whose initial score is higher than the set threshold;
wherein the accuracy scoring comprises the steps of:
warping the speech sentence to be scored to a duration close to that of the standard speech sentence using an interpolation/decimation method; extracting the intensity curves of the speech sentence to be scored and of the standard speech sentence using short-time energy as the feature; and scoring according to the degree of fit between the intensity curves of the speech sentence to be scored and of the standard speech sentence;
the fluency scoring comprises the steps of:
cutting the speech to be scored into a front half and a back half, and applying word segmentation to each half to obtain effective speech segments; dividing the length of the effective speech segments of each half by the total length of the speech to be scored, and comparing the resulting values with corresponding thresholds: if both exceed the corresponding thresholds, the speech is judged fluent; otherwise it is judged not fluent;
the examination paper voice is a recording of the examinee answering a translation question, a question-and-answer question or a repetition question;
the preprocessing of the examination paper voice specifically comprises noise reduction, pre-emphasis, framing, windowing, endpoint detection and word segmentation, wherein the noise reduction comprises using the blank speech segment of the voice as the noise baseline to denoise the subsequent speech;
the word segmentation specifically comprises the steps of:
S21, extracting the MFCC parameters of each phoneme in the speech and building the HMM model of the corresponding phoneme;
S22, coarsely segmenting the speech to obtain effective speech segments;
S23, identifying the words of the speech segments according to the phoneme HMM models, thereby recognizing the speech as a set of words.
CN201410178813.XA 2014-04-29 2014-04-29 A kind of speech assessment method and system Expired - Fee Related CN103928023B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410178813.XA CN103928023B (en) 2014-04-29 2014-04-29 A kind of speech assessment method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410178813.XA CN103928023B (en) 2014-04-29 2014-04-29 A kind of speech assessment method and system

Publications (2)

Publication Number Publication Date
CN103928023A CN103928023A (en) 2014-07-16
CN103928023B true CN103928023B (en) 2017-04-05

Family

ID=51146222

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410178813.XA Expired - Fee Related CN103928023B (en) 2014-04-29 2014-04-29 A kind of speech assessment method and system

Country Status (1)

Country Link
CN (1) CN103928023B (en)

Families Citing this family (61)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104361896B (en) * 2014-12-04 2018-04-13 上海流利说信息技术有限公司 Voice quality assessment equipment, method and system
CN104361895B (en) * 2014-12-04 2018-12-18 上海流利说信息技术有限公司 Voice quality assessment equipment, method and system
CN104505103B (en) * 2014-12-04 2018-07-03 上海流利说信息技术有限公司 Voice quality assessment equipment, method and system
CN104464423A (en) * 2014-12-19 2015-03-25 科大讯飞股份有限公司 Calibration optimization method and system for speaking test evaluation
CN104485105B (en) * 2014-12-31 2018-04-13 中国科学院深圳先进技术研究院 A kind of electronic health record generation method and electronic medical record system
CN104732977B (en) * 2015-03-09 2018-05-11 广东外语外贸大学 A kind of online spoken language pronunciation quality evaluating method and system
CN104732352A (en) * 2015-04-02 2015-06-24 张可 Method for question bank quality evaluation
CN104810017B (en) * 2015-04-08 2018-07-17 广东外语外贸大学 Oral evaluation method and system based on semantic analysis
CN105989839B (en) * 2015-06-03 2019-12-13 乐融致新电子科技(天津)有限公司 Speech recognition method and device
CN105681920B (en) * 2015-12-30 2017-03-15 深圳市鹰硕音频科技有限公司 A kind of Network teaching method and system with speech identifying function
CN106971711A (en) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 A kind of adaptive method for recognizing sound-groove and system
CN105608960A (en) * 2016-01-27 2016-05-25 广东外语外贸大学 Spoken language formative teaching method and system based on multi-parameter analysis
CN105632488A (en) * 2016-02-23 2016-06-01 深圳市海云天教育测评有限公司 Voice evaluation method and device
CN105654785A (en) * 2016-03-18 2016-06-08 上海语知义信息技术有限公司 Personalized spoken foreign language learning system and method
CN105825852A (en) * 2016-05-23 2016-08-03 渤海大学 Oral English reading test scoring method
CN106548673A (en) * 2016-10-25 2017-03-29 合肥东上多媒体科技有限公司 A kind of Teaching Management Method based on intelligent Matching
CN106531182A (en) * 2016-12-16 2017-03-22 上海斐讯数据通信技术有限公司 Language learning system
CN106710348A (en) * 2016-12-20 2017-05-24 江苏前景信息科技有限公司 Civil air defense interactive experience method and system
CN106652622B (en) * 2017-02-07 2019-09-17 广东小天才科技有限公司 Text training method and device
CN107221318B (en) * 2017-05-12 2020-03-31 广东外语外贸大学 English spoken language pronunciation scoring method and system
CN107293286B (en) * 2017-05-27 2020-11-24 华南理工大学 Voice sample collection method based on network dubbing game
CN107292496A (en) * 2017-05-31 2017-10-24 中南大学 A kind of work values cognitive system and method
CN107239897A (en) * 2017-05-31 2017-10-10 中南大学 A kind of personality occupation type method of testing and system
CN107230171A (en) * 2017-05-31 2017-10-03 中南大学 A kind of student, which chooses a job, is orientated evaluation method and system
CN107274738A (en) * 2017-06-23 2017-10-20 广东外语外贸大学 Chinese-English translation teaching points-scoring system based on mobile Internet
CN109214616B (en) 2017-06-29 2023-04-07 上海寒武纪信息科技有限公司 Information processing device, system and method
CN109426553A (en) 2017-08-21 2019-03-05 上海寒武纪信息科技有限公司 Task cutting device and method, Task Processing Unit and method, multi-core processor
WO2019001418A1 (en) 2017-06-26 2019-01-03 上海寒武纪信息科技有限公司 Data sharing system and data sharing method therefor
CN110413551B (en) 2018-04-28 2021-12-10 上海寒武纪信息科技有限公司 Information processing apparatus, method and device
CN107578778A (en) * 2017-08-16 2018-01-12 南京高讯信息科技有限公司 A kind of method of spoken scoring
CN107785011B (en) * 2017-09-15 2020-07-03 北京理工大学 Training method, device, equipment and medium of speech rate estimation model and speech rate estimation method, device and equipment
CN109697988B (en) * 2017-10-20 2021-05-14 深圳市鹰硕教育服务有限公司 Voice evaluation method and device
CN109727608B (en) * 2017-10-25 2020-07-24 香港中文大学深圳研究院 Chinese speech-based ill voice evaluation system
CN107818797B (en) * 2017-12-07 2021-07-06 苏州科达科技股份有限公司 Voice quality evaluation method, device and system
CN108428382A (en) * 2018-02-14 2018-08-21 广东外语外贸大学 It is a kind of spoken to repeat methods of marking and system
CN108429932A (en) * 2018-04-25 2018-08-21 北京比特智学科技有限公司 Method for processing video frequency and device
CN108831503B (en) * 2018-06-07 2021-11-19 邓北平 Spoken language evaluation method and device
CN109036429A (en) * 2018-07-25 2018-12-18 浪潮电子信息产业股份有限公司 A kind of voice match scoring querying method and system based on cloud service
CN108986786B (en) * 2018-07-27 2020-12-08 中国电子产品可靠性与环境试验研究所((工业和信息化部电子第五研究所)(中国赛宝实验室)) Voice interaction equipment rating method, system, computer equipment and storage medium
CN109147823A (en) * 2018-10-31 2019-01-04 河南职业技术学院 Oral English Practice assessment method and Oral English Practice assessment device
CN109493658A (en) * 2019-01-08 2019-03-19 上海健坤教育科技有限公司 Situated human-computer dialogue formula spoken language interactive learning method
CN111640452B (en) * 2019-03-01 2024-05-07 北京搜狗科技发展有限公司 Data processing method and device for data processing
CN109979484B (en) * 2019-04-03 2021-06-08 北京儒博科技有限公司 Pronunciation error detection method and device, electronic equipment and storage medium
CN110135492B (en) * 2019-05-13 2020-12-22 山东大学 Equipment fault diagnosis and abnormality detection method and system based on multiple Gaussian models
CN110211607A (en) * 2019-07-04 2019-09-06 山东中医药高等专科学校 A kind of English learning system based on sensing network
CN110600052B (en) * 2019-08-19 2022-06-07 天闻数媒科技(北京)有限公司 Voice evaluation method and device
CN111358428A (en) * 2020-01-20 2020-07-03 书丸子(北京)科技有限公司 Observation capability test evaluation method and device
CN111294468A (en) * 2020-02-07 2020-06-16 普强时代(珠海横琴)信息技术有限公司 Tone quality detection and analysis system for customer service center calling
CN111554324A (en) * 2020-04-01 2020-08-18 深圳壹账通智能科技有限公司 Intelligent language fluency identification method and device, electronic equipment and storage medium
CN111696524B (en) * 2020-04-21 2023-02-14 厦门快商通科技股份有限公司 Character-overlapping voice recognition method and system
CN111583961A (en) * 2020-05-07 2020-08-25 北京一起教育信息咨询有限责任公司 Stress evaluation method and device and electronic equipment
CN111612324B (en) * 2020-05-15 2021-02-19 深圳看齐信息有限公司 Multi-dimensional assessment method based on oral English examination
CN111599234A (en) * 2020-05-19 2020-08-28 黑龙江工业学院 Automatic English spoken language scoring system based on voice recognition
CN111612352B (en) * 2020-05-22 2024-06-11 北京易华录信息技术股份有限公司 Student expression capability assessment method and device
CN111816169B (en) * 2020-07-23 2022-05-13 思必驰科技股份有限公司 Method and device for training Chinese and English hybrid speech recognition model
CN112349300A (en) * 2020-11-06 2021-02-09 北京乐学帮网络技术有限公司 Voice evaluation method and device
CN112634692A (en) * 2020-12-15 2021-04-09 成都职业技术学院 Emergency evacuation deduction training system for crew cabins
CN112750465B (en) * 2020-12-29 2024-04-30 昆山杜克大学 Cloud language ability evaluation system and wearable recording terminal
CN113035238B (en) * 2021-05-20 2021-08-27 北京世纪好未来教育科技有限公司 Audio evaluation method, device, electronic equipment and medium
CN113571043B (en) * 2021-07-27 2024-06-04 广州欢城文化传媒有限公司 Dialect simulation force evaluation method and device, electronic equipment and storage medium
CN113807813A (en) * 2021-09-14 2021-12-17 广东德诚科教有限公司 Grading system and method based on man-machine conversation examination

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102800314A (en) * 2012-07-17 2012-11-28 广东外语外贸大学 English sentence recognizing and evaluating system with feedback guidance and method of system
CN103617799A (en) * 2013-11-28 2014-03-05 广东外语外贸大学 Method for detecting English statement pronunciation quality suitable for mobile device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102354495B (en) * 2011-08-31 2012-11-14 中国科学院自动化研究所 Testing method and system of semi-opened spoken language examination questions
CN103559894B (en) * 2013-11-08 2016-04-20 科大讯飞股份有限公司 Oral evaluation method and system


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Design and Implementation of an Automatic Pronunciation Evaluation System; Meng Ping; China Master's Theses Full-text Database, Information Science and Technology; 2010-12-15 (No. 12); pp. 7-26, 33-40 *
Research on an Objective Evaluation System for English Sentences Considering Stress and Prosody; Li Xinguang et al.; Computer Engineering and Applications; 2013-04-15; Vol. 49 (No. 8); pp. 105-109 *
A Hybrid Model Combining HMM and Self-Organizing Neural Networks for Speech Recognition; Li Jingjiao et al.; Journal of Northeastern University; 1999-04-30; Vol. 20 (No. 2); pp. 144-147 *

Also Published As

Publication number Publication date
CN103928023A (en) 2014-07-16

Similar Documents

Publication Publication Date Title
CN103928023B (en) A kind of speech assessment method and system
US11322155B2 (en) Method and apparatus for establishing voiceprint model, computer device, and storage medium
CN106228977B (en) Multi-mode fusion song emotion recognition method based on deep learning
CN101064104B (en) Emotion voice creating method based on voice conversion
CN102800314B (en) English sentence recognizing and evaluating system with feedback guidance and method
Deshwal et al. Feature extraction methods in language identification: a survey
CN109119072A (en) Civil aviaton's land sky call acoustic model construction method based on DNN-HMM
CN104050965A (en) English phonetic pronunciation quality evaluation system with emotion recognition function and method thereof
US20010010039A1 (en) Method and apparatus for mandarin chinese speech recognition by using initial/final phoneme similarity vector
CN106548775A (en) A kind of audio recognition method and system
Razak et al. Quranic verse recitation recognition module for support in j-QAF learning: A review
CN109300339A (en) A kind of exercising method and system of Oral English Practice
Sinha et al. Acoustic-phonetic feature based dialect identification in Hindi Speech
Kandali et al. Vocal emotion recognition in five native languages of Assam using new wavelet features
Kanabur et al. An extensive review of feature extraction techniques, challenges and trends in automatic speech recognition
Liu et al. AI recognition method of pronunciation errors in oral English speech with the help of big data for personalized learning
CN102880906B (en) Chinese vowel pronunciation method based on DIVA nerve network model
Goyal et al. A comparison of Laryngeal effect in the dialects of Punjabi language
CN112133292A (en) End-to-end automatic voice recognition method for civil aviation land-air communication field
Sharma et al. Soft-Computational Techniques and Spectro-Temporal Features for Telephonic Speech Recognition: an overview and review of current state of the art
Bansal et al. Emotional Hindi speech: Feature extraction and classification
Hacioglu et al. Parsing speech into articulatory events
Rao et al. Robust features for automatic text-independent speaker recognition using Gaussian mixture model
Dalva Automatic speech recognition system for Turkish spoken language
Sinha et al. Spectral and prosodic features-based speech pattern classification

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170405

Termination date: 20200429

CF01 Termination of patent right due to non-payment of annual fee