CN102426834B - Method for testing rhythm level of spoken English - Google Patents


Info

Publication number
CN102426834B
CN102426834B CN2011102527792A CN201110252779A
Authority
CN
China
Prior art keywords
rhythm
fundamental frequency
duration
variance
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN2011102527792A
Other languages
Chinese (zh)
Other versions
CN102426834A (en)
Inventor
李宏言
徐波
王士进
高鹏
李鹏
陈振标
柯登峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN2011102527792A priority Critical patent/CN102426834B/en
Publication of CN102426834A publication Critical patent/CN102426834A/en
Application granted granted Critical
Publication of CN102426834B publication Critical patent/CN102426834B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a method for testing the rhythm level of spoken English. The method comprises the following steps: A, preprocessing an original English speech signal; B, extracting from the preprocessed signal the multi-knowledge-source feature parameters used for rhythm testing, the parameters comprising rhythm performance features, rhythm generation features and rhythm influence features; and C, obtaining a rhythm-level test score for the original English speech from the multi-knowledge-source feature parameters. By refining and fusing information from multiple knowledge sources, the method obtains better test results and improves the objectivity and accuracy of the test.

Description

Method for testing the rhythm level of spoken English
Technical field
The present invention relates to artificial-intelligence speech signal processing and pattern recognition technology, and in particular to a method for testing the rhythm level of spoken English.
Background art
Rhythm, or prosody, is a property of human speech behavior. The medium of verbal communication is sound: the information the speaker wishes to convey and the information the hearer perceives are carried in the sound wave. In computer-assisted language learning, prosody distinguishes the "elegance" within the criteria of "fidelity, fluency and elegance" for language learners and represents the highest level of spoken performance. It acts on speech through complex physical and acoustic mechanisms to characterize suprasegmental properties of the speaker such as tone, attitude, intention and emotion. It can be said that merely reading a text aloud does not demonstrate mastery of a language; the degree to which a learner has truly digested the expressed content depends to a large extent on prosodic performance.
Rhythm-level testing is an important component of computer-aided automatic speech testing systems. The rhythm testing of the present invention is essentially different from prosody generation and evaluation in the field of speech synthesis. The latter is concerned with how to effectively improve the rhythmicity and naturalness of synthesized speech, whereas the rhythm testing of the present invention is concerned with the level of prosodic mastery in the genuine spoken pronunciation of the tested population. The present invention focuses on test takers with higher spoken proficiency, i.e., oral test takers whose content expression is complete, whose pronunciation accuracy is high and whose speech is fluent; by testing their rhythm level, it achieves the further purpose of "selecting the excellent from the good".
The basic acoustic correlates of rhythm perception are fundamental frequency, duration and energy. From the perspective of human perception, the rhythm level of a sentence or paragraph is usually judged in terms of intonation and rhythm. Intonation mainly reflects the subjective "falling" and "rising" of the voice; at the acoustic feature level, changes of intonation, tone and emotion are reflected by the variation of fundamental frequency over time. Rhythm covers stress, pausing and tempo control. Stress mainly reflects the subjectively perceived weighting and emphasis of the expressed content; English is a typical stress-timed language, and it is precisely this alternation of stress that produces its strong acoustic timing. Pausing mainly reflects the subjectively perceived segmentation of the rhythm, produced by gaps at sense-group, semantic or topic transitions. Tempo control mainly reflects the overall, macroscopic distribution of segment durations across the whole utterance.
From the perspective of human-machine communication, verbal communication is in fact a process of encoding by the speaker and decoding by the hearer, and rhythm testing is an important part of using machines to automate the decoding of speech. From the perspective of speech psychology, no general understanding or consensus has yet been formed on why one passage of speech is perceived as rhythmically better than another. At present, research is oriented more toward prosodic analysis for speech synthesis tasks, while research on rhythm testing for educational measurement is relatively scarce, although application demand for it shows a growing trend.
Existing rhythm testing methods generally use simple fundamental frequency, duration and energy features directly, without further processing the features or purposefully introducing models of the multiple knowledge sources associated with rhythm performance, rhythm generation and rhythm influence. A large body of research practice in educational measurement and pattern recognition shows that simply using broad prosodic features can hardly improve test performance further.
Summary of the invention
(1) Technical problem to be solved
To solve one or more of the above problems, the present invention provides a method for testing the rhythm level of spoken English that obtains better test results by refining and fusing information from multiple knowledge sources, improving the objectivity and accuracy of the test.
(2) technical scheme
According to an aspect of the present invention, a method for testing the rhythm level of spoken English is provided. The method comprises: step A, preprocessing an original English speech signal; step B, extracting from the preprocessed original English speech signal the multi-knowledge-source feature parameters used for rhythm testing, the multi-knowledge-source feature parameters comprising rhythm performance features, rhythm generation features and rhythm influence features; and step C, obtaining the rhythm-level test score of the original English speech from the multi-knowledge-source feature parameters.
Preferably, in the method for testing the rhythm level of spoken English of the present invention, step A comprises: step A1, performing valid-speech-segment detection on the original English speech signal, filtering out noise segments and long pauses and keeping the valid speech segments; step A2, framing the valid speech segments; step A3, using a speech recognizer to automatically align the framed valid speech segments with the corresponding text, obtaining boundary information for phonemes, syllables, words and sentences.
Preferably, in the method of the present invention, in step A2 the frame length is 25 ms and the frame shift is 10 ms.
Preferably, in the method of the present invention, extracting the rhythm performance features in step B comprises: step B1a, extracting the fundamental frequency and energy of each speech frame to form a fundamental frequency sequence and an energy sequence, computing the mean and variance of the fundamental frequency sequence and the mean and variance of the energy sequence, and taking the fundamental frequency mean, fundamental frequency variance, energy mean and energy variance as a 4-dimensional prosodic feature; step B1b, extracting the duration of each consonant segment, each vowel segment, each syllable segment and each inter-word pause segment, computing respectively the mean and variance of the consonant-segment durations, of the vowel-segment durations, of the syllable-segment durations and of the inter-word pause durations, and taking these eight values as an 8-dimensional prosodic feature; step B1c, concatenating the 4-dimensional feature extracted in step B1a with the 8-dimensional feature extracted in step B1b to obtain a 12-dimensional prosodic feature based on the rhythm performance knowledge source.
Preferably, in the method of the present invention, extracting the rhythm generation features in step B comprises: step B2a, extracting the fundamental frequency sequence of the speech frames and applying robustness processing to it; step B2b, taking the robustness-processed fundamental frequency sequence as the object, extracting the corresponding accent component and from it the number of accent steps, the mean accent step duration and the accent step duration variance, forming a 3-dimensional prosodic feature; step B2c, taking the robustness-processed fundamental frequency sequence as the object, extracting the corresponding baseline fundamental frequency and from it the baseline fundamental frequency feature; step B2d, taking the robustness-processed fundamental frequency sequence as the object, extracting the corresponding phrase component and from it the number of phrase impulses, the mean impulse amplitude and the impulse amplitude variance, forming a 3-dimensional prosodic feature; step B2e, concatenating the 3-dimensional feature extracted in step B2b, the 1-dimensional baseline fundamental frequency feature extracted in step B2c and the 3-dimensional feature extracted in step B2d to obtain a 7-dimensional prosodic feature based on the rhythm generation model.
Preferably, in the method of the present invention, in step B2a the robustness processing of the extracted fundamental frequency sequence comprises: removing half-frequency and double-frequency (octave) errors from the extracted sequence; smoothing the sequence after the half-frequency and double-frequency errors are removed; and stylizing the smoothed sequence.
Preferably, in the method of the present invention, step B2b comprises: high-pass filtering the robustness-processed fundamental frequency sequence and automatically extracting, by a gradient method, the maxima and minima where its curvature changes sharply; counting the sharply-changing curvature regions of the filtered sequence as the accent step count feature; computing the mean duration and duration variance of those regions as the mean accent step duration and accent step duration variance features; and taking the accent step count, mean accent step duration and accent step duration variance as the 3-dimensional prosodic feature derived from the accent component.
Preferably, in the method of the present invention, step B2d comprises: subtracting the baseline fundamental frequency extracted in step B2c from the sequence processed in step B2a to form a fundamental frequency curve reflecting the phrase component; counting the sharply-changing curvature regions of this curve as the impulse count feature; computing the mean impulse amplitude and amplitude variance of those regions as the mean impulse amplitude and impulse amplitude variance features; and taking the impulse count, mean impulse amplitude and impulse amplitude variance as the 3-dimensional prosodic feature derived from the phrase component.
Preferably, in the method of the present invention, extracting the rhythm influence features in step B comprises: step B3a, extracting by formula (1) the consonant-segment fundamental frequency PVI feature, the vowel-segment fundamental frequency PVI feature and the syllable-segment fundamental frequency PVI feature, forming a 3-dimensional prosodic feature based on fundamental frequency PVI; step B3b, extracting by formula (1) the consonant-segment duration PVI feature, the vowel-segment duration PVI feature and the syllable-segment duration PVI feature, forming a 3-dimensional prosodic feature based on duration PVI; wherein formula (1) is:
PVI = 100 × Σ_{k=1}^{m−1} | (x_k − x_{k+1}) / ((x_k + x_{k+1}) / 2) | / (m − 1)
where x_k and x_{k+1} denote respectively the fundamental frequency value or duration value of the k-th and (k+1)-th consecutive speech segments, and m denotes the number of consecutive speech segments. A speech segment here may be a consonant segment, a vowel segment or a syllable segment.
Preferably, in the method of the present invention, the following steps are further included before step C: step C'1, collecting speech data samples as a development set and manually annotating each speech data sample with a prosody score; step C'2, selecting a prosody score fitter model; step C'3, taking the prosodic features of each speech data sample as the front-end input parameters of the prosody score fitter model, and the manually annotated prosody score of each sample as its back-end output; step C'4, training the prosody score fitter model with its corresponding model training algorithm to obtain the model parameters of the prosody score fitter model. Step C then comprises: inputting the multi-knowledge-source feature parameters corresponding to the original English speech signal into the trained prosody score fitter, thereby obtaining the rhythm-level test result of the original English speech.
Preferably, in the method of the present invention, the prosody score fitter model is one of the following models: a Gaussian mixture model, a support vector machine model, or a multi-layer perceptron neural network model.
(3) Beneficial effects
The method of the present invention for testing the rhythm level of spoken English has the following beneficial effects:
1. In the present invention, the multi-knowledge-source features used for rhythm testing are obtained from three aspects: rhythm performance, rhythm generation and rhythm influence. By making full use of the prosodic information of multiple knowledge sources, the present invention can effectively improve the accuracy and reliability of the rhythm testing system;
2. With the present invention, manually scored speech data and learning sample libraries can be accumulated for speakers of different genders, different ages and different regions, and used to train prosody score fitter models for each of those groups, so that the testing method of the present invention has good generalization ability.
Description of the drawings
Fig. 1 is the overall flow diagram of the method for testing the rhythm level of spoken English according to an embodiment of the present invention;
Fig. 2 is the flow diagram of extracting the rhythm generation features in the method for testing the rhythm level of spoken English according to an embodiment of the present invention;
Fig. 3 is the flow diagram of training the prosody score fitter in the method for testing the rhythm level of spoken English according to an embodiment of the present invention.
Embodiments
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in more detail below with reference to specific embodiments and the accompanying drawings.
Fig. 1 is the overall flow diagram of the method for testing the rhythm level of spoken English according to an embodiment of the present invention. As shown in Fig. 1, the steps of the method are:
Step A', obtaining the original speech signal, read aloud by the user, on which the rhythm-level test is to be performed.
Step A, preprocessing the original speech signal.
Step A-1, performing valid-speech-segment detection (VAD) on the original speech, filtering out noise segments and long pauses, and keeping the speech segments for the next step.
Step A-2, framing the valid speech segments; preferably, the frame length is 25 ms and the frame shift is 10 ms, and the process repeats until the end of the speech signal.
Step A-3, using a speech recognizer to automatically align the framed valid speech segments with the corresponding text, obtaining boundary information for phonemes, syllables, words and sentences.
It should be noted that steps A-1, A-2 and A-3 must be executed in this fixed order; the order cannot be disturbed or reversed.
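The preprocessing steps A-1 and A-2 above can be sketched as follows. This is an illustrative sketch only: the patent does not specify its VAD algorithm, so the naive energy-threshold rule and the names `frame_signal` and `simple_vad` are assumptions; the 25 ms / 10 ms configuration is taken from the text.

```python
# Illustrative sketch of steps A-1/A-2: 25 ms / 10 ms framing followed by a
# naive energy-threshold VAD. The threshold rule is an assumption; the patent
# does not specify its VAD method.

def frame_signal(samples, sample_rate, frame_ms=25, shift_ms=10):
    """Split a list of samples into overlapping frames (step A-2)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames

def simple_vad(frames, threshold=1e-4):
    """Keep frames whose mean energy exceeds a threshold (step A-1)."""
    return [f for f in frames if sum(x * x for x in f) / len(f) > threshold]
```

At a 16 kHz sampling rate this yields 400-sample frames every 160 samples, matching the 25 ms frame length and 10 ms frame shift stated above.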
Step B, extracting the multi-knowledge-source prosodic features used for rhythm testing.
In view of the current state and shortcomings of rhythm testing technology, the present invention considers as far as possible the performance of rhythm, the generation of rhythm and the influence of rhythm, extracts effective and robust characterization parameters for each, then uses a score fitter to simulate the human scoring process, and further fuses the knowledge-source models to realize an objective test of rhythm level. Specifically, the multi-knowledge-source features used for rhythm testing are obtained from three aspects: rhythm performance, rhythm generation and rhythm influence. Many features can be derived from the three basic prosodic features, and up to now there is no unified understanding of how to judge which features are effective for rhythm testing.
At the development stage, the present invention adopted a greedy strategy: a wide variety of features were first extracted and then screened, in order to settle on the feature combination that helps rhythm testing most. The features involved in the following embodiments of the present invention are all superior features retained after this feature selection.
All the prosodic features extracted above are normalized by gender, and correspondingly normalized at the word and sentence levels. None of the prosodic features involved in the present invention requires manual annotation; all can be generated automatically by a computer program. In addition, the feature extraction procedures have no prescribed order; after all features have been extracted, they are merged into the final prosodic feature vector. The extraction of the features is detailed below:
Step B-1, extracting the rhythm performance features (prosodic features based on the rhythm performance knowledge source).
The prosodic features based on the rhythm performance knowledge source comprise the most basic fundamental frequency, duration and energy features, together with features derived from these three basic acoustic features. These prosodic features reflect the learner's ability to organize, express and control language at the morphological and syntactic levels, and are also the features most widely used by researchers at present.
Step B-1-a, extracting the fundamental frequency and energy of each speech frame to form a fundamental frequency sequence and an energy sequence; computing the mean and variance of the fundamental frequency sequence and the mean and variance of the energy sequence; taking the fundamental frequency mean, fundamental frequency variance, energy mean and energy variance as a 4-dimensional prosodic feature.
Step B-1-b, extracting the duration of each consonant segment, each vowel segment, each syllable segment and each inter-word pause segment; computing respectively the mean and variance of the consonant-segment durations, of the vowel-segment durations, of the syllable-segment durations and of the inter-word pause durations; taking the consonant mean duration, consonant duration variance, vowel mean duration, vowel duration variance, syllable mean duration, syllable duration variance, pause mean duration and pause duration variance as an 8-dimensional prosodic feature.
Step B-1-c, concatenating the 4-dimensional feature extracted in step B-1-a with the 8-dimensional feature extracted in step B-1-b to obtain the 12-dimensional prosodic feature based on the rhythm performance knowledge source.
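Steps B-1-a to B-1-c reduce to mean/variance statistics over six sequences, concatenated into one vector. The sketch below assumes the frame-level F0/energy sequences and the four kinds of segment durations have already been obtained from steps A-2 and A-3; all function and parameter names are illustrative.

```python
# Sketch of steps B-1-a to B-1-c: mean/variance over the frame-level F0 and
# energy sequences (4 dims) and over four kinds of segment durations (8 dims),
# concatenated into the 12-dimensional rhythm performance feature.

def mean_var(xs):
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    return m, v

def rhythm_performance_features(f0_seq, energy_seq,
                                consonant_durs, vowel_durs,
                                syllable_durs, pause_durs):
    feats = []
    for seq in (f0_seq, energy_seq,          # step B-1-a: 4 dims
                consonant_durs, vowel_durs,  # step B-1-b: 8 dims
                syllable_durs, pause_durs):
        feats.extend(mean_var(seq))
    return feats                             # step B-1-c: 12 dims total
```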
Step B-2, with reference to Fig. 2, extracting the rhythm generation features (prosodic features based on the rhythm generation model).
The prosodic features based on the rhythm generation knowledge source come from considering the prosody production model of speech synthesis in reverse. Generally speaking, the pitch contour we extract is an observed phenomenon: the product of the speaker's prosodic behavior after the action of a prosody model. Mining this observation to obtain deeper knowledge of prosody generation means tracing back the mechanism by which the prosody was produced. It is generally accepted that the intonation and the rhythm within prosody relate to each other like "waves" and "ripples" layered on top of one another; they can be represented as a simple algebraic sum, adding when in phase and cancelling when in antiphase. The suprasegmental prosody model proposed by the Japanese scholar Fujisaki models this "large wave, small ripple" relation in the fundamental frequency curve well, and gives it a good physiological, physical and acoustic explanation.
The Fujisaki model holds that the seemingly irregular fundamental frequency curve can be composed of three different operating components, each of which can be explained by the physical characteristics of the corresponding vocal organs. The three prosodic components are the phrase component, the accent component and the baseline frequency component, corresponding respectively to the description of intonation, rhythm and basic pitch. The goal of the present invention is to extract the characteristic parameters corresponding to these three components, so as to obtain knowledge from the prosody generation perspective.
Step B-2-a, extracting the fundamental frequency sequence of the speech frames and applying robustness processing to it. The robustness processing comprises three steps: first removing half-frequency and double-frequency (octave) errors, then smoothing the fundamental frequency sequence, and finally stylizing it.
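The first two robustness sub-steps can be sketched as follows. The patent names the sub-steps but not their algorithms, so the correction rule (folding values that jump to roughly half or double the previous voiced value) and the median-filter smoothing are assumptions; the stylization sub-step is omitted.

```python
# Sketch of the robustness processing in step B-2-a: correct halving/doubling
# (octave) errors against the previous voiced value, then median-smooth the
# contour. Both the correction thresholds and the median filter are assumed;
# the patent does not specify the algorithms.

def remove_octave_errors(f0_seq):
    """Fold values that jump to ~0.5x or ~2x of the previous voiced value."""
    out = []
    prev = None
    for f in f0_seq:
        if prev is not None and f > 0:
            if f > 1.8 * prev:
                f /= 2.0          # doubling error
            elif f < 0.55 * prev:
                f *= 2.0          # halving error
        out.append(f)
        if f > 0:
            prev = f
    return out

def median_smooth(f0_seq, width=3):
    """Running-median smoothing of the F0 sequence."""
    half = width // 2
    out = []
    for i in range(len(f0_seq)):
        window = sorted(f0_seq[max(0, i - half):i + half + 1])
        out.append(window[len(window) // 2])
    return out
```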
Step B-2-b, extracting the accent component parameters. The fundamental frequency sequence processed in step B-2-a is high-pass filtered, and the maxima and minima where its curvature changes sharply are extracted automatically by a gradient method. The number of sharply-changing regions is counted as the accent step count feature. The mean duration and duration variance of the sharply-changing regions are computed as the mean accent step duration and accent step duration variance features. The accent step count, mean accent step duration and accent step duration variance are taken as the 3-dimensional prosodic feature derived from the accent component.
Step B-2-c, extracting the baseline fundamental frequency. From the fundamental frequency sequence processed in step B-2-a, the high-frequency part extracted in step B-2-b is removed to form a low-pass fundamental frequency sequence. The minimum point of this low-pass sequence is found and taken as the baseline fundamental frequency feature.
Step B-2-d, extracting the phrase component parameters. The fundamental frequency sequence processed in step B-2-a minus the baseline fundamental frequency extracted in step B-2-c forms the fundamental frequency curve reflecting the phrase component. The number of sharply-changing regions is counted as the impulse count feature. The mean impulse amplitude and amplitude variance of the sharply-changing regions are computed as the mean impulse amplitude and impulse amplitude variance features. The impulse count, mean impulse amplitude and impulse amplitude variance are taken as the 3-dimensional prosodic feature derived from the phrase component.
Step B-2-e, concatenating the 3-dimensional feature extracted in step B-2-b, the 1-dimensional baseline fundamental frequency feature extracted in step B-2-c and the 3-dimensional feature extracted in step B-2-d to obtain the 7-dimensional prosodic feature based on the rhythm generation model.
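The decomposition in steps B-2-b to B-2-d can be sketched in simplified form: split the processed contour into a slow part and a fast residual, then count the excursions of the residual. Using a moving average as the low-pass/high-pass split and a fixed excursion threshold are both assumptions made for illustration; the patent specifies a gradient method on the curvature, which is not reproduced here.

```python
# Simplified sketch of the Fujisaki-inspired steps B-2-b/B-2-d: separate the
# processed F0 contour into a slow part (moving average) and a fast residual,
# then count contiguous excursions of the residual as accent steps. The
# moving-average high-pass and the fixed threshold are assumptions.

def moving_average(seq, width=5):
    half = width // 2
    out = []
    for i in range(len(seq)):
        window = seq[max(0, i - half):i + half + 1]
        out.append(sum(window) / len(window))
    return out

def count_excursions(seq, threshold):
    """Count contiguous runs where |value| exceeds the threshold."""
    count, inside = 0, False
    for x in seq:
        if abs(x) > threshold and not inside:
            count += 1
            inside = True
        elif abs(x) <= threshold:
            inside = False
    return count

def accent_step_count(f0_seq, width=5, threshold=10.0):
    low = moving_average(f0_seq, width)
    high = [f - l for f, l in zip(f0_seq, low)]  # high-pass residual
    return count_excursions(high, threshold)
```

The same `count_excursions` helper applies to the phrase-component curve of step B-2-d, counting impulses instead of accent steps.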
Step B-3, extracting the rhythm influence features.
The prosodic features based on the rhythm influence knowledge source concern the degree of interference between English and the learner's mother tongue, i.e., the purity of the learner's spoken English. In general, the speech of a person who has mastered English well shows undulating, well-proportioned rhythmic variation. Fundamental frequency and duration features, and especially their variation characteristics, play a key role in characterizing this purity. The PVI (Pairwise Variability Index) operator has achieved remarkable performance in distinguishing language varieties; the present invention extends it to different segment levels in rhythm testing, computing PVI over consecutive consonant, vowel and syllable segments respectively, to obtain the prosodic features based on the rhythm influence knowledge source.
Step B-3-a, extracting the fundamental frequency PVI features of the consonant, vowel and syllable segments respectively; the computation is as follows:
PVI = 100 × Σ_{k=1}^{m−1} | (p_k − p_{k+1}) / ((p_k + p_{k+1}) / 2) | / (m − 1)
In the above formula, p_k and p_{k+1} denote respectively the mean fundamental frequency of the k-th and (k+1)-th consecutive speech segments, and m denotes the number of consecutive speech segments. A speech segment here may be a consonant segment, a vowel segment or a syllable segment.
Step B-3-b, extracting the duration PVI features of the consonant, vowel and syllable segments respectively; the computation is as follows:
PVI = 100 × Σ_{k=1}^{m−1} | (d_k − d_{k+1}) / ((d_k + d_{k+1}) / 2) | / (m − 1)
In the above formula, d_k and d_{k+1} denote respectively the duration of the k-th and (k+1)-th consecutive speech segments, and m denotes the number of consecutive speech segments. A speech segment here may be a consonant segment, a vowel segment or a syllable segment.
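The PVI formula above translates directly into code: the mean, over consecutive segment pairs, of the absolute difference normalized by the pair mean, scaled by 100. The function name is an illustrative assumption.

```python
# Direct implementation of the PVI formula above. The input list holds
# per-segment mean F0 values (step B-3-a) or durations (step B-3-b) for
# consecutive consonant, vowel or syllable segments.

def pvi(values):
    if len(values) < 2:
        raise ValueError("PVI needs at least two consecutive segments")
    m = len(values)
    total = sum(abs((values[k] - values[k + 1]) / ((values[k] + values[k + 1]) / 2.0))
                for k in range(m - 1))
    return 100.0 * total / (m - 1)
```

Perfectly isochronous segments give PVI = 0, while strongly alternating long/short segments give a high PVI, which is why the index separates stress-timed languages such as English from syllable-timed ones.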
Step B-3-c, merging the consonant-segment, vowel-segment and syllable-segment fundamental frequency PVI features extracted in step B-3-a with the consonant-segment, vowel-segment and syllable-segment duration PVI features extracted in step B-3-b to obtain the 6-dimensional prosodic feature based on the rhythm influence knowledge source.
Step B-4, prosodic feature fusion. The 12-dimensional feature extracted in step B-1, the 7-dimensional feature extracted in step B-2 and the 6-dimensional feature extracted in step B-3 are merged into the final 25-dimensional prosodic feature.
Step C', training the score fitter.
For the mapping from features to scores, the present invention trains a fitter on development-set data. The development-set data carry prosody rating scores marked by experts; the features of each speech sample in the development set serve as the fitter input, the manually annotated score serves as the fitter output, and the fitter parameters are obtained by the fitter training algorithm, completing the training of the score fitter.
With reference to accompanying drawing 3, the concrete steps of training rhythm mark match device are:
Speech data is collected as the exploitation collection in step C '-1, and speech samples is carried out the mark of artificial rhythm mark.
Step C '-2, select suitable rhythm mark match device, the present invention does not limit particular type, can be a kind of in common sorter model, such as mixed Gauss model (GMM), Support Vector Machine (SVM), multi-Layer Perceptron Neural Network (MLP) etc.
Step C '-3 extract the prosodic features of each speech samples by step B, and as the input parameter of mark match device.With the artificial rhythm mark of each speech samples, as the Output rusults of rhythm mark match device.
Step C '-4 on the basis of step C '-3, utilize corresponding model training algorithm, and training rhythm mark match device finally obtains the model parameter of rhythm mark match device.
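Steps C'-1 to C'-4 amount to standard supervised regression. A minimal sketch with a support vector machine, one of the fitter types the text names; the use of scikit-learn and the synthetic development set are illustrative assumptions, not part of the patent:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# synthetic development set: one 25-dim prosodic feature vector per
# sample (stand-ins for the step B features), with expert rhythm
# scores on a 0-100 scale (step C'-1's manual annotation)
X_dev = rng.normal(size=(200, 25))
y_dev = rng.uniform(0, 100, size=200)

# steps C'-2 to C'-4: choose a fitter type and train it
fitter = SVR(kernel="rbf", C=1.0)
fitter.fit(X_dev, y_dev)

# step C: map a test sample's features to a rhythm level score
x_test = rng.normal(size=(1, 25))
score = float(fitter.predict(x_test)[0])
```

Any regressor with the same fit/predict interface (e.g. an MLP) could be substituted without changing the surrounding pipeline.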
Step C: input the multi-knowledge-source characteristic parameters of the spoken English to be evaluated into the rhythm score fitter to obtain the rhythm level test score, which serves as the objective assessment of the rhythm level of the tested speech sample.
It should be noted that the above rhythm testing steps apply at both the sentence level and the paragraph level; the specific level tested depends on the actual situation.
The specific embodiments described above further explain the purpose, technical scheme, and beneficial effects of the present invention. It should be understood that the above are only specific embodiments of the invention and are not intended to limit it; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the invention shall fall within its protection scope.

Claims (10)

1. A method for testing the rhythm level of spoken English, characterized by comprising:
Step A: preprocessing an original English speech signal;
Step B: extracting, from the preprocessed original English speech signal, multi-knowledge-source characteristic parameters for the rhythm test, the parameters comprising rhythm performance characteristics, rhythm production characteristics, and rhythm influence characteristics;
Step C: obtaining a rhythm level test score for the original English speech from the multi-knowledge-source characteristic parameters;
wherein, in step B, extracting the rhythm performance characteristics for the rhythm test from the preprocessed original English speech signal comprises:
Step B1a: extracting the fundamental frequency and energy of each speech frame to form a fundamental frequency sequence and an energy sequence, calculating the mean and variance of the fundamental frequency sequence and the mean and variance of the energy sequence, and taking the fundamental frequency mean, fundamental frequency variance, energy mean, and energy variance as a 4-dimensional prosodic feature;
Step B1b: extracting the duration of each consonant segment, each vowel segment, each syllable segment, and each inter-word pause segment, calculating respectively the mean duration and duration variance of the consonant segments, the vowel segments, the syllable segments, and the inter-word pause segments, and taking the consonant mean duration, consonant duration variance, vowel mean duration, vowel duration variance, syllable mean duration, syllable duration variance, pause mean duration, and pause duration variance as an 8-dimensional prosodic feature;
Step B1c: concatenating the 4-dimensional prosodic feature extracted in step B1a with the 8-dimensional prosodic feature extracted in step B1b to form the 12-dimensional rhythm performance characteristic based on the rhythm performance knowledge source.
2. the method for test Oral English Practice rhythm level according to claim 1, is characterized in that, described steps A comprises:
Steps A 1 is carried out the efficient voice section to the original English voice signal and is detected, and filter out noise section and long pause section keep effective voice segments signal;
Steps A 2 divides frame to process to the efficient voice segment signal;
Steps A 3 uses speech recognition device to dividing efficient voice segment signal and corresponding text after frame is processed to carry out automatic aligning, obtains the frontier point information of phoneme, syllable, word and sentence.
3. the method for test Oral English Practice rhythm level according to claim 2, is characterized in that, in described steps A 2, take 25ms as frame length, 10ms is frame period.
4. the method for test Oral English Practice rhythm level according to claim 1, is characterized in that, extracts the rhythm generation feature that is used for rhythm test in described step B and comprise in carrying out described pretreated original English voice signal:
Step B2a extracts the fundamental frequency sequence of each speech frame, and this fundamental frequency sequence is carried out robustness process;
Step B2b, the fundamental frequency sequence after to process through robustness extracts corresponding stressed parts as object, extracts and comes from the step number of reading parts again, average snap time, snap time variance, forms totally 3 dimension prosodic features;
Step B2c, the fundamental frequency sequence after processing take the process robustness extracts corresponding benchmark fundamental frequency as object, extracts the benchmark fundamental frequency feature that comes from the benchmark fundamental frequency;
Step B2d, the fundamental frequency sequence after processing take the process robustness extracts corresponding phrase parts as object, extracts the impulse number that comes from the phrase parts, average impulse amplitude, impulse amplitude variance, and formation is totally 3 dimension prosodic features;
The 1 dimension benchmark fundamental frequency feature that step B2e, 3 dimension prosodic features, the step B2c that step B2b is extracted extract, the 3 dimension prosodic features that step B2d extracts splice, and produce feature as the 7 dimension rhythms based on rhythm production model.
5. the method for test Oral English Practice rhythm level according to claim 4, is characterized in that, in described step B2a, the fundamental frequency sequence that has extracted carried out the robustness processing and comprise:
The fundamental frequency sequence that has extracted is removed half frequently and the frequency multiplication interference;
The fundamental frequency sequence of removing after half frequency and frequency multiplication are disturbed is carried out smooth operation;
Carry out stylization and process carrying out fundamental frequency sequence after smooth operation.
6. the method for test Oral English Practice rhythm level according to claim 4, is characterized in that, described step B2b comprises:
The fundamental frequency sequence of processing through robustness is carried out high-pass filtering, utilize the gradient method automatic lifting to take out its mean curvature and change violent maximum value and minimal value part;
The quantity of the curvature acute variation part of the fundamental frequency sequence after the calculating high-pass filtering is as the step number feature of fundamental frequency sequence;
Average duration and the variance of the curvature acute variation part of the fundamental frequency sequence after the calculating high-pass filtering change, as average snap time and the snap time Variance feature of fundamental frequency sequence;
Step number, average snap time, snap time variance that said extracted is gone out produce feature as coming from the 3 dimension rhythms of reading parts again.
7. the method for test Oral English Practice rhythm level according to claim 4, is characterized in that, described step B2d comprises:
Fundamental frequency sequence after processing with step B2a deducts the benchmark fundamental frequency that step B2c extracts, and forms the fundamental frequency sequence curve of reflection phrase parts;
Calculate the quantity of the curvature acute variation part in the fundamental frequency sequence that reflects the phrase parts, as the impulse number feature of fundamental frequency sequence;
Calculate average impulse amplitude and the amplitude variance of the curvature acute variation part in the fundamental frequency sequence that reflects the phrase parts, as average impulse amplitude and the impulse amplitude Variance feature of fundamental frequency sequence;
Impulse number, average impulse amplitude, impulse amplitude variance that said extracted is gone out produce feature as the 3 dimension rhythms that come from the phrase parts.
8. the method for test Oral English Practice rhythm level according to claim 1, is characterized in that, extracts the rhythm effect characteristics that is used for rhythm test in described step B and comprise in carrying out described pretreated original English voice signal:
Step B3a by formula one extraction consonant segment base frequency PVI feature, vowel segment base frequency PVI feature, syllable segment base frequency PVI feature, forms totally 3 prosodic features of tieing up based on fundamental frequency PVI;
Step B3b extracts consonant section duration PVI feature, first segment duration PVI feature, syllable section duration PVI feature by formula one, forms totally 3 rhythm effect characteristicses of tieing up based on duration PVI,
Wherein, the expression formula of formula one is: PVI = 100 × Σ k = 1 m - 1 | x k - x k + 1 ( x k + x k + 1 ) / 2 | / ( m - 1 ) , The continuous speech section is divided into x kAnd x k+1Two parts, and represent respectively fundamental frequency value or the duration value of k and k+1 voice segments, m represents the number of continuous speech section; The voice segments here is consonant section, first segment or syllable section.
9. the method for test Oral English Practice rhythm level according to claim 1, is characterized in that,
Also comprise before described step C: step C ' 1, collects the speech data training sample as the exploitation collection, and described speech data training sample carried out the mark of artificial rhythm mark; Step C ' 2, select rhythm mark match device model; Step C ' 3, with the multiple knowledge sources characteristic parameter of each speech data training sample front end input parameter as described rhythm mark match device model, with the artificial rhythm mark of each speech data training sample rear end Output rusults as rhythm mark match device model; Step C ' 4, utilize the corresponding model training algorithm of described rhythm mark match device model, train described rhythm mark match device model, obtain the model parameter of described rhythm mark match device model;
Described step C comprises: the multiple knowledge sources characteristic parameter that described original English voice signal is corresponding is inputted the rhythm mark match device after training, thereby obtains the rhythm assessment of levels test result of described original English voice.
10. the method for test Oral English Practice rhythm level according to claim 9, is characterized in that, described rhythm mark match device model is a kind of with in drag: mixed Gauss model, Support Vector Machine model, multi-Layer Perceptron Neural Network model.
CN2011102527792A 2011-08-30 2011-08-30 Method for testing rhythm level of spoken English Active CN102426834B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011102527792A CN102426834B (en) 2011-08-30 2011-08-30 Method for testing rhythm level of spoken English


Publications (2)

Publication Number Publication Date
CN102426834A CN102426834A (en) 2012-04-25
CN102426834B true CN102426834B (en) 2013-05-08

Family

ID=45960808

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011102527792A Active CN102426834B (en) 2011-08-30 2011-08-30 Method for testing rhythm level of spoken English

Country Status (1)

Country Link
CN (1) CN102426834B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104575518B (en) * 2013-10-17 2018-10-02 清华大学 Rhythm event detecting method and device
US20150179167A1 (en) * 2013-12-19 2015-06-25 Kirill Chekhter Phoneme signature candidates for speech recognition
CN104464751B (en) * 2014-11-21 2018-01-16 科大讯飞股份有限公司 The detection method and device for rhythm problem of pronouncing
CN104361896B (en) * 2014-12-04 2018-04-13 上海流利说信息技术有限公司 Voice quality assessment equipment, method and system
CN108206026B (en) * 2017-12-05 2021-12-03 北京小唱科技有限公司 Method and device for determining pitch deviation of audio content
CN110992986B (en) * 2019-12-04 2022-06-07 南京大学 Word syllable stress reading error detection method, device, electronic equipment and storage medium
CN111243625B (en) * 2020-01-03 2023-03-24 合肥讯飞数码科技有限公司 Method, device and equipment for testing definition of equipment and readable storage medium
CN111312231B (en) * 2020-05-14 2020-09-04 腾讯科技(深圳)有限公司 Audio detection method and device, electronic equipment and readable storage medium
CN112289298A (en) * 2020-09-30 2021-01-29 北京大米科技有限公司 Processing method and device for synthesized voice, storage medium and electronic equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101197084A (en) * 2007-11-06 2008-06-11 安徽科大讯飞信息科技股份有限公司 Automatic spoken English evaluating and learning system
CN101727903B (en) * 2008-10-29 2011-10-19 中国科学院自动化研究所 Pronunciation quality assessment and error detection method based on fusion of multiple characteristics and multiple systems



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant