CN102426834A - Method for testing rhythm level of spoken English - Google Patents



Publication number
CN102426834A
Authority
CN
China
Prior art keywords
rhythm
fundamental frequency
characteristic
duration
variance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011102527792A
Other languages
Chinese (zh)
Other versions
CN102426834B (en)
Inventor
李宏言
徐波
王士进
高鹏
李鹏
陈振标
柯登峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science
Priority to CN2011102527792A
Publication of CN102426834A
Application granted
Publication of CN102426834B
Active
Anticipated expiration

Abstract

The invention discloses a method for testing the rhythm level of spoken English. The method comprises the following steps: A, preprocessing an original English speech signal; B, extracting multi-knowledge-source characteristic parameters for a rhythm test from the preprocessed original English speech signal, wherein the multi-knowledge-source characteristic parameters comprise rhythm performance characteristics, rhythm generation characteristics and rhythm influence characteristics; and C, obtaining a rhythm level test score for the original English speech from the multi-knowledge-source characteristic parameters. By refining and fusing information from multiple knowledge sources, the method obtains a better test result and improves the objectivity and accuracy of the test.

Description

Method for testing the rhythm level of spoken English
Technical field
The present invention relates to speech signal processing and pattern recognition technology in the field of artificial intelligence, and in particular to a method for testing the rhythm level of spoken English.
Background technology
Rhythm (prosody) is a concept defined for speech as a human behavior. The medium of spoken communication is sound, so both the information the speaker wants to convey and the information the listener receives are carried in the sound wave. In computer-assisted language learning, rhythm performance corresponds to "elegance" in the language learner's progression of "fidelity, fluency, elegance". It is the highest-level element of speech: it acts on speech through complex physical and acoustic mechanisms and characterizes suprasegmental traits of the speaker such as tone, attitude, intention and emotion. A learner's command of a language shows only in the act of expression, and how well the content is truly absorbed depends to a large extent on how well its rhythm is performed.
Rhythm level testing is an important component of computer-assisted automatic speech testing systems. The rhythm test of the present invention differs in essence from earlier rhythm generation and testing in the field of speech synthesis. The latter is concerned with how to effectively improve the rhythm quality and naturalness of synthesized speech, whereas the present invention is concerned with how well test takers command rhythm in their actual spoken pronunciation. The invention focuses on test takers with higher spoken proficiency, that is, those whose content is relatively complete and whose pronunciation accuracy and fluency are relatively high, so that testing the rhythm level serves the further purpose of "selecting the best among the good".
The basic acoustic correlates of rhythm perception are fundamental frequency, duration and energy. From the standpoint of human perception, the rhythm level of a sentence or paragraph is usually judged from two aspects: intonation and cadence (timing). Intonation mainly reflects the "falling" and "rising" of the subjective auditory impression; at the acoustic-feature level, changes in intonation, tone and emotion are reflected by the variation of fundamental frequency over time. Cadence covers stress, pauses and speech-flow control. Stress mainly reflects the weighting, emphasis and accentuation of the expressed content in the subjective impression; English is a typical stress-timed language, and it is these stress changes that give it a strong sense of beat to the ear. Pauses mainly reflect the sense of staggered spacing in the subjective impression, and mostly arise at gaps between sense groups or at semantic or content transitions. Speech-flow control mainly reflects the speaker's overall grasp of how the durations of the individual segments are distributed across the whole utterance.
From the perspective of human-machine communication, spoken communication is in essence a process of encoding for the speaker and decoding for the listener, and rhythm testing is an important link in using a machine to decode speech automatically. From the perspective of speech psychology, however, there is so far no general understanding or consensus on why people perceive the rhythm of one speech segment as better than that of another. At present, much research on prosodic analysis is aimed at speech synthesis, while comparatively little is aimed at rhythm testing for educational measurement, although demand for such applications is growing.
Existing rhythm testing methods generally use simple fundamental frequency, duration and energy features directly. They neither process the features in depth nor introduce targeted multi-knowledge-source models associated with rhythm performance, rhythm generation and rhythm influence. Extensive research practice in educational measurement and pattern recognition shows that it is generally difficult to improve test performance further by using such basic prosodic features alone.
Summary of the invention
(1) Technical problem to be solved
To solve one or more of the above problems, the present invention provides a method for testing the rhythm level of spoken English that obtains better test results by refining and fusing information from multiple knowledge sources, improving the objectivity and accuracy of the test.
(2) Technical solution
According to one aspect of the present invention, a method for testing the rhythm level of spoken English is provided. The method comprises: step A, preprocessing an original English speech signal; step B, extracting multi-knowledge-source characteristic parameters for rhythm testing from the preprocessed original English speech signal, the multi-knowledge-source characteristic parameters comprising rhythm performance characteristics, rhythm generation characteristics and rhythm influence characteristics; and step C, obtaining a rhythm level test score for the original English speech from the multi-knowledge-source characteristic parameters.
Preferably, in the method for Oral English Practice rhythm level of the present invention, steps A comprises: steps A 1, and the original english voice signal to be carried out the efficient voice section detect, filter out noise section and long pause section keep effective voice segments signal; Steps A 2 is carried out the branch frame to the efficient voice segment signal and is handled; Steps A 3 uses speech recognition device that efficient voice segment signal and the corresponding text that carries out after the branch frame is handled alignd automatically, obtains the frontier point information of phoneme, syllable, word and sentence.
Preferably, in the method for Oral English Practice rhythm level of the present invention, in the steps A 2, be frame length with 25ms, 10ms is a frame period.
Preferably; In the method for Oral English Practice rhythm level of the present invention; In carrying out pretreated original english voice signal, extracting the rhythm performance characteristic be used for rhythm test among the step B comprises: step B1a, extract the fundamental frequency and the energy of each speech frame, and form fundamental frequency sequence and energy sequence; Calculate fundamental frequency mean value and fundamental frequency variance yields, the average energy of calculating energy sequence and the energy variance yields of fundamental frequency sequence.Fundamental frequency mean value, fundamental frequency variance yields, average energy, energy variance yields are tieed up prosodic features as 4; Step B1b; Extract each consonant section duration, each first segment duration, each syllable section duration and each word pause section duration; Calculate average duration of consonant section and consonant section duration variance respectively; Calculate the long and first segment duration variance of first segment mean time respectively, calculate average duration of syllable section and syllable section duration variance respectively, calculate average duration of word pause section and word pause section duration variance respectively.The average duration of consonant section, consonant section duration variance, first segment mean time length, first segment duration variance, the average duration of syllable section, syllable section duration variance, word are paused the average duration of section, word pause section duration variance as 8 dimension prosodic features; Step B1c, the 8 dimension prosodic features that 4 dimension prosodic features and the step B1b that step B1a is extracted extract are spliced into together, as tieing up prosodic features based on 12 of rhythm performance knowledge source.
Preferably; In the method for Oral English Practice rhythm level of the present invention; In carrying out pretreated original english voice signal, extracting the rhythm generation characteristic that is used for rhythm test among the step B comprises: step B2a, extract the fundamental frequency sequence of each speech frame, and this fundamental frequency sequence carried out robustness handle; Step B2b is an object with the fundamental frequency sequence after handling through robustness, extracts corresponding stressed parts, extracts and comes from the step number of reading parts again, average snap time, snap time variance, forms totally 3 dimension prosodic features; Step B2c is an object with the fundamental frequency sequence after handling through robustness, extracts corresponding benchmark fundamental frequency, extracts the benchmark fundamental frequency characteristic that comes from the benchmark fundamental frequency; Step B2d is an object with the fundamental frequency sequence after handling through robustness, extracts corresponding phrase parts, extracts the impulse number that comes from the phrase parts, average impulse amplitude, impulse amplitude variance, forms totally 3 dimension prosodic features; The 1 dimension benchmark fundamental frequency characteristic that step B2e, 3 dimension prosodic features, the step B2c that step B2b is extracted extract, the 3 dimension prosodic features that step B2d extracts splice, as the 7 dimension prosodic features that produce model based on the rhythm.
Preferably, in the method for Oral English Practice rhythm level of the present invention, among the step B2a fundamental frequency sequence that has extracted is carried out the robustness processing and comprise: the fundamental frequency sequence that has extracted is removed half frequency and frequency multiplication interference; Fundamental frequency sequence to removing after half frequency and frequency multiplication are disturbed is carried out smooth operation; Carry out stylization and handle carrying out fundamental frequency sequence after the smooth operation.
Preferably, in the method for Oral English Practice rhythm level of the present invention, step B2b comprises: the fundamental frequency sequence to handling through robustness is carried out high-pass filtering, utilizes gradient method to extract maximum value and the minimal value part that wherein curved transition is violent automatically; The quantity of the curvature acute variation part of the fundamental frequency sequence after the calculating high-pass filtering is as the step number characteristic of fundamental frequency sequence; The average duration and the variance of the curvature acute variation part of the fundamental frequency sequence after the calculating high-pass filtering change, as the average snap time and the snap time variance characteristic of fundamental frequency sequence; The step number that said extracted is gone out, average snap time, snap time variance are read 3 of parts again and are tieed up prosodic features as coming from.
Preferably, in the method for Oral English Practice rhythm level of the present invention, step B2d comprises: the fundamental frequency sequence after handling with step B2a deducts the benchmark fundamental frequency that step B2c extracts, and forms the fundamental frequency sequence curve of reflection phrase parts; Calculate the quantity of the curvature acute variation part in the fundamental frequency sequence that reflects the phrase parts, as the impulse number characteristic of fundamental frequency sequence; Calculate the average impulse amplitude and the amplitude variance of the curvature acute variation part in the fundamental frequency sequence that reflects the phrase parts, as the average impulse amplitude and the impulse amplitude variance characteristic of fundamental frequency sequence; The impulse number that said extracted is gone out, average impulse amplitude, impulse amplitude variance 3 are tieed up prosodic features as what come from the phrase parts.
Preferably; In the method for Oral English Practice rhythm level of the present invention; In carrying out pretreated original english voice signal, extracting the rhythm effect characteristics that is used for rhythm test among the step B comprises: step B3a; Extract consonant segment base PVI characteristic, vowel segment base PVI characteristic, syllable segment base PVI characteristic frequently frequently frequently by formula one, form totally 3 dimensions based on the prosodic features of fundamental frequency PVI; Step B3b extracts consonant section duration PVI characteristic, first segment duration PVI characteristic, syllable section duration PVI characteristic by formula one, and totally 3 dimensions are based on the prosodic features of duration PVI in formation, and wherein, the expression formula of formula one is:
Figure BDA0000087430050000041
The continuous speech section is divided into x kAnd x K+1Two parts, and represent the fundamental frequency value or the duration value of k and k+1 voice segments respectively, m represents the number of continuous speech section.The voice segments here can be consonant section, first segment or syllable section.
Preferably, in the method for Oral English Practice rhythm level of the present invention, also comprise before the step C: step C ' 1, collects the speech data sample as development set, and the speech data sample is carried out the mark of artificial rhythm mark; Step C ' 2, select rhythm mark match device model; Step C ' 3, with the prosodic features of each speech data sample, as the front end input parameter of rhythm mark match device model, with the artificial rhythm mark of each speech data sample, as the rear end output result of rhythm mark match device model; Step C ' 4, utilize the pairing model training algorithm of rhythm mark match device model, and training rhythm mark match device model obtains the model parameter of rhythm mark match device model.Said step C comprises: many knowledge sources characteristic parameter that said original english voice signal is corresponding is imported the rhythm mark match device after the training, thereby obtains the horizontal evaluation test mark of the rhythm of said original english voice.
Preferably, in the method for Oral English Practice rhythm level of the present invention, rhythm mark match device model is a kind of with in the drag: mixed Gauss model, support vector machine model, multilayer perceptron network model.
(3) Beneficial effects
The method of the present invention for testing the rhythm level of spoken English has the following beneficial effects:
1. In the present invention, the multi-knowledge-source features used for rhythm testing are obtained from three aspects: rhythm performance, rhythm generation and rhythm influence. Because prosodic information from multiple knowledge sources is fully used, the invention can effectively improve the accuracy and reliability of a rhythm test system.
2. With the present invention, a speech data and training sample library with manually annotated scores can be accumulated to cover differences in gender, age and region. Training the rhythm score fitter model on this library gives the test method good generalization across genders, ages and regions.
Description of drawings
Fig. 1 is an overall flow chart of the process of testing the rhythm level of spoken English according to an embodiment of the invention;
Fig. 2 is a flow chart of extracting the rhythm generation characteristics in the process of testing the rhythm level of spoken English according to an embodiment of the invention;
Fig. 3 is a flow chart of training the rhythm score fitter in the process of testing the rhythm level of spoken English according to an embodiment of the invention.
Embodiment
To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
Fig. 1 is an overall flow chart of the process of testing the rhythm level of spoken English according to an embodiment of the invention. As shown in Fig. 1, the steps of the process are:
Steps A ', obtain the primary speech signal that needs that the user reads carry out rhythm horizontal checkout.
Steps A is carried out pre-service to primary speech signal.
Steps A-1 is carried out the efficient voice section to raw tone and is detected (being called for short VAD detects), filters out noise section and long pause section, keeps the usefulness of voice segments as next step.
Steps A-2 is carried out the branch frame to the efficient voice segment signal and is handled, and preferably, is frame length with 25ms, and 10ms is a frame period, and re-treatment finishes until voice signal.
Steps A-3 uses speech recognition device that the efficient voice segment signal that carries out after the branch frame is handled is alignd automatically, obtains the signal boundary information of phoneme, syllable, word and sentence.
Need to prove, above-mentioned A-1, A-2, the execution of A-3 has permanent order, and its order cannot be upset or put upside down.
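The framing in step A-2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `split_into_frames` is hypothetical, the input is assumed to be a plain list of samples that has already passed VAD, and a trailing partial frame is simply dropped (the source does not say how the last frame is handled).

```python
def split_into_frames(samples, sample_rate, frame_ms=25, shift_ms=10):
    """Split a 1-D sample sequence into overlapping frames.

    frame_ms and shift_ms default to the 25 ms / 10 ms values named in step A-2.
    Assumption: a final frame shorter than frame_ms is discarded.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


# Usage: one second of (silent) audio sampled at 16 kHz.
frames = split_into_frames([0.0] * 16000, 16000)
print(len(frames), len(frames[0]))
```

At 16 kHz each frame holds 400 samples and consecutive frames start 160 samples apart, so neighboring frames overlap by 15 ms.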
Step B, extract the multi-knowledge-source prosodic features used for rhythm testing.
In view of the current state and shortcomings of rhythm testing technology, the present invention considers as many aspects as possible, including the performance, the generation and the influence of rhythm, and extracts effective and robust characterization parameters for each. It then uses a score fitter to simulate the way a human processes these cues, and further fuses the individual knowledge-source models to achieve an objective test of the rhythm level. Specifically, the multi-knowledge-source features used for rhythm testing are obtained from three aspects: rhythm performance, rhythm generation and rhythm influence. Many features can be derived from the three basic prosodic features, and there is so far no unified understanding of which of them are effective for rhythm testing.
In the early development stage, the present invention followed the idea of a greedy algorithm: a wide range of features was extracted and then screened in order to settle on the feature combination that helps rhythm testing the most. The features used in the embodiments that follow are all features that performed well in this screening.
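The source does not describe the screening procedure beyond "the idea of a greedy algorithm", so the following is only one plausible reading: greedy forward selection, which repeatedly adds the candidate feature that most improves a validation score and stops when no candidate helps. The function `greedy_select`, the `score` callback and the feature names in the usage lines are all hypothetical.

```python
def greedy_select(candidates, score, max_features=None):
    """Greedy forward selection (an assumed interpretation of the screening step).

    candidates: iterable of feature names.
    score: callable taking a list of feature names and returning a number,
           e.g. correlation between fitted and expert scores on held-out data.
    """
    selected = []
    best = score(selected)
    remaining = list(candidates)
    while remaining and (max_features is None or len(selected) < max_features):
        # Evaluate every remaining feature when added to the current set.
        trials = [(score(selected + [name]), name) for name in remaining]
        top_score, top_name = max(trials, key=lambda t: t[0])
        if top_score <= best:
            break  # no remaining feature improves the score
        selected.append(top_name)
        remaining.remove(top_name)
        best = top_score
    return selected


# Usage with a toy additive score; real use would retrain and evaluate a fitter.
toy_gain = {"f0_variance": 3, "pause_mean": 2, "noise_feature": -1}
print(greedy_select(toy_gain, lambda names: sum(toy_gain[n] for n in names)))
```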
Furthermore, all the extracted prosodic features are normalized separately by gender, with corresponding normalization at the word and sentence levels. None of the prosodic features used in the present invention requires manual annotation; all of them can be generated automatically by a computer program. In addition, the feature extraction procedures need not be carried out in any particular order; after all features have been extracted, they are merged into the prosodic feature set that is finally used. The feature extraction process is described in detail below:
Step B-1, extract the rhythm performance characteristics (prosodic features based on the rhythm performance knowledge source).
Prosodic features based on the rhythm performance knowledge source comprise the most basic fundamental frequency, duration and energy features, together with features derived from these three basic acoustic features. They reflect the learner's ability to organize, express and control language at the lexical and syntactic levels, and they are the features most widely used by researchers at present.
Step B-1-a, extract the fundamental frequency and energy of each speech frame to form a fundamental frequency sequence and an energy sequence, then compute the mean and variance of the fundamental frequency sequence and the mean and variance of the energy sequence. The fundamental frequency mean, fundamental frequency variance, energy mean and energy variance form 4-dimensional prosodic features.
Step B-1-b, extract the duration of each consonant segment, each vowel segment, each syllable segment and each inter-word pause segment, then compute the mean duration and the duration variance of the consonant segments, of the vowel segments, of the syllable segments and of the inter-word pause segments. These eight values (consonant mean duration and variance, vowel mean duration and variance, syllable mean duration and variance, inter-word pause mean duration and variance) form 8-dimensional prosodic features.
Step B-1-c, concatenate the 4-dimensional prosodic features extracted in step B-1-a with the 8-dimensional prosodic features extracted in step B-1-b to form 12-dimensional prosodic features based on the rhythm performance knowledge source.
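Steps B-1-a to B-1-c amount to computing a mean and a variance for six sequences and concatenating the results. The sketch below assumes the sequences have already been extracted (fundamental frequency over voiced frames, per-frame energy, and segment durations from the forced alignment); the function names and the choice of population variance are assumptions, since the source does not say which variance estimator is used.

```python
from statistics import fmean, pvariance


def mean_and_variance(values):
    """Return [mean, population variance] of a non-empty sequence."""
    return [fmean(values), pvariance(values)]


def performance_features(f0, energy, consonant_durs, vowel_durs,
                         syllable_durs, pause_durs):
    """Build the 12-dimensional rhythm performance vector.

    Order: F0 mean/var, energy mean/var (step B-1-a), then consonant, vowel,
    syllable and inter-word pause duration mean/var (step B-1-b).
    """
    features = []
    for sequence in (f0, energy, consonant_durs, vowel_durs,
                     syllable_durs, pause_durs):
        features.extend(mean_and_variance(sequence))
    return features


# Usage with made-up values (Hz, arbitrary energy units, seconds).
vector = performance_features(
    f0=[110.0, 120.0, 130.0], energy=[0.2, 0.4],
    consonant_durs=[0.05, 0.07], vowel_durs=[0.10, 0.14],
    syllable_durs=[0.18, 0.22], pause_durs=[0.30, 0.10],
)
print(len(vector))
```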
Step B-2, with reference to Fig. 2, extract the rhythm generation characteristics (prosodic features based on the rhythm generation model).
Prosodic features based on the rhythm generation knowledge source come from considering, in reverse, the rhythm generation model used in speech synthesis. In general, the fundamental frequency contour features that we extract are an observed phenomenon: the rhythm variation that appears in a person's speech after the rhythm model has acted. To mine this observation and obtain deeper knowledge about rhythm generation, one traces back the mechanism by which the rhythm is produced. It is generally held that the relationship between intonation and cadence within rhythm resembles the superposition of "large waves" and "ripples" and can be expressed as a simple algebraic sum: the two add when they are in phase and cancel when they are in opposite phase. The suprasegmental prosody model proposed by the Japanese scholar Fujisaki models this "large wave, small ripple" relationship in the fundamental frequency curve well, and it has good explanations in physiology, physics and acoustics.
The Fujisaki model holds that a seemingly irregular fundamental frequency curve can be composed of three different components, each of which can be explained by a physical property of the corresponding speech organ. These three components are the phrase component, the accent component and the baseline frequency component, which describe intonation, cadence and base pitch respectively. The goal of the present invention is to extract the characteristic parameters that correspond to these three components in order to obtain knowledge from the rhythm generation perspective.
Step B-2-a, extract the fundamental frequency of each speech frame to form a fundamental frequency sequence, and apply robustness processing to this sequence. The robustness processing comprises three steps: first, remove half-frequency and double-frequency errors; second, smooth the fundamental frequency sequence; third, stylize it.
Step B-2-b, extract the accent component parameters. High-pass filter the fundamental frequency sequence processed in step B-2-a, and use a gradient method to automatically extract the maxima and minima portions where the curvature changes sharply. Count the portions of sharp change as the step-count feature. Compute the mean duration and the duration variance of these portions as the mean step duration and step duration variance features. The step count, mean step duration and step duration variance form the 3-dimensional prosodic features derived from the accent component.
Step B-2-c, extract the baseline fundamental frequency. From the fundamental frequency sequence processed in step B-2-a, remove the high-frequency part extracted in step B-2-b to form a low-pass fundamental frequency sequence. Find the minimum of this low-pass sequence and take it as the baseline fundamental frequency feature.
Step B-2-d, extract the phrase component parameters. Subtract the baseline fundamental frequency extracted in step B-2-c from the fundamental frequency sequence processed in step B-2-a to form a fundamental frequency curve that reflects the phrase component. Count the portions of sharp change as the impulse-count feature. Compute the mean amplitude and the amplitude variance of these portions as the mean impulse amplitude and impulse amplitude variance features. The impulse count, mean impulse amplitude and impulse amplitude variance form the 3-dimensional prosodic features derived from the phrase component.
Step B-2-e, concatenate the 3-dimensional prosodic features extracted in step B-2-b, the 1-dimensional baseline fundamental frequency feature extracted in step B-2-c and the 3-dimensional prosodic features extracted in step B-2-d to form 7-dimensional prosodic features based on the rhythm generation model.
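Steps B-2-a to B-2-e can be approximated with very simple signal operations. The sketch below is a rough stand-in, not a Fujisaki-model fit and not the patent's exact procedure: the median-ratio rule for half/double-frequency errors, the moving average used as the low-pass filter (with the residual as the high-pass part), the first-difference threshold used in place of the "gradient method", and every default value are assumptions. Stylization is omitted, and the input is assumed to contain voiced frames only.

```python
from statistics import fmean, median, pvariance


def fix_octave_errors(f0, ratio=1.6):
    """Halve or double values far from the median (assumed heuristic)."""
    ref = median(f0)
    fixed = []
    for value in f0:
        if value > ref * ratio:
            value = value / 2
        elif value < ref / ratio:
            value = value * 2
        fixed.append(value)
    return fixed


def moving_average(seq, width):
    """Centered moving average; the window shrinks at the sequence edges."""
    half = width // 2
    return [fmean(seq[max(0, i - half): i + half + 1]) for i in range(len(seq))]


def sharp_regions(curve, threshold):
    """(start, end) index pairs of runs whose first difference exceeds threshold."""
    regions, start = [], None
    for i in range(1, len(curve)):
        steep = abs(curve[i] - curve[i - 1]) > threshold
        if steep and start is None:
            start = i - 1
        elif not steep and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(curve) - 1))
    return regions


def mean_var(values):
    return (fmean(values), pvariance(values)) if values else (0.0, 0.0)


def generation_features(f0, frame_shift_s=0.01, width=9, threshold=2.0):
    """Return [steps, step mean, step var, baseline, impulses, amp mean, amp var]."""
    f0 = moving_average(fix_octave_errors(f0), 3)        # B-2-a, no stylization
    low = moving_average(f0, width)                      # low-pass part
    high = [v - l for v, l in zip(f0, low)]              # high-pass part
    accent = sharp_regions(high, threshold)              # B-2-b
    durations = [(end - start) * frame_shift_s for start, end in accent]
    base = min(low)                                      # B-2-c
    phrase_curve = [v - base for v in f0]                # B-2-d
    phrase = sharp_regions(phrase_curve, threshold)
    amplitudes = [max(phrase_curve[s:e + 1]) for s, e in phrase]
    return [len(accent), *mean_var(durations), base,     # B-2-e: 3 + 1 + 3 dims
            len(phrase), *mean_var(amplitudes)]


# Usage: a contour that jumps from 100 Hz to 130 Hz halfway through.
print(generation_features([100.0] * 20 + [130.0] * 20))
```

A production system would more likely fit the phrase and accent commands of the Fujisaki model directly; the thresholding here only illustrates how counts, durations and amplitudes turn into a 7-dimensional vector.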
Step B-3, extract the rhythm influence characteristics.
Prosodic features based on the rhythm influence knowledge source are concerned with how strongly the language learner's English is tied to his or her mother tongue, that is, with how native the spoken English sounds. In general, people whose English sounds more native tend to show undulating, well-varied rhythm changes in their speech. Fundamental frequency and duration features, and especially the way fundamental frequency and duration vary, play a key role in how native English sounds. The PVI (Pairwise Variability Index) operator has achieved remarkable performance in distinguishing different languages. The present invention extends it to different segment levels for rhythm testing: PVI is computed over consecutive consonant segments, vowel segments and syllable segments respectively, to obtain the prosodic features based on the rhythm influence knowledge source.
Step B-3-a, extract the fundamental frequency PVI features of the consonant segments, vowel segments and syllable segments respectively, computed as follows:
$$\mathrm{PVI} = 100 \times \sum_{k=1}^{m-1} \left| \frac{p_k - p_{k+1}}{(p_k + p_{k+1})/2} \right| \Big/ (m-1)$$
In this formula, $p_k$ and $p_{k+1}$ denote the mean fundamental frequency of the $k$-th and $(k+1)$-th consecutive speech segments, and $m$ is the number of consecutive speech segments. The speech segments may be consonant segments, vowel segments or syllable segments.
Step B-3-b, extract the duration PVI features of the consonant segments, vowel segments and syllable segments respectively, computed as follows:
$$\mathrm{PVI} = 100 \times \sum_{k=1}^{m-1} \left| \frac{d_k - d_{k+1}}{(d_k + d_{k+1})/2} \right| \Big/ (m-1)$$
In this formula, $d_k$ and $d_{k+1}$ denote the duration of the $k$-th and $(k+1)$-th consecutive speech segments, and $m$ is the number of consecutive speech segments. The speech segments may be consonant segments, vowel segments or syllable segments.
Step B-3-c, merge the consonant-segment, vowel-segment and syllable-segment fundamental frequency PVI features extracted in step B-3-a with the consonant-segment, vowel-segment and syllable-segment duration PVI features extracted in step B-3-b to form 6-dimensional prosodic features based on the rhythm influence knowledge source.
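The PVI formula used in steps B-3-a and B-3-b translates directly into code. In the sketch below the function names and the dictionary layout are hypothetical, and all segment values are assumed to be positive (so the denominator is never zero); how fundamental frequency is assigned to unvoiced consonant segments is not specified in the source and is left to the caller.

```python
def pvi(values):
    """Pairwise Variability Index over consecutive segment values.

    values: per-segment mean F0 or duration, in utterance order.
    """
    if len(values) < 2:
        raise ValueError("PVI needs at least two consecutive segments")
    total = sum(abs((a - b) / ((a + b) / 2))
                for a, b in zip(values, values[1:]))
    return 100 * total / (len(values) - 1)


def influence_features(f0_means, durations):
    """Build the 6-dimensional vector of step B-3-c.

    Both arguments are dicts keyed by 'consonant', 'vowel' and 'syllable'.
    """
    kinds = ("consonant", "vowel", "syllable")
    return [pvi(f0_means[k]) for k in kinds] + [pvi(durations[k]) for k in kinds]


# Usage: identical neighbors give 0; values 1 and 3 differ by 100% of their mean.
print(pvi([2, 2, 2]), pvi([1, 3]))
```

Because each pairwise difference is divided by the pair's mean, the index is insensitive to overall speaking rate or pitch range, which is why it is usable across speakers.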
Step B-4, merge the prosodic features. The 12-dimensional prosodic features extracted in step B-1, the 7-dimensional prosodic features extracted in step B-2 and the 6-dimensional prosodic features extracted in step B-3 are merged into the final 25-dimensional prosodic features.
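The merge is a plain concatenation, 12 + 7 + 6 = 25 dimensions. A small sketch with a dimension check follows; the function name `merge_features` and the decision to raise an error on a size mismatch are assumptions added for illustration.

```python
def merge_features(performance, generation, influence):
    """Concatenate the three knowledge-source vectors into one 25-dim vector."""
    parts = (("performance", performance, 12),
             ("generation", generation, 7),
             ("influence", influence, 6))
    merged = []
    for name, vector, expected in parts:
        if len(vector) != expected:
            raise ValueError(f"{name} features: expected {expected}, got {len(vector)}")
        merged.extend(vector)
    return merged


# Usage with placeholder vectors of the right sizes.
print(len(merge_features([0.0] * 12, [0.0] * 7, [0.0] * 6)))
```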
Step C': train the score fitter.
To map features to scores, the present invention trains a fitter on development set data. The development set data carry rhythm rating scores annotated by experts. The features of each speech sample in the development set serve as the input of the fitter and the manually annotated score serves as its output; the fitter parameters are obtained by the fitter's training algorithm, which completes the training of the score fitter.
With reference to Fig. 3, the concrete steps for training the rhythm score fitter are:
Step C'-1: collect speech data as the development set, and manually annotate each speech sample with a rhythm score.
Step C'-2: select a suitable rhythm score fitter. The present invention does not limit the particular type; it can be one of the common classifier models, for example a Gaussian mixture model (GMM), a support vector machine (SVM), or a multilayer perceptron network (MLP).
Step C'-3: extract the prosodic features of each speech sample through step B and use them as the input parameters of the score fitter. Use the manually annotated rhythm score of each speech sample as the output of the rhythm score fitter.
Step C'-4: on the basis of step C'-3, use the corresponding model training algorithm to train the rhythm score fitter, finally obtaining the model parameters of the rhythm score fitter.
Step C: input the multi-knowledge-source feature parameters of the spoken English to be evaluated into the rhythm score fitter to obtain the rhythm level test score, which serves as the objective test result for the rhythm level of the tested speech sample.
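The train-then-score flow of steps C' and C can be sketched as follows. The patent names GMM, SVM and MLP as candidate fitters; to keep the example dependency-free, this sketch swaps in a plain linear model fitted by batch gradient descent, which is not one of the models named in the patent. The function names, the 2-dimensional toy features and the scores are all hypothetical.

```python
from typing import List, Sequence, Tuple

Model = Tuple[List[float], float]  # (weights, bias)


def train_fitter(features: Sequence[Sequence[float]],
                 scores: Sequence[float],
                 lr: float = 0.1,
                 epochs: int = 5000) -> Model:
    """Fit score ~ w.x + b by batch gradient descent on squared error."""
    n, d = len(features), len(features[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, scores):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        # Average the gradient over the development set and take one step.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_score(model: Model, x: Sequence[float]) -> float:
    """Step C: map one feature vector to a rhythm score."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# Toy development set: features (input) and expert scores (output).
dev_features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dev_scores = [0.5, 2.5, 1.5, 3.5]

model = train_fitter(dev_features, dev_scores)           # steps C'-3, C'-4
print(round(predict_score(model, [0.5, 0.5]), 3))        # step C
```

With the real 25-dimensional features, the inputs would likely need to be normalized before gradient descent, and a nonlinear fitter such as those listed in step C'-2 could replace the linear model without changing the surrounding flow.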
It should be noted that the above rhythm test steps apply at both the sentence level and the paragraph level; the specific test level is decided according to the actual situation.
The specific embodiments described above further explain the object, technical solution and beneficial effects of the present invention. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within its scope of protection.

Claims (11)

1. A method for testing the rhythm level of spoken English, characterized by comprising:
Step A: preprocessing an original English speech signal;
Step B: extracting, from the preprocessed original English speech signal, multi-knowledge-source feature parameters used for the rhythm test, the multi-knowledge-source feature parameters comprising rhythm performance features, rhythm generation features and rhythm influence features;
Step C: obtaining the rhythm level test score of the original English speech from the multi-knowledge-source feature parameters.
2. the method for test Oral English Practice rhythm level according to claim 1 is characterized in that said steps A comprises:
Steps A 1 is carried out the efficient voice section to the original english voice signal and is detected, and filter out noise section and long pause section keep effective voice segments signal;
Steps A 2 is carried out the branch frame to the efficient voice segment signal and is handled;
Steps A 3 uses speech recognition device that efficient voice segment signal and the corresponding text that carries out after the branch frame is handled alignd automatically, obtains the frontier point information of phoneme, syllable, word and sentence.
3. the method for test Oral English Practice rhythm level according to claim 2 is characterized in that, in the said steps A 2, is frame length with 25ms, and 10ms is a frame period.
4. the method for test Oral English Practice rhythm level according to claim 1 is characterized in that, in pretreated original english voice signal, extracts the rhythm performance characteristic that is used for rhythm test among the said step B and comprises:
Step B1a extracts the fundamental frequency and the energy of each speech frame, forms fundamental frequency sequence and energy sequence, calculates the fundamental frequency mean value and the fundamental frequency variance yields of fundamental frequency sequence, the average energy of calculating energy sequence and energy variance yields; Fundamental frequency mean value, fundamental frequency variance yields, average energy, energy variance yields are tieed up prosodic features as 4;
Step B1b; Extract each consonant section duration, each first segment duration, each syllable section duration and each word pause section duration; Calculate average duration of consonant section and consonant section duration variance respectively; Calculate the long and first segment duration variance of first segment mean time respectively, calculate average duration of syllable section and syllable section duration variance respectively, calculate average duration of word pause section and word pause section duration variance respectively; The average duration of consonant section, consonant section duration variance, first segment mean time length, first segment duration variance, the average duration of syllable section, syllable section duration variance, word are paused the average duration of section, word pause section duration variance as 8 dimension prosodic features;
Step B1c, the 8 dimension prosodic features that 4 dimension prosodic features and the step B1b that step B1a is extracted extract are spliced into together, as tieing up rhythm performance characteristics based on 12 of rhythm performance knowledge source.
5. The method for testing the rhythm level of spoken English according to claim 1, characterized in that, in step B, extracting the rhythm generation features used for the rhythm test from the preprocessed original English speech signal comprises:
Step B2a: extracting the fundamental frequency sequence of the speech frames and applying robustness processing to the fundamental frequency sequence;
Step B2b: taking the robustness-processed fundamental frequency sequence as the object, extracting the corresponding stress component, and extracting from the stress component the number of steps, the mean step time and the step time variance, forming 3-dimensional prosodic features in total;
Step B2c: taking the robustness-processed fundamental frequency sequence as the object, extracting the corresponding baseline fundamental frequency, and extracting the baseline fundamental frequency feature from the baseline fundamental frequency;
Step B2d: taking the robustness-processed fundamental frequency sequence as the object, extracting the corresponding phrase component, and extracting from the phrase component the number of impulses, the mean impulse amplitude and the impulse amplitude variance, forming 3-dimensional prosodic features in total;
Step B2e: concatenating the 3-dimensional prosodic features extracted in step B2b, the 1-dimensional baseline fundamental frequency feature extracted in step B2c and the 3-dimensional prosodic features extracted in step B2d, as the 7-dimensional rhythm generation features based on the rhythm generation model.
6. The method for testing the rhythm level of spoken English according to claim 5, characterized in that, in step B2a, applying robustness processing to the extracted fundamental frequency sequence comprises:
removing half-frequency and frequency-doubling interference from the extracted fundamental frequency sequence;
smoothing the fundamental frequency sequence from which the half-frequency and frequency-doubling interference has been removed;
stylizing the smoothed fundamental frequency sequence.
7. The method for testing the rhythm level of spoken English according to claim 5, characterized in that step B2b comprises:
applying high-pass filtering to the robustness-processed fundamental frequency sequence, and using a gradient method to automatically extract the maxima and minima where the curvature changes sharply;
counting the parts of the high-pass-filtered fundamental frequency sequence whose curvature changes sharply, as the number-of-steps feature of the fundamental frequency sequence;
calculating the mean duration and the duration variance of the parts of the high-pass-filtered fundamental frequency sequence whose curvature changes sharply, as the mean step time and step time variance features of the fundamental frequency sequence;
taking the extracted number of steps, mean step time and step time variance as the 3-dimensional rhythm generation features derived from the stress component.
8. The method for testing the rhythm level of spoken English according to claim 5, characterized in that step B2d comprises:
subtracting the baseline fundamental frequency extracted in step B2c from the fundamental frequency sequence processed in step B2a, forming a fundamental frequency curve that reflects the phrase component;
counting the parts with sharp curvature change in the fundamental frequency sequence reflecting the phrase component, as the number-of-impulses feature of the fundamental frequency sequence;
calculating the mean impulse amplitude and the amplitude variance of the parts with sharp curvature change in the fundamental frequency sequence reflecting the phrase component, as the mean impulse amplitude and impulse amplitude variance features of the fundamental frequency sequence;
taking the extracted number of impulses, mean impulse amplitude and impulse amplitude variance as the 3-dimensional rhythm generation features derived from the phrase component.
9. The method for testing the rhythm level of spoken English according to claim 1, characterized in that, in step B, extracting the rhythm influence features used for the rhythm test from the preprocessed original English speech signal comprises:
Step B3a: extracting, by formula one, the consonant-segment fundamental frequency PVI feature, the vowel-segment fundamental frequency PVI feature and the syllable-segment fundamental frequency PVI feature, forming 3-dimensional prosodic features based on the fundamental frequency PVI;
Step B3b: extracting, by formula one, the consonant-segment duration PVI feature, the vowel-segment duration PVI feature and the syllable-segment duration PVI feature, forming 3-dimensional rhythm influence features based on the duration PVI,
wherein formula one is:
$$\mathrm{PVI} = 100 \times \left[ \sum_{k=1}^{m-1} \left| \frac{x_k - x_{k+1}}{(x_k + x_{k+1})/2} \right| \right] \Big/ (m-1)$$
where $x_k$ and $x_{k+1}$ denote the fundamental frequency value or the duration value of the $k$-th and $(k+1)$-th speech segments respectively, and $m$ is the number of consecutive speech segments; the speech segments here are consonant segments, vowel segments or syllable segments.
10. The method for testing the rhythm level of spoken English according to claim 1, characterized in that,
before step C the method further comprises: step C'1, collecting speech data training samples as a development set, and manually annotating each speech data training sample with a rhythm score; step C'2, selecting a rhythm score fitter model; step C'3, taking the multi-knowledge-source feature parameters of each speech data training sample as the front-end input parameters of the rhythm score fitter model, and the manually annotated rhythm score of each speech data training sample as the back-end output of the rhythm score fitter model; step C'4, training the rhythm score fitter model with its corresponding model training algorithm to obtain the model parameters of the rhythm score fitter model;
step C comprises: inputting the multi-knowledge-source feature parameters corresponding to the original English speech signal into the trained rhythm score fitter, thereby obtaining the rhythm level test score of the original English speech.
11. the method for test Oral English Practice rhythm level according to claim 10 is characterized in that, said rhythm mark match device model is a kind of with in the drag: mixed Gauss model, support vector machine model, multilayer perceptron network model.
CN2011102527792A 2011-08-30 2011-08-30 Method for testing rhythm level of spoken English Active CN102426834B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011102527792A CN102426834B (en) 2011-08-30 2011-08-30 Method for testing rhythm level of spoken English

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011102527792A CN102426834B (en) 2011-08-30 2011-08-30 Method for testing rhythm level of spoken English

Publications (2)

Publication Number Publication Date
CN102426834A (en) 2012-04-25
CN102426834B (en) 2013-05-08

Family

ID=45960808

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011102527792A Active CN102426834B (en) 2011-08-30 2011-08-30 Method for testing rhythm level of spoken English

Country Status (1)

Country Link
CN (1) CN102426834B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101197084A (en) * 2007-11-06 2008-06-11 安徽科大讯飞信息科技股份有限公司 Automatic spoken English evaluating and learning system
CN101727903A (en) * 2008-10-29 2010-06-09 中国科学院自动化研究所 Pronunciation quality assessment and error detection method based on fusion of multiple characteristics and multiple systems

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104575518A (en) * 2013-10-17 2015-04-29 清华大学 Rhyme event detection method and device
CN104575518B (en) * 2013-10-17 2018-10-02 清华大学 Rhythm event detecting method and device
CN104732971B (en) * 2013-12-19 2019-07-30 Sap欧洲公司 Phoneme signature for speech recognition is candidate
CN104732971A (en) * 2013-12-19 2015-06-24 Sap欧洲公司 Phoneme signature candidates for speech recognition
CN104464751A (en) * 2014-11-21 2015-03-25 科大讯飞股份有限公司 Method and device for detecting pronunciation rhythm problem
CN104464751B (en) * 2014-11-21 2018-01-16 科大讯飞股份有限公司 The detection method and device for rhythm problem of pronouncing
CN104361896B (en) * 2014-12-04 2018-04-13 上海流利说信息技术有限公司 Voice quality assessment equipment, method and system
CN104361896A (en) * 2014-12-04 2015-02-18 上海流利说信息技术有限公司 Voice quality evaluation equipment, method and system
CN108206026A (en) * 2017-12-05 2018-06-26 北京小唱科技有限公司 Determine the method and device of audio content pitch deviation
CN110992986A (en) * 2019-12-04 2020-04-10 南京大学 Word syllable stress reading error detection method, device, electronic equipment and storage medium
CN110992986B (en) * 2019-12-04 2022-06-07 南京大学 Word syllable stress reading error detection method, device, electronic equipment and storage medium
CN111243625A (en) * 2020-01-03 2020-06-05 合肥讯飞数码科技有限公司 Method, device and equipment for testing definition of equipment and readable storage medium
CN111312231A (en) * 2020-05-14 2020-06-19 腾讯科技(深圳)有限公司 Audio detection method and device, electronic equipment and readable storage medium
CN111312231B (en) * 2020-05-14 2020-09-04 腾讯科技(深圳)有限公司 Audio detection method and device, electronic equipment and readable storage medium
CN112289298A (en) * 2020-09-30 2021-01-29 北京大米科技有限公司 Processing method and device for synthesized voice, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN102426834B (en) 2013-05-08

Similar Documents

Publication Publication Date Title
CN102426834B (en) Method for testing rhythm level of spoken English
Gobl et al. 11 voice source variation and its communicative functions
CN103928023B (en) A kind of speech assessment method and system
Kourkounakis et al. Fluentnet: End-to-end detection of stuttered speech disfluencies with deep learning
CN101751919B (en) Spoken Chinese stress automatic detection method
Yap Speech production under cognitive load: Effects and classification
CN104050965A (en) English phonetic pronunciation quality evaluation system with emotion recognition function and method thereof
CN102231278A (en) Method and system for realizing automatic addition of punctuation marks in speech recognition
CN106128450A (en) The bilingual method across language voice conversion and system thereof hidden in a kind of Chinese
US9489864B2 (en) Systems and methods for an automated pronunciation assessment system for similar vowel pairs
CN104123933A (en) Self-adaptive non-parallel training based voice conversion method
CN103366735B (en) The mapping method of speech data and device
Zhang et al. Using computer speech recognition technology to evaluate spoken English.
Kourkounakis et al. FluentNet: end-to-end detection of speech disfluency with deep learning
CN106856095A (en) The voice quality evaluating system that a kind of phonetic is combined into syllables
Dong Application of artificial intelligence software based on semantic web technology in english learning and teaching
CN102880906B (en) Chinese vowel pronunciation method based on DIVA nerve network model
Prom-on et al. Functional Modeling of Tone, Focus and Sentence Type in Mandarin Chinese.
Han et al. The modular design of an english pronunciation level evaluation system based on machine learning
CN202758611U (en) Speech data evaluation device
Ramteke et al. Text-To-Speech Synthesizer for English, Hindi and Marathi Spoken Signals
Sun et al. Unsupervised Inference of Physiologically Meaningful Articulatory Trajectories with VocalTractLab.
Li et al. English sentence pronunciation evaluation using rhythm and intonation
Kim et al. Estimation of the movement trajectories of non-crucial articulators based on the detection of crucial moments and physiological constraints.
Sun et al. Unsupervised Acoustic-to-Articulatory Inversion with Variable Vocal Tract Anatomy.

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant