Summary of the invention
(1) Technical problem to be solved
To solve one or more of the problems described above, the present invention provides a method for testing the prosody level of spoken English. The method refines and fuses information from multiple knowledge sources in order to obtain better test results and to improve the objectivity and accuracy of the test.
(2) Technical solution
According to one aspect of the present invention, a method for testing the prosody level of spoken English is provided. The method comprises: step A, preprocessing the original English speech signal; step B, extracting, from the preprocessed original English speech signal, multi-knowledge-source feature parameters for prosody testing, the multi-knowledge-source feature parameters comprising prosody performance features, prosody production features, and prosody influence features; and step C, obtaining the prosody level test score of the original English speech from the multi-knowledge-source feature parameters.
Preferably, in the method for testing the prosody level of spoken English of the present invention, step A comprises: step A1, performing valid speech segment detection on the original English speech signal, filtering out noise segments and long pause segments, and retaining the valid speech segment signal; step A2, dividing the valid speech segment signal into frames; and step A3, using a speech recognizer to automatically align the framed valid speech segment signal with the corresponding text, obtaining the boundary information of phonemes, syllables, words, and sentences.
Preferably, in the method for testing the prosody level of spoken English of the present invention, in step A2 the frame length is 25 ms and the frame shift is 10 ms.
Preferably, in the method for testing the prosody level of spoken English of the present invention, extracting in step B the prosody performance features for prosody testing from the preprocessed original English speech signal comprises: step B1a, extracting the fundamental frequency and energy of each speech frame to form a fundamental frequency sequence and an energy sequence, computing the mean and variance of the fundamental frequency sequence and the mean and variance of the energy sequence, and taking the fundamental frequency mean, fundamental frequency variance, energy mean, and energy variance as 4-dimensional prosodic features; step B1b, extracting the duration of each consonant segment, each vowel segment, each syllable segment, and each inter-word pause segment, computing the mean and variance of the consonant segment durations, of the vowel segment durations, of the syllable segment durations, and of the inter-word pause segment durations, and taking these eight values as 8-dimensional prosodic features; and step B1c, concatenating the 4-dimensional prosodic features extracted in step B1a with the 8-dimensional prosodic features extracted in step B1b to form the 12-dimensional prosodic features based on the prosody performance knowledge source.
Preferably, in the method for testing the prosody level of spoken English of the present invention, extracting in step B the prosody production features for prosody testing from the preprocessed original English speech signal comprises: step B2a, extracting the fundamental frequency of each speech frame to form a fundamental frequency sequence, and applying robustness processing to this sequence; step B2b, taking the robustness-processed fundamental frequency sequence as the object, extracting the corresponding accent component, and extracting from the accent component the step count, mean step duration, and step duration variance, forming 3-dimensional prosodic features; step B2c, taking the robustness-processed fundamental frequency sequence as the object, extracting the corresponding baseline fundamental frequency, and extracting the baseline fundamental frequency feature from it; step B2d, taking the robustness-processed fundamental frequency sequence as the object, extracting the corresponding phrase component, and extracting from the phrase component the impulse count, mean impulse amplitude, and impulse amplitude variance, forming 3-dimensional prosodic features; and step B2e, concatenating the 3-dimensional prosodic features extracted in step B2b, the 1-dimensional baseline fundamental frequency feature extracted in step B2c, and the 3-dimensional prosodic features extracted in step B2d to form the 7-dimensional prosodic features based on the prosody production model.
Preferably, in the method for testing the prosody level of spoken English of the present invention, applying robustness processing to the extracted fundamental frequency sequence in step B2a comprises: removing pitch-halving and pitch-doubling errors from the extracted fundamental frequency sequence; smoothing the fundamental frequency sequence from which the halving and doubling errors have been removed; and stylizing the smoothed fundamental frequency sequence.
Preferably, in the method for testing the prosody level of spoken English of the present invention, step B2b comprises: high-pass filtering the robustness-processed fundamental frequency sequence, and using a gradient method to automatically extract the maxima and minima where the curvature changes sharply; counting the sharply changing parts of the high-pass-filtered fundamental frequency sequence, as the step count feature of the fundamental frequency sequence; computing the mean and variance of the durations of the sharply changing parts of the high-pass-filtered fundamental frequency sequence, as the mean step duration and step duration variance features of the fundamental frequency sequence; and taking the extracted step count, mean step duration, and step duration variance as the 3-dimensional prosodic features derived from the accent component.
Preferably, in the method for testing the prosody level of spoken English of the present invention, step B2d comprises: subtracting the baseline fundamental frequency extracted in step B2c from the fundamental frequency sequence processed in step B2a, forming a fundamental frequency curve that reflects the phrase component; counting the sharply changing parts of the fundamental frequency sequence that reflects the phrase component, as the impulse count feature of the fundamental frequency sequence; computing the mean and variance of the impulse amplitudes of the sharply changing parts of the fundamental frequency sequence that reflects the phrase component, as the mean impulse amplitude and impulse amplitude variance features of the fundamental frequency sequence; and taking the extracted impulse count, mean impulse amplitude, and impulse amplitude variance as the 3-dimensional prosodic features derived from the phrase component.
Preferably, in the method for testing the prosody level of spoken English of the present invention, extracting in step B the prosody influence features for prosody testing from the preprocessed original English speech signal comprises: step B3a, extracting the consonant segment fundamental frequency PVI feature, the vowel segment fundamental frequency PVI feature, and the syllable segment fundamental frequency PVI feature according to Formula 1, forming 3-dimensional prosodic features based on fundamental frequency PVI; and step B3b, extracting the consonant segment duration PVI feature, the vowel segment duration PVI feature, and the syllable segment duration PVI feature according to Formula 1, forming 3-dimensional prosodic features based on duration PVI, where Formula 1 is expressed as:
In Formula 1, $x_k$ and $x_{k+1}$ denote the fundamental frequency value or duration value of the k-th and (k+1)-th speech segments of the continuous speech, respectively, and $m$ denotes the number of consecutive speech segments. A speech segment here may be a consonant segment, a vowel segment, or a syllable segment.
Preferably, in the method for testing the prosody level of spoken English of the present invention, the following steps precede step C: step C'1, collecting speech data samples as a development set, and manually annotating the speech data samples with prosody scores; step C'2, selecting a prosody score fitter model; step C'3, taking the prosodic features of each speech data sample as the front-end input parameters of the prosody score fitter model, and taking the manually annotated prosody score of each speech data sample as the back-end output of the prosody score fitter model; and step C'4, training the prosody score fitter model with the model training algorithm corresponding to it, obtaining the model parameters of the prosody score fitter model. Step C comprises: inputting the multi-knowledge-source feature parameters corresponding to the original English speech signal into the trained prosody score fitter, thereby obtaining the prosody level test score of the original English speech.
Preferably, in the method for testing the prosody level of spoken English of the present invention, the prosody score fitter model is one of the following models: a Gaussian mixture model, a support vector machine model, or a multilayer perceptron network model.
(3) Beneficial effects
The method of the present invention for testing the prosody level of spoken English has the following beneficial effects:
1. In the present invention, the multi-knowledge-source features used for prosody testing are obtained from three aspects: prosody performance, prosody production, and prosody influence. Because the prosodic information of multiple knowledge sources is fully used, the present invention can effectively improve the accuracy and reliability of a prosody testing system;
2. With the present invention, a library of speech data and learning samples with manually annotated scores can be accumulated to cover differences in gender, age, and region. Training the prosody score fitter model on this library allows the testing method of the present invention to generalize well across genders, ages, and regions.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below in conjunction with specific embodiments and with reference to the accompanying drawings.
Fig. 1 is an overall flow block diagram of the method for testing the prosody level of spoken English according to an embodiment of the present invention. As shown in Fig. 1, the steps of the method are:
Step A', acquiring the original speech signal, read aloud by the user, on which the prosody level test is to be performed.
Step A, preprocessing the original speech signal.
Step A-1, performing valid speech segment detection (voice activity detection, VAD for short) on the original speech, filtering out noise segments and long pause segments, and retaining the speech segments for use in the next step.
Step A-2, dividing the valid speech segment signal into frames, preferably with a frame length of 25 ms and a frame shift of 10 ms, and repeating until the end of the speech signal is reached.
Step A-3, using a speech recognizer to automatically align the framed valid speech segment signal with the corresponding text, obtaining the boundary information of phonemes, syllables, words, and sentences.
It should be noted that steps A-1, A-2, and A-3 must be executed in this fixed order; the order must not be rearranged or reversed.
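Steps A-1 and A-2 can be sketched as follows. This is a minimal illustration, not the detector used by the invention: the energy threshold, the 10 ms analysis window, and the 500 ms limit for a "long pause" are all assumed values, the function names are hypothetical, and noise removal is not attempted. Short pauses are deliberately kept, because inter-word pause durations are used as features later.

```python
def keep_speech(samples, sample_rate, threshold, window_ms=10, max_pause_ms=500):
    """Hypothetical energy VAD: drop low-energy runs longer than max_pause_ms."""
    win = int(sample_rate * window_ms / 1000)
    max_run = max_pause_ms // window_ms
    windows = [samples[i:i + win] for i in range(0, len(samples) - win + 1, win)]
    silent = [sum(x * x for x in w) / win < threshold for w in windows]
    kept, i = [], 0
    while i < len(windows):
        j = i
        while j < len(windows) and silent[j] == silent[i]:
            j += 1
        # Keep speech runs, and silent runs short enough to be ordinary pauses.
        if not silent[i] or (j - i) <= max_run:
            for w in windows[i:j]:
                kept.extend(w)
        i = j
    return kept


def frame_signal(samples, sample_rate, frame_ms=25, shift_ms=10):
    """Split samples into overlapping frames: 25 ms long, one every 10 ms."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames
```

At a 16 kHz sampling rate, the default arguments give frames of 400 samples taken every 160 samples.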
Step B, extracting the multi-knowledge-source prosodic features used for prosody testing.
In view of the current state and shortcomings of prosody testing technology, the present invention considers as many aspects as possible, including the performance of prosody, the production of prosody, and the influence of prosody, and extracts effective and robust characterization parameters for each. A score fitter is then used to simulate the human processing mechanism, and the individual knowledge-source models are further fused, so as to achieve an objective test of prosody level. Specifically, the multi-knowledge-source features used for prosody testing are obtained from three aspects: prosody performance, prosody production, and prosody influence. A wide variety of features can be derived from the three basic prosodic features, and so far there is no consensus on how to judge which features are effective for prosody testing.
In the early development stage, the present invention adopted a greedy-algorithm approach: a wide range of features was extracted and then screened, so as to settle on the feature combination that contributes most to prosody testing. The features involved in the following embodiments are all features retained after this screening.
In addition, all of the extracted prosodic features are normalized separately for each gender, and corresponding normalization is also performed at the word and sentence levels. None of the prosodic features involved in the present invention requires manual annotation; all can be generated automatically by a computer program. Furthermore, the feature extraction processes need not be performed in any particular order; after all features have been extracted, they are merged into the prosodic features that are finally used. The feature extraction process is described in detail below:
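The gender-wise normalization mentioned above is not specified further. One plausible reading, sketched here as an assumption, is a per-gender z-score whose means and standard deviations are estimated on development data. The function names and the `(gender, feature_vector)` row layout are hypothetical.

```python
import statistics


def fit_gender_stats(rows):
    """rows: list of (gender, feature_vector). Returns {gender: (means, stds)}."""
    by_gender = {}
    for gender, vec in rows:
        by_gender.setdefault(gender, []).append(vec)
    stats = {}
    for gender, vecs in by_gender.items():
        columns = list(zip(*vecs))
        means = [statistics.fmean(c) for c in columns]
        # pstdev is 0 for a constant column; fall back to 1.0 to avoid dividing by zero.
        stds = [statistics.pstdev(c) or 1.0 for c in columns]
        stats[gender] = (means, stds)
    return stats


def normalize(gender, vec, stats):
    """Z-score one feature vector with the statistics of the speaker's gender."""
    means, stds = stats[gender]
    return [(v - m) / s for v, m, s in zip(vec, means, stds)]
```

Normalizing within each gender removes the large difference in average fundamental frequency between male and female speakers before the features reach the score fitter.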
Step B-1, extracting the prosody performance features (prosodic features based on the prosody performance knowledge source).
The prosodic features based on the prosody performance knowledge source comprise the most basic fundamental frequency, duration, and energy features, together with features derived from these three basic acoustic features. These prosodic features reflect the learner's ability to organize, express, and control language at the lexical and syntactic levels, and they are also the features most widely used by researchers at present.
Step B-1-a, extracting the fundamental frequency and energy of each speech frame to form a fundamental frequency sequence and an energy sequence, and computing the mean and variance of the fundamental frequency sequence and the mean and variance of the energy sequence. The fundamental frequency mean, fundamental frequency variance, energy mean, and energy variance are taken as 4-dimensional prosodic features.
Step B-1-b, extracting the duration of each consonant segment, each vowel segment, each syllable segment, and each inter-word pause segment, and computing the mean and variance of the consonant segment durations, of the vowel segment durations, of the syllable segment durations, and of the inter-word pause segment durations. These eight values (mean and variance of the consonant, vowel, syllable, and inter-word pause segment durations) are taken as 8-dimensional prosodic features.
Step B-1-c, concatenating the 4-dimensional prosodic features extracted in step B-1-a with the 8-dimensional prosodic features extracted in step B-1-b to form the 12-dimensional prosodic features based on the prosody performance knowledge source.
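Steps B-1-a to B-1-c amount to computing means and variances and concatenating them. The sketch below assumes that the per-frame fundamental frequency and energy values and the segment durations (from the alignment in step A-3) are already available, that an F0 value of 0 marks an unvoiced frame, and that the population variance is intended; the text does not settle any of these points. The dictionary keys are hypothetical.

```python
import statistics


def mean_var(values):
    """Return (mean, population variance); (0.0, 0.0) for an empty list."""
    if not values:
        return 0.0, 0.0
    return statistics.fmean(values), statistics.pvariance(values)


def performance_features(f0, energy, durations):
    """f0, energy: per-frame sequences. durations: dict with the keys
    'consonant', 'vowel', 'syllable', 'pause', each a list of durations in seconds."""
    voiced = [v for v in f0 if v > 0]  # assumption: 0 marks an unvoiced frame
    feats = [*mean_var(voiced), *mean_var(energy)]  # 4 dims (step B-1-a)
    for key in ("consonant", "vowel", "syllable", "pause"):
        feats.extend(mean_var(durations[key]))  # 8 dims in total (step B-1-b)
    return feats  # 12 dims (step B-1-c)
```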
Step B-2, with reference to Fig. 2, extracting the prosody production features (prosodic features based on the prosody production model).
The prosodic features based on the prosody production knowledge source are obtained by considering, in reverse, the prosody production model used in speech synthesis. In general, the fundamental frequency contour that we extract is an observed phenomenon: the prosodic variation that results after a person's speech has passed through a prosody model. To mine this observation for deeper knowledge of prosody production, one traces back to the mechanism by which the prosody was produced. It is generally held that the relationship between intonation and rhythm in prosody resembles the superposition of "large waves" and "small ripples" and can be expressed as a simple algebraic sum: the two add together when in phase and cancel each other when in opposite phase. The suprasegmental prosody model proposed by the Japanese scholar Fujisaki models this "large wave, small ripple" relationship in the fundamental frequency contour well, and it is well explained in physiological, physical, and acoustic terms.
The Fujisaki model holds that a seemingly irregular fundamental frequency contour can be composed from the parameters of three different components, each of which can be explained by the physical properties of the corresponding vocal organs. The three prosodic components are the phrase component, the accent component, and the baseline frequency component, which describe intonation, rhythm, and base pitch, respectively. The aim of the present invention is to extract the feature parameters corresponding to these three components, so as to obtain knowledge from the perspective of prosody production.
Step B-2-a, extracting the fundamental frequency of each speech frame to form a fundamental frequency sequence, and applying robustness processing to this sequence. The robustness processing comprises three steps: first, removing pitch-halving and pitch-doubling errors; then, smoothing the fundamental frequency sequence; and finally, stylizing it.
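The first two robustness steps of step B-2-a might look like the sketch below. The specific rules are assumptions, since the text names the steps without giving the procedure: a frame is treated as a doubling or halving error when its ratio to the previous voiced frame is close to 2 or 0.5, smoothing is done with a 3-point median filter, and an F0 value of 0 marks an unvoiced frame. Stylization is omitted.

```python
import statistics


def fix_octave_errors(f0, ratio_tol=0.2):
    """Correct frames whose F0 is close to double or half the previous voiced value."""
    out = list(f0)
    prev = None
    for i, v in enumerate(out):
        if v <= 0:  # assumption: 0 marks an unvoiced frame
            continue
        if prev is not None:
            ratio = v / prev
            if abs(ratio - 2.0) <= ratio_tol:
                v = v / 2.0  # doubling error
            elif abs(ratio - 0.5) <= ratio_tol / 2:
                v = v * 2.0  # halving error
            out[i] = v
        prev = v
    return out


def median_smooth(f0, width=3):
    """Median-filter voiced frames; unvoiced frames (0) are left unchanged."""
    half = width // 2
    out = list(f0)
    for i, v in enumerate(f0):
        if v <= 0:
            continue
        window = [x for x in f0[max(0, i - half):i + half + 1] if x > 0]
        out[i] = statistics.median(window)
    return out
```

A median filter is chosen here because it removes isolated spikes without blurring genuine pitch movements as much as a moving average would.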
Step B-2-b, extracting the accent component parameters. The fundamental frequency sequence processed in step B-2-a is high-pass filtered, and a gradient method is used to automatically extract the maxima and minima where the curvature changes sharply. The number of sharply changing parts is counted and taken as the step count feature. The mean and variance of the durations of the sharply changing parts are computed and taken as the mean step duration and step duration variance features. The step count, mean step duration, and step duration variance are taken as the 3-dimensional prosodic features derived from the accent component.
Step B-2-c, extracting the baseline fundamental frequency. From the fundamental frequency sequence processed in step B-2-a, the high-frequency part extracted in step B-2-b is removed to form a low-pass fundamental frequency sequence. The minimum point of this low-pass fundamental frequency sequence is found and taken as the baseline fundamental frequency feature.
Step B-2-d, extracting the phrase component parameters. The baseline fundamental frequency extracted in step B-2-c is subtracted from the fundamental frequency sequence processed in step B-2-a, forming a fundamental frequency curve that reflects the phrase component. The number of sharply changing parts is counted and taken as the impulse count feature. The mean and variance of the impulse amplitudes of the sharply changing parts are computed and taken as the mean impulse amplitude and impulse amplitude variance features. The impulse count, mean impulse amplitude, and impulse amplitude variance are taken as the 3-dimensional prosodic features derived from the phrase component.
Step B-2-e, concatenating the 3-dimensional prosodic features extracted in step B-2-b, the 1-dimensional baseline fundamental frequency feature extracted in step B-2-c, and the 3-dimensional prosodic features extracted in step B-2-d to form the 7-dimensional prosodic features based on the prosody production model.
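Steps B-2-b to B-2-e can be approximated very roughly as below. This is a crude sketch and not a Fujisaki model analysis: the high-pass part is taken as the residual after a 5-point moving average, a "sharply changing part" is any contiguous run whose frame-to-frame change exceeds an assumed threshold, and the F0 sequence is assumed to be non-empty and continuous (unvoiced gaps already filled by stylization). All names and default values are hypothetical.

```python
import statistics

FRAME_SHIFT_S = 0.010  # 10 ms frame shift, as in step A-2


def moving_average(seq, width=5):
    half = width // 2
    return [statistics.fmean(seq[max(0, i - half):i + half + 1]) for i in range(len(seq))]


def sharp_runs(seq, grad_threshold):
    """Return (start, end) index pairs of runs where |frame-to-frame change| > threshold."""
    runs, start = [], None
    for i in range(1, len(seq)):
        steep = abs(seq[i] - seq[i - 1]) > grad_threshold
        if steep and start is None:
            start = i - 1
        elif not steep and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(seq) - 1))
    return runs


def mean_var(values):
    if not values:
        return 0.0, 0.0
    return statistics.fmean(values), statistics.pvariance(values)


def production_features(f0, grad_threshold=5.0):
    low = moving_average(f0)                     # low-pass part
    high = [a - b for a, b in zip(f0, low)]      # high-pass residual (step B-2-b)
    accent = sharp_runs(high, grad_threshold)
    step_durs = [(e - s) * FRAME_SHIFT_S for s, e in accent]
    baseline = min(low)                          # step B-2-c
    phrase_curve = [v - baseline for v in f0]    # step B-2-d
    phrase = sharp_runs(phrase_curve, grad_threshold)
    amps = [max(phrase_curve[s:e + 1]) - min(phrase_curve[s:e + 1]) for s, e in phrase]
    # 3 accent dims + 1 baseline dim + 3 phrase dims = 7 dims (step B-2-e)
    return [len(accent), *mean_var(step_durs), baseline, len(phrase), *mean_var(amps)]
```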
Step B-3, extracting the prosody influence features.
The prosodic features based on the prosody influence knowledge source concern the degree of association between a language learner's English and his or her native language, that is, the nativeness of the spoken English. In general, people with a good command of native-like English show well-modulated and varied prosody in their speech. Fundamental frequency and duration features, and especially the variation of fundamental frequency and duration, play a key role in conveying nativeness. The PVI (Pairwise Variability Index) operator has performed remarkably well in distinguishing between languages; the present invention extends it to different segment levels for prosody testing, computing the PVI at three segment levels (consecutive consonants, vowels, and syllables) to obtain the prosodic features based on the prosody influence knowledge source.
Step B-3-a, extracting the fundamental frequency PVI features of the consonant segments, vowel segments, and syllable segments, respectively, computed as follows:
In this formula, $p_k$ and $p_{k+1}$ denote the mean fundamental frequency of the k-th and (k+1)-th speech segments of the continuous speech, respectively, and $m$ denotes the number of consecutive speech segments. A speech segment here may be a consonant segment, a vowel segment, or a syllable segment.
Step B-3-b, extracting the duration PVI features of the consonant segments, vowel segments, and syllable segments, respectively, computed as follows:
In this formula, $d_k$ and $d_{k+1}$ denote the duration of the k-th and (k+1)-th speech segments of the continuous speech, respectively, and $m$ denotes the number of consecutive speech segments. A speech segment here may be a consonant segment, a vowel segment, or a syllable segment.
Step B-3-c, merging the consonant segment, vowel segment, and syllable segment fundamental frequency PVI features extracted in step B-3-a with the consonant segment, vowel segment, and syllable segment duration PVI features extracted in step B-3-b to form the 6-dimensional prosodic features based on the prosody influence knowledge source.
Step B-4, merging the prosodic features. The 12-dimensional prosodic features extracted in step B-1, the 7-dimensional prosodic features extracted in step B-2, and the 6-dimensional prosodic features extracted in step B-3 are merged into the final 25-dimensional prosodic features.
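A PVI computation over consecutive segments can be sketched as follows. The exact variant used in steps B-3-a and B-3-b is not recoverable from this text, so the sketch assumes the widely used normalized form, in which each absolute difference between neighbouring segments is divided by their mean, the results are averaged over the m - 1 pairs, and the average is scaled by 100. The raw form simply omits the division by the pair mean. The function names and dictionary keys are hypothetical.

```python
def npvi(values):
    """Normalized pairwise variability index over consecutive segment values."""
    m = len(values)
    if m < 2:
        return 0.0
    total = 0.0
    for a, b in zip(values, values[1:]):
        mean_ab = (a + b) / 2.0
        if mean_ab > 0:  # skip degenerate pairs to avoid dividing by zero
            total += abs(a - b) / mean_ab
    return 100.0 * total / (m - 1)


def influence_features(f0_means, durations):
    """Both arguments: dicts keyed by 'consonant', 'vowel', 'syllable'.
    Returns the 6 dims of step B-3-c: 3 F0 PVIs followed by 3 duration PVIs."""
    keys = ("consonant", "vowel", "syllable")
    return [npvi(f0_means[k]) for k in keys] + [npvi(durations[k]) for k in keys]
```

A perfectly even sequence gives a PVI of 0, and the value grows as neighbouring segments differ more, which is why the index is sensitive to the alternation of long and short (or high and low) segments.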
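The merge itself is a plain concatenation. A small sketch with a dimension check (12 + 7 + 6 = 25) is given below; the function name and the choice to raise an error on a mismatch are illustrative assumptions.

```python
def merge_features(performance, production, influence):
    """Concatenate the three knowledge-source vectors into one 25-dimensional vector."""
    names = ("performance", "production", "influence")
    vectors = (performance, production, influence)
    expected = (12, 7, 6)
    for name, vec, dim in zip(names, vectors, expected):
        if len(vec) != dim:
            raise ValueError(f"{name} features: expected {dim} dimensions, got {len(vec)}")
    return [*performance, *production, *influence]
```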
Step C', training the score fitter.
For the mapping from features to scores, the present invention trains the fitter on development set data. The development set data carry prosody rating scores annotated by experts. The features of each speech sample in the development set serve as the input of the fitter and the manually annotated score serves as its output; the fitter parameters are obtained by the fitter training algorithm, which completes the training of the score fitter.
With reference to Fig. 3, the specific steps of training the prosody score fitter are:
Step C'-1, collecting speech data as a development set, and manually annotating the speech samples with prosody scores.
Step C'-2, selecting a suitable prosody score fitter. The present invention does not limit the particular type; it may be one of the common classifier models, for example a Gaussian mixture model (GMM), a support vector machine (SVM), or a multilayer perceptron network (MLP).
Step C'-3, extracting the prosodic features of each speech sample through step B and using them as the input parameters of the score fitter, and using the manually annotated prosody score of each speech sample as the output of the prosody score fitter.
Step C'-4, on the basis of step C'-3, training the prosody score fitter with the corresponding model training algorithm, finally obtaining the model parameters of the prosody score fitter.
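The training loop of steps C'-3 and C'-4 can be illustrated with a deliberately simple stand-in. The sketch below fits a linear model by batch gradient descent; it is not one of the GMM, SVM, or MLP models named above and is swapped in only to keep the example self-contained. The function names, learning rate, and epoch count are assumptions, and the features are assumed to be normalized beforehand so that gradient descent converges.

```python
def train_linear_fitter(features, scores, lr=0.01, epochs=2000):
    """Fit score ~ w.x + b by batch gradient descent on the squared error."""
    n, dim = len(features), len(features[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, scores):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            grad_b += err
            for j in range(dim):
                grad_w[j] += err * x[j]
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_score(model, x):
    """Map one feature vector to a prosody score with the trained parameters."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b
```

In use, `features` would hold the 25-dimensional vectors of the development set and `scores` the expert ratings; `predict_score` then corresponds to applying the trained fitter to a new utterance.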
Step C, inputting the multi-knowledge-source feature parameters of the spoken English to be evaluated into the prosody score fitter to obtain the prosody level test score, which serves as the objective test result for the prosody level of the tested speech sample.
It should be noted that the above prosody testing steps are applicable at both the sentence level and the paragraph level; the specific test level is determined according to the actual situation.
The specific embodiments described above further explain the objects, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.