CN106531189A - Intelligent spoken language evaluation method - Google Patents


Info

Publication number
CN106531189A
CN106531189A (application number CN201611181451.5A)
Authority
CN
China
Prior art keywords
feature
section
vowel
consonant
spectrum energy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611181451.5A
Other languages
Chinese (zh)
Inventor
潘奕君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201611181451.5A
Publication of CN106531189A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The present invention provides an intelligent spoken language evaluation method. A computer's recording device acquires the user's spoken voice data, and the user's speech features are extracted from it. The user's speech features are aligned with standard speech features: the vowels and consonants in the user's speech features are compared with the corresponding vowels and consonants in the standard speech features to form comparison data. The comparison data is scored, and both the comparison data and the scoring result are stored in a database. By comparing his or her speech with the standard speech, the user can find the words that are pronounced inaccurately. This brings convenience to language learners, improves the efficiency of foreign language learning, and increases the user's interest in learning.

Description

Intelligent spoken language evaluation method
Technical field
The present invention relates to the field of language communication, and more particularly to an intelligent spoken language evaluation method.
Background art
With the development of global economic integration, English plays an increasingly important role as an international language. Activities such as business exchange, cultural exchange, and cross-border tourism grow ever more frequent, and more and more people need to learn a foreign language, so improving oral communicative competence has become an urgent demand of foreign language learning. How to improve the results of foreign language learning and better meet users' needs has become a problem demanding a prompt solution.
Summary of the invention
In order to overcome the above deficiencies of the prior art, it is an object of the present invention to provide an intelligent spoken language evaluation method. The method comprises:
S1: acquiring the user's spoken voice data with the recording device of a computer, and extracting the user's speech features from the voice data;
S2: aligning the user's speech features with the standard speech features, and comparing the vowels and consonants in the user's speech features with the corresponding vowels and consonants in the standard speech features to obtain comparison data;
S3: scoring the comparison data;
S4: storing the comparison data and the scoring result in a database.
Preferably, before step S1 the method further comprises: setting a standard reading text, and acquiring the standard speech features of the standard reading text;
segmenting the standard speech features in time into n segments, with 20 ms as one time slice;
dividing the standard speech features of each time period into static features and dynamic features;
decomposing the spectrum energy of the standard speech features of each time period into the spectrum energy distribution of the vowel segments and the spectrum energy distribution of the consonant segments;
setting the vowel-segment MFCC feature vectors and the consonant-segment MFCC feature vectors of the standard speech features in each time period;
storing the vowel-segment and consonant-segment MFCC feature vectors of each time period into the database, as sketched below.
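The patent only states that the per-period MFCC vectors are stored in a database; the following is a minimal Python sketch of one possible storage step, assuming SQLite and NumPy. The table layout and the function name store_standard_features are illustrative assumptions, not prescribed by the patent.

import sqlite3
import numpy as np

def store_standard_features(db_path, text_id, vowel_mfcc, consonant_mfcc):
    """Store per-time-period vowel/consonant MFCC feature vectors.

    vowel_mfcc, consonant_mfcc: lists of 1-D NumPy arrays, one per 20 ms slice.
    The schema below is illustrative only.
    """
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS std_features (
                       text_id TEXT, period INTEGER, kind TEXT, mfcc BLOB)""")
    for period, (v, c) in enumerate(zip(vowel_mfcc, consonant_mfcc)):
        con.execute("INSERT INTO std_features VALUES (?, ?, 'vowel', ?)",
                    (text_id, period, v.astype(np.float64).tobytes()))
        con.execute("INSERT INTO std_features VALUES (?, ?, 'consonant', ?)",
                    (text_id, period, c.astype(np.float64).tobytes()))
    con.commit()
    con.close()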
Preferably, step S1 further comprises:
S11: segmenting the user voice data in time into n segments, with 20 ms as one time slice, and applying a rectangular window or a Hamming window to the user voice data of each time period to obtain the windowed speech signal X_n, where n is the segment index;
S12: applying a short-time Fourier transform to the windowed speech signal X_n, converting the time-domain signal into the frequency-domain signal Y_n, and computing its short-time energy spectrum Q_n = |Y_n|²;
S13: feeding the short-time energy spectrum Q_n from the vector space S into a bank of band-pass filters in first-in-first-out order; since the components within each frequency band act on the human ear additively, the energies within each filter band are summed, yielding the output power spectrum x'(k) of the k-th filter;
S14: taking the logarithm of each filter output to obtain the log power spectrum of the band, then applying an inverse discrete cosine transform to obtain M MFCC coefficients, where M typically takes 13 to 15; the MFCC coefficients are C_n = Σ_{k=1}^{M} log x'(k) · cos((2k+1)nπ/(2M));
S15: taking the user-speech MFCC features of each time period as the static features, then taking the first- and second-order differences of the static features to obtain the corresponding dynamic features; a NumPy sketch of these steps follows below.
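For concreteness, a minimal NumPy sketch of steps S11-S15 under common assumptions: the patent fixes only the 20 ms slice and M ≈ 13-15, while the 16 kHz sampling rate, the 26 mel-spaced filters, the 26-filter/13-coefficient split, and all function and variable names are assumptions made for illustration.

import numpy as np

def mfcc_features(signal, sr=16000, frame_ms=20, n_filters=26, n_coeffs=13):
    """Sketch of S11-S15: frame, window, STFT, filter bank, log, DCT, deltas."""
    frame_len = int(sr * frame_ms / 1000)            # S11: 20 ms slices
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    frames = frames * np.hamming(frame_len)          # Hamming window -> X_n

    Y = np.fft.rfft(frames, axis=1)                  # S12: short-time Fourier transform
    Q = np.abs(Y) ** 2                               # short-time energy spectrum Q_n = |Y_n|^2

    # S13: mel-spaced triangular band-pass filters; in-band energies are summed
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    bins = np.floor((frame_len + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_filters, Q.shape[1]))
    for k in range(n_filters):
        l, c, r = bins[k], bins[k + 1], bins[k + 2]
        fbank[k, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[k, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    xk = Q @ fbank.T                                 # x'(k): power in each filter band

    # S14: log power spectrum, then cosine transform -> MFCC coefficients
    logx = np.log(xk + 1e-10)
    n = np.arange(n_coeffs)[:, None]
    k = np.arange(n_filters)[None, :]
    dct = np.cos((2 * k + 1) * n * np.pi / (2 * n_filters))
    static = logx @ dct.T                            # S15: static features

    delta = np.diff(static, axis=0, prepend=static[:1])   # first-order difference
    delta2 = np.diff(delta, axis=0, prepend=delta[:1])    # second-order difference
    return static, delta, delta2                     # static + dynamic features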
Preferably, step S1 further comprises:
obtaining the spectrum energy h(f_k) of each frequency bin of each voice segment and, given the upper frequency limit k1 and lower frequency limit k2 of the voice segment, obtaining the spectrum energy ratio PN_n within the segment, PN_n = (Σ_{k=k1}^{k2} h(f_k) / Σ_k h(f_k)) × 100%.
Preferably, step S1 further comprises:
if the spectrum energy h(f_k) in a voice segment is greater than or equal to a first threshold and the spectrum energy ratio PN_n of the segment is greater than or equal to a second threshold, the segment is judged to be a vowel segment; the first threshold takes 0.1 to 0.5 and the second threshold 60% to 85%;
taking the spectrum energy of the vowel segment as reference, judging whether the zero-crossing rate of the spectrum energy before the vowel segment exceeds a third threshold; if so, that spectrum energy is concluded to be a consonant segment before the vowel; the third threshold takes 100;
taking the spectrum energy of the vowel segment as reference, judging whether the zero-crossing rate of the spectrum energy after the vowel segment exceeds the third threshold; if so, that spectrum energy is judged to be a consonant after the vowel;
if the zero-crossing rate of the spectrum energy after the vowel segment exceeds the third threshold and that spectrum energy is the last frame of the voice segment, it is judged to be a nasal coda consonant. A simplified sketch of this decision follows below.
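A simplified per-frame Python sketch of the vowel/consonant decision, assuming the frames and per-frame spectrum energies come from the S11/S12 steps above. The threshold defaults are mid-range picks from the stated intervals (0.1-0.5, 60%-85%, 100); the patent does not specify the units of the zero-crossing-rate threshold, and all names are illustrative.

import numpy as np

def classify_segments(frames, h, k1, k2, t1=0.3, t2=0.75, t3=100):
    """frames: time-domain frames; h: per-frame spectrum energies (frames x bins).
    A frame whose energy and in-band ratio PN_n clear the first two thresholds
    is a vowel; neighbouring high-zero-crossing-rate frames are consonants."""
    n_frames = len(frames)
    energy = h.sum(axis=1)                                  # spectrum energy per frame
    pn = h[:, k1:k2 + 1].sum(axis=1) / np.maximum(energy, 1e-12)  # PN_n
    # zero-crossing count per frame (scaling is illustrative)
    zcr = np.array([np.sum(np.abs(np.diff(np.sign(f)))) / 2 for f in frames])

    labels = ['-'] * n_frames
    for i in range(n_frames):
        if energy[i] >= t1 and pn[i] >= t2:                 # first + second threshold
            labels[i] = 'vowel'
    for i, lab in enumerate(labels):
        if lab != 'vowel':
            continue
        if i > 0 and labels[i - 1] != 'vowel' and zcr[i - 1] > t3:
            labels[i - 1] = 'consonant-before'              # consonant preceding the vowel
        if i < n_frames - 1 and labels[i + 1] != 'vowel' and zcr[i + 1] > t3:
            # high-ZCR frame after the vowel; if it is also the last frame of
            # the segment, it is judged a nasal coda consonant
            labels[i + 1] = ('nasal-coda' if i + 1 == n_frames - 1
                             else 'consonant-after')
    return labels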
Preferably, step S2 further comprises:
setting the vowel-segment MFCC feature vectors and the consonant-segment MFCC feature vectors of the user's speech features in each time period;
using the DTW algorithm to obtain the minimum-error alignment path and the corresponding DTW distance;
based on the alignment path and the corresponding DTW distance, comparing the vowel-segment MFCC feature vectors of the user's speech features in each time period with the vowel-segment MFCC feature vectors of the standard speech features in the same time period, and comparing the consonant-segment MFCC feature vectors of the user's speech features with the consonant-segment MFCC feature vectors of the standard speech features, to obtain the pronunciation difference between the user's speech features and the standard speech features.
Preferably, step S2 further comprises:
setting the vowel-segment standard speech feature vector of each time period as P1 = [p1(1), p1(2), ..., p1(R)], with first-order difference vector PΔ1 = [pΔ1(1), pΔ1(2), ..., pΔ1(R)] (R is the vowel speech length of the standard speech features), where PΔ1(n) = |p1(n) - p1(n-1)|, n = 1, 2, ..., R, and p1(0) = 0;
setting the consonant-segment standard speech feature vector of each time period as P'1 = [p'1(1), p'1(2), ..., p'1(R)], with first-order difference vector P'Δ1 = [p'Δ1(1), p'Δ1(2), ..., p'Δ1(R)] (R is the speech length of the standard speech features), where P'Δ1(n) = |p'1(n) - p'1(n-1)|, n = 1, 2, ..., R, and p'1(0) = 0.
Preferably, step S2 further comprises:
setting the vowel-segment feature vector of the user's speech features in each time period as P2 = [p2(1), p2(2), ..., p2(T)], with first-order difference vector PΔ2 = [pΔ2(1), pΔ2(2), ..., pΔ2(T)] (T is the length of the speech to be evaluated), where PΔ2(n) = |p2(n) - p2(n-1)|, n = 1, 2, ..., T, and p2(0) = 0;
setting the consonant-segment feature vector of the user's speech features in each time period as P'2 = [p'2(1), p'2(2), ..., p'2(T)], with first-order difference vector P'Δ2 = [p'Δ2(1), p'Δ2(2), ..., p'Δ2(T)] (T is the length of the speech to be evaluated), where P'Δ2(n) = |p'2(n) - p'2(n-1)|, n = 1, 2, ..., T, and p'2(0) = 0;
using the DTW algorithm to obtain the minimum-error alignment path, and comparing the vowel segments and consonant segments within each time period along that path;
the comparison yields the gap d_p of the vowel segment and the gap Δd_p of its variation, and the gap d'_p of the consonant segment and the gap Δd'_p of its variation, giving the similarity between the user's speech features and the standard speech features, namely:
d_p = |p1(n) - p2(m)|
d'_p = |p'1(n) - p'2(m)|
Δd_p = |Δp1(n) - Δp2(m)|
Δd'_p = |Δp'1(n) - Δp'2(m)|
where Δp_i(n) = |p_i(n) - p_i(n-1)| and Δp'_i(n) = |p'_i(n) - p'_i(n-1)|. A sketch of this DTW comparison is given below.
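The patent names DTW but gives no implementation. Below is a classic textbook dynamic-time-warping over 1-D feature tracks, plus the per-aligned-pair gaps d_p and Δd_p defined above; the function names dtw_path and segment_gaps are illustrative.

import numpy as np

def dtw_path(a, b):
    """Return the minimum-error alignment path between two 1-D sequences
    and the total DTW distance (absolute-difference local cost)."""
    R, T = len(a), len(b)
    D = np.full((R + 1, T + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, R + 1):
        for j in range(1, T + 1):
            D[i, j] = abs(a[i - 1] - b[j - 1]) + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    path, i, j = [], R, T                            # backtrack the optimal path
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0: i, j = i - 1, j - 1
        elif step == 1: i -= 1
        else: j -= 1
    return path[::-1], D[R, T]

def segment_gaps(p1, p2):
    """Per-aligned-pair gaps d_p = |p1(n) - p2(m)| and variation gaps
    Δd_p = |Δp1(n) - Δp2(m)|, with Δp(n) = |p(n) - p(n-1)| and p(0) = 0."""
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    d1 = np.abs(np.diff(p1, prepend=0.0))            # first-order difference vectors
    d2 = np.abs(np.diff(p2, prepend=0.0))
    path, dist = dtw_path(p1, p2)
    d = [abs(p1[n] - p2[m]) for n, m in path]        # d_p along the path
    dd = [abs(d1[n] - d2[m]) for n, m in path]       # Δd_p along the path
    return d, dd, dist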
Preferably, step S3 further comprises: the score s is
s = ω1(ω11·s11 + ω12·s12 + ... + ω1j·s1j) + ω2(ω21·s21 + ω22·s22 + ... + ω2j·s2j) + ... + ωn(ωn1·sn1 + ωn2·sn2 + ... + ωnj·snj)
where ω1, ω2, ..., ωn denote the weights of the respective voice segments;
j denotes the total number of vowel segments plus consonant segments in each voice segment;
ω11, ω12, ..., ω1j denote the weights of the syllables in the first voice segment;
s11, s12, ..., s1j denote the syllables in the first voice segment;
ω21, ω22, ..., ω2j denote the weights of the syllables in the second voice segment;
s21, s22, ..., s2j denote the syllables in the second voice segment;
ωn1, ωn2, ..., ωnj denote the weights of the syllables in the n-th voice segment;
sn1, sn2, ..., snj denote the syllables in the n-th voice segment. A minimal transcription of this weighted sum follows.
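A direct Python transcription of the weighted sum above; the weights and scores in the usage comment are placeholders, as the patent does not prescribe values.

def overall_score(segment_weights, syllable_weights, syllable_scores):
    """Weighted score s = Σ_i ω_i · Σ_j ω_ij · s_ij (the formula above).

    segment_weights  : [ω_1, ..., ω_n], one weight per voice segment
    syllable_weights : [[ω_i1, ..., ω_ij], ...] syllable weights per segment
    syllable_scores  : [[s_i1, ..., s_ij], ...] syllable scores per segment
    """
    return sum(w * sum(wj * sj for wj, sj in zip(ws, ss))
               for w, ws, ss in zip(segment_weights, syllable_weights, syllable_scores))

# e.g. two segments with hypothetical weights:
# overall_score([0.6, 0.4], [[0.5, 0.5], [1.0]], [[80, 90], [70]])  -> 79.0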
As can be seen from the above technical solutions, the present invention has the following advantages:
the intelligent spoken language evaluation method lets the user and the computer obtain the same text and compares their readings of it, so the user can learn which words in his or her speech are pronounced inaccurately compared with the standard speech, and which words need improvement and further study. This brings convenience to language learners, improves the efficiency of foreign language learning, and increases the user's interest in learning.
Brief description of the drawings
Fig. 1 is a flow chart of the intelligent spoken language evaluation method.
Detailed description of the embodiments
To make the objects, features, and advantages of the invention more apparent and comprehensible, the technical solutions protected by the invention are described clearly and completely below with reference to specific embodiments and the accompanying drawings. Obviously, the embodiments disclosed below are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments in this patent without creative work fall within the scope of protection of this patent.
The present invention provides an intelligent spoken language evaluation method, as shown in Fig. 1. The method uses a standard reading text: the computer first obtains the content of the standard reading text and the standard pronunciation of that text. The method of the invention is implemented on computer hardware running corresponding programs. The user and the computer thus obtain the same text and their readings are compared, so the user can learn which words in his or her speech are pronounced inaccurately compared with the standard speech, and which words need improvement and further study. This brings convenience to language learners, improves the efficiency of foreign language learning, and increases the user's interest in learning.
The method comprises:
S1: acquiring the user's spoken voice data with the recording device of the computer, and extracting the user's speech features from the voice data;
S2: aligning the user's speech features with the standard speech features, and comparing the vowels and consonants in the user's speech features with the corresponding vowels and consonants in the standard speech features to obtain comparison data;
S3: scoring the comparison data;
S4: storing the comparison data and the scoring result in the database.
Before step S1 the method further comprises: setting a standard reading text, and acquiring the standard speech features of the standard reading text;
segmenting the standard speech features in time into n segments, with 20 ms as one time slice;
dividing the standard speech features of each time period into static features and dynamic features;
decomposing the spectrum energy of the standard speech features of each time period into the spectrum energy distribution of the vowel segments and the spectrum energy distribution of the consonant segments;
setting the vowel-segment MFCC feature vectors and the consonant-segment MFCC feature vectors of the standard speech features in each time period;
storing the vowel-segment and consonant-segment MFCC feature vectors of each time period into the database.
Step S1 further comprises:
S11: segmenting the user voice data in time into n segments, with 20 ms as one time slice, and applying a rectangular window or a Hamming window to the user voice data of each time period to obtain the windowed speech signal X_n, where n is the segment index;
S12: applying a short-time Fourier transform to the windowed speech signal X_n, converting the time-domain signal into the frequency-domain signal Y_n, and computing its short-time energy spectrum Q_n = |Y_n|²;
S13: feeding the short-time energy spectrum Q_n from the vector space S into a bank of band-pass filters in first-in-first-out order; since the components within each frequency band act on the human ear additively, the energies within each filter band are summed, yielding the output power spectrum x'(k) of the k-th filter;
S14: taking the logarithm of each filter output to obtain the log power spectrum of the band, then applying an inverse discrete cosine transform to obtain M MFCC coefficients, where M typically takes 13 to 15; the MFCC coefficients are C_n = Σ_{k=1}^{M} log x'(k) · cos((2k+1)nπ/(2M));
S15: taking the user-speech MFCC features of each time period as the static features, then taking the first- and second-order differences of the static features to obtain the corresponding dynamic features.
In this embodiment, step S1 further comprises:
obtaining the spectrum energy h(f_k) of each frequency bin of each voice segment and, given the upper frequency limit k1 and lower frequency limit k2 of the voice segment, obtaining the spectrum energy ratio PN_n within the segment, PN_n = (Σ_{k=k1}^{k2} h(f_k) / Σ_k h(f_k)) × 100%.
Step S1 further comprises:
if the spectrum energy h(f_k) in a voice segment is greater than or equal to the first threshold and the spectrum energy ratio PN_n of the segment is greater than or equal to the second threshold, the segment is judged to be a vowel segment; the first threshold takes 0.1 to 0.5 and the second threshold 60% to 85%;
taking the spectrum energy of the vowel segment as reference, judging whether the zero-crossing rate of the spectrum energy before the vowel segment exceeds the third threshold; if so, that spectrum energy is concluded to be a consonant before the vowel; the third threshold takes 100;
taking the spectrum energy of the vowel segment as reference, judging whether the zero-crossing rate of the spectrum energy after the vowel segment exceeds the third threshold; if so, that spectrum energy is judged to be a consonant after the vowel;
if the zero-crossing rate of the spectrum energy after the vowel segment exceeds the third threshold and that spectrum energy is the last frame of the voice segment, it is judged to be a nasal coda consonant.
Decomposing each voice segment of the user thus yields its vowel segments, its consonant segments, and whether the last frame of the voice segment contains a nasal coda consonant; a nasal coda consonant is a nasal sound.
The vowel segments, the consonant segments, and the presence of a nasal coda consonant in the last frame of each voice segment of the standard reading text are preset in the computer. The vowel segments, the consonant segments, and the nasal coda consonant of the last frame of each voice segment read by the user are each compared with the standard speech features.
In this embodiment, step S2 further comprises:
setting the vowel-segment MFCC feature vectors and the consonant-segment MFCC feature vectors of the user's speech features in each time period;
using the DTW algorithm to obtain the minimum-error alignment path and the corresponding DTW distance;
based on the alignment path and the corresponding DTW distance, comparing the vowel-segment MFCC feature vectors of the user's speech features in each time period with the vowel-segment MFCC feature vectors of the standard speech features in the same time period, and comparing the consonant-segment MFCC feature vectors of the user's speech features with the consonant-segment MFCC feature vectors of the standard speech features, to obtain the pronunciation difference between the user's speech features and the standard speech features.
In this embodiment, step S2 further comprises:
setting the vowel-segment standard speech feature vector of each time period as P1 = [p1(1), p1(2), ..., p1(R)], with first-order difference vector PΔ1 = [pΔ1(1), pΔ1(2), ..., pΔ1(R)] (R is the vowel speech length of the standard speech features), where PΔ1(n) = |p1(n) - p1(n-1)|, n = 1, 2, ..., R, and p1(0) = 0;
setting the consonant-segment standard speech feature vector of each time period as P'1 = [p'1(1), p'1(2), ..., p'1(R)], with first-order difference vector P'Δ1 = [p'Δ1(1), p'Δ1(2), ..., p'Δ1(R)] (R is the speech length of the standard speech features), where P'Δ1(n) = |p'1(n) - p'1(n-1)|, n = 1, 2, ..., R, and p'1(0) = 0.
Step S2 further comprises:
setting the vowel-segment feature vector of the user's speech features in each time period as P2 = [p2(1), p2(2), ..., p2(T)], with first-order difference vector PΔ2 = [pΔ2(1), pΔ2(2), ..., pΔ2(T)] (T is the length of the speech to be evaluated), where PΔ2(n) = |p2(n) - p2(n-1)|, n = 1, 2, ..., T, and p2(0) = 0;
setting the consonant-segment feature vector of the user's speech features in each time period as P'2 = [p'2(1), p'2(2), ..., p'2(T)], with first-order difference vector P'Δ2 = [p'Δ2(1), p'Δ2(2), ..., p'Δ2(T)] (T is the length of the speech to be evaluated), where P'Δ2(n) = |p'2(n) - p'2(n-1)|, n = 1, 2, ..., T, and p'2(0) = 0;
using the DTW algorithm to obtain the minimum-error alignment path, and comparing the vowel segments and consonant segments within each time period along that path;
the comparison yields the gap d_p of the vowel segment and the gap Δd_p of its variation, and the gap d'_p of the consonant segment and the gap Δd'_p of its variation, giving the similarity between the user's speech features and the standard speech features, namely:
d_p = |p1(n) - p2(m)|
d'_p = |p'1(n) - p'2(m)|
Δd_p = |Δp1(n) - Δp2(m)|
Δd'_p = |Δp'1(n) - Δp'2(m)|
where Δp_i(n) = |p_i(n) - p_i(n-1)| and Δp'_i(n) = |p'_i(n) - p'_i(n-1)|.
Step S3 further comprises: the score s is
s = ω1(ω11·s11 + ω12·s12 + ... + ω1j·s1j) + ω2(ω21·s21 + ω22·s22 + ... + ω2j·s2j) + ... + ωn(ωn1·sn1 + ωn2·sn2 + ... + ωnj·snj)
where ω1, ω2, ..., ωn denote the weights of the respective voice segments;
j denotes the total number of vowel segments plus consonant segments in each voice segment;
ω11, ω12, ..., ω1j denote the weights of the syllables in the first voice segment;
s11, s12, ..., s1j denote the syllables in the first voice segment;
in the first voice segment, if the first syllable is a consonant segment then s11 is a consonant-segment syllable, and if the first syllable is a vowel segment then s11 is a vowel segment; likewise, if the second syllable is a consonant segment then s12 is a consonant-segment syllable, and if it is a vowel segment then s12 is a vowel segment; and so on for each voice segment;
ω21, ω22, ..., ω2j denote the weights of the syllables in the second voice segment;
s21, s22, ..., s2j denote the syllables in the second voice segment;
ωn1, ωn2, ..., ωnj denote the weights of the syllables in the n-th voice segment;
sn1, sn2, ..., snj denote the syllables in the n-th voice segment.
Each weight parameter is derived from a large number of experiments; it can also be obtained from the weight proportions of the voice segments, set according to the importance of each voice segment to the text, or set by researchers after many experiments to achieve the best results. A worked example follows below.
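An illustrative end-to-end run, reusing mfcc_features, segment_gaps, and overall_score from the sketches above. The random signals, the linear gap-to-score mapping, and the unit weights are hypothetical placeholders, not values from the patent.

import numpy as np

rng = np.random.default_rng(0)
std_signal = rng.standard_normal(16000)          # stand-in for 1 s of standard speech
user_signal = rng.standard_normal(16000)         # stand-in for 1 s of user speech

std_static, _, _ = mfcc_features(std_signal)     # standard MFCC features (prepared before S1)
user_static, _, _ = mfcc_features(user_signal)   # user MFCC features (S1)

# S2: align and compare the first MFCC coefficient track of one segment
d, dd, dist = segment_gaps(std_static[:, 0], user_static[:, 0])

# S3: map the mean gap to a syllable score, then weight it; the mapping and
# the single unit weight are assumptions for illustration only
syllable_score = max(0.0, 100.0 - 10.0 * float(np.mean(d)))
s = overall_score([1.0], [[1.0]], [[syllable_score]])
print(f"score: {s:.1f}")                         # S4 would store s in the database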

Claims (9)

1. An intelligent spoken language evaluation method, characterized in that the method comprises:
S1: acquiring the user's spoken voice data with the recording device of a computer, and extracting the user's speech features from the voice data;
S2: aligning the user's speech features with the standard speech features, and comparing the vowels and consonants in the user's speech features with the corresponding vowels and consonants in the standard speech features to obtain comparison data;
S3: scoring the comparison data;
S4: storing the comparison data and the scoring result in a database.
2. The intelligent spoken language evaluation method according to claim 1, characterized in that:
before step S1 the method further comprises: setting a standard reading text, and acquiring the standard speech features of the standard reading text;
segmenting the standard speech features in time into n segments, with 20 ms as one time slice;
dividing the standard speech features of each time period into static features and dynamic features;
decomposing the spectrum energy of the standard speech features of each time period into the spectrum energy distribution of the vowel segments and the spectrum energy distribution of the consonant segments;
setting the vowel-segment MFCC feature vectors and the consonant-segment MFCC feature vectors of the standard speech features in each time period;
storing the vowel-segment and consonant-segment MFCC feature vectors of each time period into the database.
3. The intelligent spoken language evaluation method according to claim 1, characterized in that:
step S1 further comprises:
S11: segmenting the user voice data in time into n segments, with 20 ms as one time slice, and applying a rectangular window or a Hamming window to the user voice data of each time period to obtain the windowed speech signal X_n, where n is the segment index;
S12: applying a short-time Fourier transform to the windowed speech signal X_n, converting the time-domain signal into the frequency-domain signal Y_n, and computing its short-time energy spectrum Q_n = |Y_n|²;
S13: feeding the short-time energy spectrum Q_n from the vector space S into a bank of band-pass filters in first-in-first-out order; since the components within each frequency band act on the human ear additively, the energies within each filter band are summed, yielding the output power spectrum x'(k) of the k-th filter;
S14: taking the logarithm of each filter output to obtain the log power spectrum of the band, then applying an inverse discrete cosine transform to obtain M MFCC coefficients, where M typically takes 13 to 15; the MFCC coefficients are:
C_n = Σ_{k=1}^{M} log x'(k) · cos((2k+1)nπ/(2M));
S15: taking the user-speech MFCC features of each time period as the static features, then taking the first- and second-order differences of the static features to obtain the corresponding dynamic features.
4. The intelligent spoken language evaluation method according to claim 1, characterized in that:
step S1 further comprises:
obtaining the spectrum energy h(f_k) of each frequency bin of each voice segment and, given the upper frequency limit k1 and lower frequency limit k2 of the voice segment, obtaining the spectrum energy ratio PN_n within the segment:
PN_n = (Σ_{k=k1}^{k2} h(f_k) / Σ_k h(f_k)) × 100%.
5. The intelligent spoken language evaluation method according to claim 4, characterized in that:
step S1 further comprises:
if the spectrum energy h(f_k) in a voice segment is greater than or equal to a first threshold and the spectrum energy ratio PN_n of the segment is greater than or equal to a second threshold, the segment is judged to be a vowel segment; the first threshold takes 0.1 to 0.5 and the second threshold 60% to 85%;
taking the spectrum energy of the vowel segment as reference, judging whether the zero-crossing rate of the spectrum energy before the vowel segment exceeds a third threshold; if so, that spectrum energy is concluded to be a consonant segment before the vowel; the third threshold takes 100;
taking the spectrum energy of the vowel segment as reference, judging whether the zero-crossing rate of the spectrum energy after the vowel segment exceeds the third threshold; if so, that spectrum energy is judged to be a consonant after the vowel;
if the zero-crossing rate of the spectrum energy after the vowel segment exceeds the third threshold and that spectrum energy is the last frame of the voice segment, it is judged to be a nasal coda consonant.
6. The intelligent spoken language evaluation method according to claim 5, characterized in that:
step S2 further comprises:
setting the vowel-segment MFCC feature vectors and the consonant-segment MFCC feature vectors of the user's speech features in each time period;
using the DTW algorithm to obtain the minimum-error alignment path and the corresponding DTW distance;
based on the alignment path and the corresponding DTW distance, comparing the vowel-segment MFCC feature vectors of the user's speech features in each time period with the vowel-segment MFCC feature vectors of the standard speech features in the same time period, and comparing the consonant-segment MFCC feature vectors of the user's speech features with the consonant-segment MFCC feature vectors of the standard speech features, to obtain the pronunciation difference between the user's speech features and the standard speech features.
7. The intelligent spoken language evaluation method according to claim 5, characterized in that:
step S2 further comprises:
setting the vowel-segment standard speech feature vector of each time period as P1 = [p1(1), p1(2), ..., p1(R)], with first-order difference vector PΔ1 = [pΔ1(1), pΔ1(2), ..., pΔ1(R)] (R is the vowel speech length of the standard speech features), where PΔ1(n) = |p1(n) - p1(n-1)|, n = 1, 2, ..., R, and p1(0) = 0;
setting the consonant-segment standard speech feature vector of each time period as P'1 = [p'1(1), p'1(2), ..., p'1(R)], with first-order difference vector P'Δ1 = [p'Δ1(1), p'Δ1(2), ..., p'Δ1(R)] (R is the speech length of the standard speech features), where P'Δ1(n) = |p'1(n) - p'1(n-1)|, n = 1, 2, ..., R, and p'1(0) = 0.
8. The intelligent spoken language evaluation method according to claim 7, characterized in that:
step S2 further comprises:
setting the vowel-segment feature vector of the user's speech features in each time period as P2 = [p2(1), p2(2), ..., p2(T)], with first-order difference vector PΔ2 = [pΔ2(1), pΔ2(2), ..., pΔ2(T)] (T is the length of the speech to be evaluated), where PΔ2(n) = |p2(n) - p2(n-1)|, n = 1, 2, ..., T, and p2(0) = 0;
setting the consonant-segment feature vector of the user's speech features in each time period as P'2 = [p'2(1), p'2(2), ..., p'2(T)], with first-order difference vector P'Δ2 = [p'Δ2(1), p'Δ2(2), ..., p'Δ2(T)] (T is the length of the speech to be evaluated), where P'Δ2(n) = |p'2(n) - p'2(n-1)|, n = 1, 2, ..., T, and p'2(0) = 0;
using the DTW algorithm to obtain the minimum-error alignment path, and comparing the vowel segments and consonant segments within each time period along that path;
the comparison yields the gap d_p of the vowel segment and the gap Δd_p of its variation, and the gap d'_p of the consonant segment and the gap Δd'_p of its variation, giving the similarity between the user's speech features and the standard speech features, namely:
d_p = |p1(n) - p2(m)|
d'_p = |p'1(n) - p'2(m)|
Δd_p = |Δp1(n) - Δp2(m)|
Δd'_p = |Δp'1(n) - Δp'2(m)|
where Δp_i(n) = |p_i(n) - p_i(n-1)| and Δp'_i(n) = |p'_i(n) - p'_i(n-1)|.
9. The intelligent spoken language evaluation method according to claim 1, characterized in that:
step S3 further comprises: the score s is
s = ω1(ω11·s11 + ω12·s12 + ... + ω1j·s1j) + ω2(ω21·s21 + ω22·s22 + ... + ω2j·s2j) + ... + ωn(ωn1·sn1 + ωn2·sn2 + ... + ωnj·snj)
where ω1, ω2, ..., ωn denote the weights of the respective voice segments;
j denotes the total number of vowel segments plus consonant segments in each voice segment;
ω11, ω12, ..., ω1j denote the weights of the syllables in the first voice segment;
s11, s12, ..., s1j denote the syllables in the first voice segment;
ω21, ω22, ..., ω2j denote the weights of the syllables in the second voice segment;
s21, s22, ..., s2j denote the syllables in the second voice segment;
ωn1, ωn2, ..., ωnj denote the weights of the syllables in the n-th voice segment;
sn1, sn2, ..., snj denote the syllables in the n-th voice segment.
CN201611181451.5A 2016-12-20 2016-12-20 Intelligent spoken language evaluation method Pending CN106531189A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611181451.5A CN106531189A (en) 2016-12-20 2016-12-20 Intelligent spoken language evaluation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611181451.5A CN106531189A (en) 2016-12-20 2016-12-20 Intelligent spoken language evaluation method

Publications (1)

Publication Number Publication Date
CN106531189A 2017-03-22

Family

ID=58340401

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611181451.5A Pending CN106531189A (en) 2016-12-20 2016-12-20 Intelligent spoken language evaluation method

Country Status (1)

Country Link
CN (1) CN106531189A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107767862A (en) * 2017-11-06 2018-03-06 深圳市领芯者科技有限公司 Voice data processing method, system and storage medium
CN108470476A * 2018-05-15 2018-08-31 黄淮学院 English pronunciation matching correction system
CN109300474A * 2018-09-14 2019-02-01 北京网众共创科技有限公司 Audio signal processing method and device
CN109300484A * 2018-09-13 2019-02-01 广州酷狗计算机科技有限公司 Audio alignment method, device, computer equipment and readable storage medium
CN109727608A * 2017-10-25 2019-05-07 香港中文大学深圳研究院 Pathological voice evaluation method based on Chinese speech
CN110825244A (en) * 2019-11-06 2020-02-21 王一峰 Modern Shanghai input method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20040073291A (en) * 2004-01-08 2004-08-19 정보통신연구진흥원 appraisal system of foreign language pronunciation and method thereof
CN101727903A (en) * 2008-10-29 2010-06-09 中国科学院自动化研究所 Pronunciation quality assessment and error detection method based on fusion of multiple characteristics and multiple systems
CN101996635A (en) * 2010-08-30 2011-03-30 清华大学 English pronunciation quality evaluation method based on accent highlight degree

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20040073291A (en) * 2004-01-08 2004-08-19 정보통신연구진흥원 appraisal system of foreign language pronunciation and method thereof
CN101727903A (en) * 2008-10-29 2010-06-09 中国科学院自动化研究所 Pronunciation quality assessment and error detection method based on fusion of multiple characteristics and multiple systems
CN101996635A (en) * 2010-08-30 2011-03-30 清华大学 English pronunciation quality evaluation method based on accent highlight degree

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
周晓兰 (Zhou Xiaolan): "Speech in computer-aided Putonghua proficiency testing", 《农村经济与科技》 (Rural Economy and Science-Technology) *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109727608A * 2017-10-25 2019-05-07 香港中文大学深圳研究院 Pathological voice evaluation method based on Chinese speech
CN107767862A (en) * 2017-11-06 2018-03-06 深圳市领芯者科技有限公司 Voice data processing method, system and storage medium
CN108470476A * 2018-05-15 2018-08-31 黄淮学院 English pronunciation matching correction system
CN108470476B (en) * 2018-05-15 2020-06-30 黄淮学院 English pronunciation matching correction system
CN109300484A * 2018-09-13 2019-02-01 广州酷狗计算机科技有限公司 Audio alignment method, device, computer equipment and readable storage medium
CN109300484B (en) * 2018-09-13 2021-07-02 广州酷狗计算机科技有限公司 Audio alignment method and device, computer equipment and readable storage medium
CN109300474A * 2018-09-14 2019-02-01 北京网众共创科技有限公司 Audio signal processing method and device
CN109300474B (en) * 2018-09-14 2022-04-26 北京网众共创科技有限公司 Voice signal processing method and device
CN110825244A (en) * 2019-11-06 2020-02-21 王一峰 Modern Shanghai input method

Similar Documents

Publication Publication Date Title
CN106531189A (en) Intelligent spoken language evaluation method
CN103065626B (en) Automatic grading method and automatic grading equipment for read questions in test of spoken English
CN106782609A Spoken language comparison method
CN110457432A Interview scoring method, device, equipment and storage medium
CN103617799A Method for detecting English sentence pronunciation quality suitable for mobile devices
US11282511B2 (en) System and method for automatic speech analysis
CN106847260A Automatic spoken English scoring method based on feature fusion
CN103366735B Speech data mapping method and device
CN102723077B (en) Method and device for voice synthesis for Chinese teaching
CN107886968A (en) Speech evaluating method and system
Hirson et al. Glottal fry and voice disguise: a case study in forensic phonetics
Sabu et al. Automatic Assessment of Children's L2 Reading for Accuracy and Fluency.
CN112767961B (en) Accent correction method based on cloud computing
CN111210845B (en) Pathological voice detection device based on improved autocorrelation characteristics
Dumpala et al. Analysis of the Effect of Speech-Laugh on Speaker Recognition System.
Luo et al. Investigation of the effects of automatic scoring technology on human raters' performances in L2 speech proficiency assessment
Li et al. English sentence pronunciation evaluation using rhythm and intonation
Liu Application of speech recognition technology in pronunciation correction of college oral English teaching
CN103021226B (en) Voice evaluating method and device based on pronunciation rhythms
Duan et al. An English pronunciation and intonation evaluation method based on the DTW algorithm
Jambi et al. An Empirical Performance Analysis of the Speak Correct Computerized Interface
Yu Evaluation of English Pronunciation Quality Based on Decision Tree Algorithm
Pakhomov et al. Forced-alignment and edit-distance scoring for vocabulary tutoring applications
CN101546553A (en) Objective examination method of flat-tongue sound and cacuminal in standard Chinese
Bolanos et al. Automatic assessment of oral reading fluency for Spanish speaking ELs

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170322