CN106531189A - Intelligent spoken language evaluation method - Google Patents
- Publication number
- CN106531189A (application CN201611181451.5A)
- Authority
- CN
- China
- Prior art keywords
- feature
- section
- vowel
- consonant
- spectrum energy
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The present invention provides an intelligent spoken language evaluation method. A computer recording device captures the user's spoken voice data, and the user's phonetic features are extracted from that data. The user's phonetic features are then aligned and compared with standard phonetic features: vowels and consonants in the user's speech are compared one-to-one with the corresponding vowels and consonants of the standard pronunciation to form comparison data. The comparison data is scored, and both the data and the score are stored in a database. By comparing their speech against the standard, users can identify which words they pronounce inaccurately. As a result, language study becomes more convenient for learners, the efficiency of foreign-language learning improves, and the user's interest in learning increases.
Description
Technical field
The present invention relates to the field of language communication, and more particularly to an intelligent spoken language evaluation method.
Background technology
With the development of global economic integration, English plays an increasingly important role as an international language. Activities such as business exchange, cultural exchange, and cross-border tourism grow ever more frequent, and more and more people need to learn a foreign language, so improving oral communication skills has become an urgent demand of foreign-language study. How to improve the learning outcomes of a foreign language and better meet the needs of foreign-language learners has become a pressing problem to be solved.
Summary of the invention
In order to overcome the above shortcomings of the prior art, it is an object of the present invention to provide an intelligent spoken language evaluation method, the method including:
S1: acquiring the user's spoken voice data using the recording equipment of a computer, and extracting the user's phonetic features from the voice data;
S2: aligning the user's phonetic features with the standard phonetic features, comparing the vowels and consonants in the user's phonetic features with the corresponding vowels and consonants of the standard phonetic features, and obtaining comparison data;
S3: scoring the comparison data;
S4: storing the comparison data and the scoring result in a database.
Preferably, the method further includes, before step S1: setting a standard read-aloud text and acquiring the standard phonetic features of that text;
segmenting the standard phonetic features in time into n segments, with 20 ms as one time slice;
dividing the standard phonetic features of each time period into static features and dynamic features;
decomposing the spectral energy of the standard phonetic features of each time period to obtain the spectral energy distribution of the vowel segments and that of the consonant segments;
setting the vowel-segment MFCC feature vectors and consonant-segment MFCC feature vectors of the standard phonetic features in each time period;
storing the vowel-segment and consonant-segment MFCC feature vectors of each time period in the database.
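As a concrete illustration of the 20 ms time segmentation and windowing described above, the following sketch splits a signal into consecutive 20 ms slices and applies a Hamming window. The 16 kHz sample rate is an assumption for illustration; the patent does not specify one.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=20):
    """Split a speech signal into consecutive 20 ms time slices and
    apply a Hamming window to each, as in the segmentation step."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 320 samples at 16 kHz
    n_frames = len(signal) // frame_len              # n, the number of segments
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    # the method allows a rectangular or Hamming window; Hamming shown here
    return frames * np.hamming(frame_len)

# one second of audio at 16 kHz -> 50 frames of 20 ms each
audio = np.random.randn(16000)
frames = frame_signal(audio)
```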
Preferably, step S1 further includes:
S11: segmenting the user voice data in time into n segments, with 20 ms as one time slice, and applying a rectangular or Hamming window to each time period of user voice data to obtain the segmented speech signal Xn, where n is the segment index;
S12: applying a short-time Fourier transform to the segmented speech signal Xn, converting the time-domain signal into the frequency-domain signal Yn, and computing its short-time energy spectrum Qn = |Yn|²;
S13: passing the short-time energy spectrum Qn from the vector space S through the bandpass filters in first-in-first-out order; since the components within each frequency band superpose in their effect on the human ear, the energies within each filter band are summed, giving the output power spectrum x'(k) of the k-th filter;
S14: taking the logarithm of each filter output to obtain the log power spectrum of the band, then applying the inverse discrete cosine transform to obtain M MFCC coefficients, where M typically takes 13-15; the MFCC coefficients are C(m) = Σk log x'(k)·cos(πm(2k−1)/(2K)), m = 1, 2, …, M, where K is the number of filters;
S15: taking the user-speech MFCC features of each time period as static features, then computing their first- and second-order differences to obtain the corresponding dynamic features.
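Steps S12-S15 follow the standard MFCC recipe. The sketch below is a minimal, hedged illustration of that recipe using NumPy only; the FFT size (512), filter count (26), and 16 kHz sample rate are assumptions not stated in the patent.

```python
import numpy as np

def mfcc_frame(frame, sample_rate=16000, n_filters=26, n_coeffs=13):
    """Steps S12-S14 for one windowed frame: power spectrum |Yn|^2,
    mel bandpass filterbank with in-band energies summed to x'(k),
    log power spectrum, then DCT -> M (=13) MFCC coefficients."""
    nfft = 512
    power = np.abs(np.fft.rfft(frame, nfft)) ** 2          # S12: Qn = |Yn|^2
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for k in range(1, n_filters + 1):                      # S13: triangular filters
        fbank[k - 1, bins[k - 1]:bins[k]] = np.linspace(0, 1, bins[k] - bins[k - 1], endpoint=False)
        fbank[k - 1, bins[k]:bins[k + 1]] = np.linspace(1, 0, bins[k + 1] - bins[k], endpoint=False)
    xk = fbank @ power                                     # in-band energies summed
    log_power = np.log(xk + 1e-10)                         # S14: log power spectrum
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(1, n_coeffs + 1), 2 * n + 1) / (2 * n_filters))
    return dct @ log_power                                 # M MFCC coefficients

def dynamic_features(static):
    """S15: first- and second-order differences of the static MFCC
    features give the dynamic features."""
    d1 = np.diff(static, axis=0, prepend=static[:1])
    d2 = np.diff(d1, axis=0, prepend=d1[:1])
    return d1, d2

coeffs = mfcc_frame(np.hamming(320) * np.random.randn(320))
```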
Preferably, step S1 further includes:
obtaining the spectral energy (fk) of the frequency range of each speech segment, together with the upper frequency limit k1 and lower frequency limit k2 of the segment, and obtaining the spectral energy ratio PNn within the speech segment.
Preferably, step S1 further includes:
if the spectral energy (fk) in a speech segment is ≥ the first threshold and the spectral energy ratio PNn in the segment is ≥ the second threshold, the segment is judged to be a vowel segment; the first threshold takes 0.1-0.5 and the second threshold 60%-85%;
taking the spectral energy of the vowel segment as reference, it is judged whether the zero-crossing rate of the spectral energy before the vowel segment exceeds the third threshold; if so, that spectral energy is concluded to be the consonant segment before the vowel; the third threshold takes 100;
taking the spectral energy of the vowel segment as reference, it is judged whether the zero-crossing rate of the spectral energy after the vowel segment exceeds the third threshold; if so, that spectral energy is judged to be the consonant after the vowel;
if the zero-crossing rate of the spectral energy after the vowel segment exceeds the third threshold and that spectral energy lies in the last frame of the speech segment, it is judged to be a nasal tail consonant.
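The vowel/consonant decision rules above can be sketched as follows. The concrete threshold values (0.3, 70%, 100) are picked from the ranges the text gives, and the per-segment energies, energy ratios, and zero-crossing rates are assumed to be precomputed:

```python
def label_segments(energies, ratios, zcrs, t1=0.3, t2=0.7, t3=100):
    """Decision rules sketched from the text: a segment whose spectral
    energy >= first threshold and energy ratio >= second threshold is a
    vowel segment; a neighbouring segment whose zero-crossing rate
    exceeds the third threshold is the consonant before/after that
    vowel; a high-ZCR final frame after a vowel is the nasal tail
    consonant."""
    labels = ["other"] * len(energies)
    for i, (e, r) in enumerate(zip(energies, ratios)):
        if e >= t1 and r >= t2:
            labels[i] = "vowel"
    for i, lab in enumerate(list(labels)):
        if lab != "vowel":
            continue
        if i > 0 and labels[i - 1] == "other" and zcrs[i - 1] > t3:
            labels[i - 1] = "consonant"                # consonant before the vowel
        if i + 1 < len(labels) and labels[i + 1] == "other" and zcrs[i + 1] > t3:
            # the last frame gets the nasal tail label; earlier frames are consonants
            labels[i + 1] = "tail-consonant" if i + 1 == len(labels) - 1 else "consonant"
    return labels

# three segments: fricative-like, vowel-like, high-ZCR final frame
labels = label_segments([0.05, 0.4, 0.05], [0.2, 0.8, 0.3], [150, 20, 120])
```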
Preferably, step S2 further includes:
setting the vowel-segment MFCC feature vectors and consonant-segment MFCC feature vectors of the user's phonetic features in each time period;
using the DTW algorithm to obtain the minimum-error alignment path and its corresponding DTW distance;
based on the alignment path and the corresponding DTW distance, comparing the vowel-segment MFCC feature vectors of the user's phonetic features with those of the standard phonetic features within the same time period, and likewise comparing the consonant-segment MFCC feature vectors, to obtain the pronunciation difference between the user's phonetic features and the standard phonetic features.
Preferably, step S2 further includes:
setting the vowel-segment standard feature vector of the standard phonetic features in each time period as P1 = [p1(1), p1(2), …, p1(R)], with first-order difference vector PΔ1 = [pΔ1(1), pΔ1(2), …, pΔ1(R)], where R is the vowel speech length of the standard phonetic features; PΔ1(n) = |p1(n) − p1(n−1)|, n = 1, 2, …, R, p1(0) = 0;
setting the consonant-segment standard feature vector of the standard phonetic features in each time period as P′1 = [p′1(1), p′1(2), …, p′1(R)], with first-order difference vector P′Δ1 = [p′Δ1(1), p′Δ1(2), …, p′Δ1(R)], where R is the speech length of the standard phonetic features; P′Δ1(n) = |p′1(n) − p′1(n−1)|, n = 1, 2, …, R, p′1(0) = 0.
Preferably, step S2 further includes:
setting the vowel-segment feature vector of the user's phonetic features in each time period as P2 = [p2(1), p2(2), …, p2(T)], with first-order difference vector PΔ2 = [pΔ2(1), pΔ2(2), …, pΔ2(T)], where T is the length of the speech to be evaluated; PΔ2(n) = |p2(n) − p2(n−1)|, n = 1, 2, …, T, p2(0) = 0;
setting the consonant-segment feature vector of the user's phonetic features in each time period as P′2 = [p′2(1), p′2(2), …, p′2(T)], with first-order difference vector P′Δ2 = [p′Δ2(1), p′Δ2(2), …, p′Δ2(T)], where T is the length of the speech to be evaluated; P′Δ2(n) = |p′2(n) − p′2(n−1)|, n = 1, 2, …, T, p′2(0) = 0;
using the DTW algorithm to obtain the minimum-error alignment path, and comparing the vowel segments and consonant segments within each time period;
the comparison yields the vowel-segment gap dp and its variation gap Δdp, and the consonant-segment gap d′p and its variation gap Δd′p, giving the similarity between the user's phonetic features and the standard phonetic features, i.e.:
dp = |p1(n) − p2(m)|
d′p = |p′1(n) − p′2(m)|
Δdp = |Δp1(n) − Δp2(m)|
Δd′p = |Δp′1(n) − Δp′2(m)|
where Δpi(n) = |pi(n) − pi(n−1)| and Δp′i(n) = |p′i(n) − p′i(n−1)|.
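The minimum-error alignment obtained with the DTW algorithm above can be illustrated with a minimal dynamic-programming sketch over 1-D feature sequences (the patent applies it to MFCC vector sequences; scalars are used here for brevity):

```python
import numpy as np

def dtw(ref, test):
    """Minimum-error DTW alignment between a standard feature sequence
    (length R) and a user feature sequence (length T), returning the
    total distance and the alignment path of (n, m) index pairs."""
    R, T = len(ref), len(test)
    cost = np.full((R + 1, T + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, R + 1):
        for j in range(1, T + 1):
            d = abs(ref[i - 1] - test[j - 1])          # local gap |p1(n) - p2(m)|
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    path, i, j = [], R, T                              # backtrack the best path
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return cost[R, T], path[::-1]

# a repeated value in the user sequence is absorbed by the alignment
dist, path = dtw([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
```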
Preferably, step S3 further includes: the score s is:
s = ω1(ω11s11 + ω12s12 + … + ω1js1j) + ω2(ω21s21 + ω22s22 + … + ω2js2j) + … + ωn(ωn1sn1 + ωn2sn2 + … + ωnjsnj)
where ω1, ω2, …, ωn are the weights of the respective speech segments;
j is the total number of vowel segments plus consonant segments in each speech segment;
ω11, ω12, …, ω1j are the weights of the syllables in the first speech segment;
s11, s12, …, s1j are the scores of the syllables in the first speech segment;
ω21, ω22, …, ω2j are the weights of the syllables in the second speech segment;
s21, s22, …, s2j are the scores of the syllables in the second speech segment;
ωn1, ωn2, …, ωnj are the weights of the syllables in the n-th speech segment;
sn1, sn2, …, snj are the scores of the syllables in the n-th speech segment.
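The weighted score s defined above can be sketched directly; the weight and per-syllable score values below are illustrative inputs, not values from the patent:

```python
def overall_score(segment_weights, syllable_weights, syllable_scores):
    """Step S3: s = sum over segments n of wn * (sum over syllables j
    of wnj * snj), i.e. a weighted sum of per-syllable scores inside
    weighted speech segments."""
    return sum(
        w_n * sum(w * s for w, s in zip(ws, ss))
        for w_n, ws, ss in zip(segment_weights, syllable_weights, syllable_scores)
    )

# two speech segments of equal weight, one syllable each
s = overall_score([0.5, 0.5], [[1.0], [1.0]], [[80.0], [90.0]])
```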
As can be seen from the above technical solution, the present invention has the following advantages:
The intelligent spoken language evaluation method lets the user obtain the same text as the computer and compare readings of it, so the user can learn which words in their speech are pronounced inaccurately compared with the standard pronunciation, and in which words they need to improve and continue studying. This brings convenience to language learners, improves the efficiency of foreign-language learning, and increases the user's interest in learning.
Description of the drawings
Fig. 1 is a flow chart of the intelligent spoken language evaluation method.
Specific embodiments
To make the object, features, and advantages of the invention more apparent and understandable, the technical scheme protected by the present invention is described clearly and completely below with reference to specific embodiments and the accompanying drawing. Obviously, the embodiments disclosed below are only a part of the embodiments of the invention, not all of them. All other embodiments obtained by persons of ordinary skill in the art on the basis of the embodiments in this patent, without creative work, fall within the scope of protection of this patent.
The present invention provides an intelligent spoken language evaluation method. As shown in Fig. 1, the method uses a standard read-aloud text: the computer first obtains the content of that text and its standard pronunciation. The method of the present invention is implemented on computer hardware together with the corresponding program. The user thus obtains the same text as the computer and the readings are compared, so the user can learn which words in their speech are pronounced inaccurately compared with the standard, and in which words they need to improve and continue studying. This brings convenience to language learners, improves the efficiency of foreign-language learning, and increases the user's interest in learning.
The method includes:
S1: acquiring the user's spoken voice data using the recording equipment of a computer, and extracting the user's phonetic features from the voice data;
S2: aligning the user's phonetic features with the standard phonetic features, comparing the vowels and consonants in the user's phonetic features with the corresponding vowels and consonants of the standard phonetic features, and obtaining comparison data;
S3: scoring the comparison data;
S4: storing the comparison data and the scoring result in a database.
Before step S1, the method further includes: setting a standard read-aloud text and acquiring the standard phonetic features of that text;
segmenting the standard phonetic features in time into n segments, with 20 ms as one time slice;
dividing the standard phonetic features of each time period into static features and dynamic features;
decomposing the spectral energy of the standard phonetic features of each time period to obtain the spectral energy distribution of the vowel segments and that of the consonant segments;
setting the vowel-segment MFCC feature vectors and consonant-segment MFCC feature vectors of the standard phonetic features in each time period;
storing the vowel-segment and consonant-segment MFCC feature vectors of each time period in the database.
Step S1 further includes:
S11: segmenting the user voice data in time into n segments, with 20 ms as one time slice, and applying a rectangular or Hamming window to each time period of user voice data to obtain the segmented speech signal Xn, where n is the segment index;
S12: applying a short-time Fourier transform to the segmented speech signal Xn, converting the time-domain signal into the frequency-domain signal Yn, and computing its short-time energy spectrum Qn = |Yn|²;
S13: passing the short-time energy spectrum Qn from the vector space S through the bandpass filters in first-in-first-out order; since the components within each frequency band superpose in their effect on the human ear, the energies within each filter band are summed, giving the output power spectrum x'(k) of the k-th filter;
S14: taking the logarithm of each filter output to obtain the log power spectrum of the band, then applying the inverse discrete cosine transform to obtain M MFCC coefficients, where M typically takes 13-15; the MFCC coefficients are C(m) = Σk log x'(k)·cos(πm(2k−1)/(2K)), m = 1, 2, …, M, where K is the number of filters;
S15: taking the user-speech MFCC features of each time period as static features, then computing their first- and second-order differences to obtain the corresponding dynamic features.
In the present embodiment, step S1 further includes:
obtaining the spectral energy (fk) of the frequency range of each speech segment, together with the upper frequency limit k1 and lower frequency limit k2 of the segment, and obtaining the spectral energy ratio PNn within the speech segment.
Step S1 further includes:
if the spectral energy (fk) in a speech segment is ≥ the first threshold and the spectral energy ratio PNn in the segment is ≥ the second threshold, the segment is judged to be a vowel segment; the first threshold takes 0.1-0.5 and the second threshold 60%-85%;
taking the spectral energy of the vowel segment as reference, it is judged whether the zero-crossing rate of the spectral energy before the vowel segment exceeds the third threshold; if so, that spectral energy is concluded to be the consonant before the vowel; the third threshold takes 100;
taking the spectral energy of the vowel segment as reference, it is judged whether the zero-crossing rate of the spectral energy after the vowel segment exceeds the third threshold; if so, that spectral energy is judged to be the consonant after the vowel;
if the zero-crossing rate of the spectral energy after the vowel segment exceeds the third threshold and that spectral energy lies in the last frame of the speech segment, it is judged to be a nasal tail consonant.
Each speech segment of the user is decomposed to obtain its vowel segments, its consonant segments, and whether the last frame of the segment carries a nasal tail consonant (a nasal sound).
The vowel segments, consonant segments, and any nasal tail consonant in the last frame of each speech segment of the standard read-aloud text are preset in the computer. The vowel segments, consonant segments, and last-frame nasal tail consonant of each speech segment read by the user are then each compared with the standard phonetic features.
In the present embodiment, step S2 further includes:
setting the vowel-segment MFCC feature vectors and consonant-segment MFCC feature vectors of the user's phonetic features in each time period;
using the DTW algorithm to obtain the minimum-error alignment path and its corresponding DTW distance;
based on the alignment path and the corresponding DTW distance, comparing the vowel-segment MFCC feature vectors of the user's phonetic features with those of the standard phonetic features within the same time period, and likewise comparing the consonant-segment MFCC feature vectors, to obtain the pronunciation difference between the user's phonetic features and the standard phonetic features.
In the present embodiment, step S2 further includes:
setting the vowel-segment standard feature vector of the standard phonetic features in each time period as P1 = [p1(1), p1(2), …, p1(R)], with first-order difference vector PΔ1 = [pΔ1(1), pΔ1(2), …, pΔ1(R)], where R is the vowel speech length of the standard phonetic features; PΔ1(n) = |p1(n) − p1(n−1)|, n = 1, 2, …, R, p1(0) = 0;
setting the consonant-segment standard feature vector of the standard phonetic features in each time period as P′1 = [p′1(1), p′1(2), …, p′1(R)], with first-order difference vector P′Δ1 = [p′Δ1(1), p′Δ1(2), …, p′Δ1(R)], where R is the speech length of the standard phonetic features; P′Δ1(n) = |p′1(n) − p′1(n−1)|, n = 1, 2, …, R, p′1(0) = 0.
Step S2 further includes:
setting the vowel-segment feature vector of the user's phonetic features in each time period as P2 = [p2(1), p2(2), …, p2(T)], with first-order difference vector PΔ2 = [pΔ2(1), pΔ2(2), …, pΔ2(T)], where T is the length of the speech to be evaluated; PΔ2(n) = |p2(n) − p2(n−1)|, n = 1, 2, …, T, p2(0) = 0;
setting the consonant-segment feature vector of the user's phonetic features in each time period as P′2 = [p′2(1), p′2(2), …, p′2(T)], with first-order difference vector P′Δ2 = [p′Δ2(1), p′Δ2(2), …, p′Δ2(T)], where T is the length of the speech to be evaluated; P′Δ2(n) = |p′2(n) − p′2(n−1)|, n = 1, 2, …, T, p′2(0) = 0;
using the DTW algorithm to obtain the minimum-error alignment path, and comparing the vowel segments and consonant segments within each time period;
the comparison yields the vowel-segment gap dp and its variation gap Δdp, and the consonant-segment gap d′p and its variation gap Δd′p, giving the similarity between the user's phonetic features and the standard phonetic features, i.e.:
dp = |p1(n) − p2(m)|
d′p = |p′1(n) − p′2(m)|
Δdp = |Δp1(n) − Δp2(m)|
Δd′p = |Δp′1(n) − Δp′2(m)|
where Δpi(n) = |pi(n) − pi(n−1)| and Δp′i(n) = |p′i(n) − p′i(n−1)|.
Step S3 further includes: the score s is:
s = ω1(ω11s11 + ω12s12 + … + ω1js1j) + ω2(ω21s21 + ω22s22 + … + ω2js2j) + … + ωn(ωn1sn1 + ωn2sn2 + … + ωnjsnj)
where ω1, ω2, …, ωn are the weights of the respective speech segments;
j is the total number of vowel segments plus consonant segments in each speech segment;
ω11, ω12, …, ω1j are the weights of the syllables in the first speech segment;
s11, s12, …, s1j are the scores of the syllables in the first speech segment;
in the first speech segment, s11 is the score of the first syllable whether that syllable is a consonant segment or a vowel segment, s12 likewise corresponds to the second syllable, and so on for each speech segment;
ω21, ω22, …, ω2j are the weights of the syllables in the second speech segment;
s21, s22, …, s2j are the scores of the syllables in the second speech segment;
ωn1, ωn2, …, ωnj are the weights of the syllables in the n-th speech segment;
sn1, sn2, …, snj are the scores of the syllables in the n-th speech segment.
Each weight parameter is obtained through a large number of experiments. It can also be derived from the weight proportions allocated to the speech segments, set according to the importance of each speech segment to the text, or set by the research staff to achieve the best effect after many experiments.
Claims (9)
1. An intelligent spoken language evaluation method, characterized in that the method includes:
S1: acquiring the user's spoken voice data using the recording equipment of a computer, and extracting the user's phonetic features from the voice data;
S2: aligning the user's phonetic features with the standard phonetic features, comparing the vowels and consonants in the user's phonetic features with the corresponding vowels and consonants of the standard phonetic features, and obtaining comparison data;
S3: scoring the comparison data;
S4: storing the comparison data and the scoring result in a database.
2. The intelligent spoken language evaluation method according to claim 1, characterized in that the method includes:
before step S1: setting a standard read-aloud text and acquiring the standard phonetic features of that text;
segmenting the standard phonetic features in time into n segments, with 20 ms as one time slice;
dividing the standard phonetic features of each time period into static features and dynamic features;
decomposing the spectral energy of the standard phonetic features of each time period to obtain the spectral energy distribution of the vowel segments and that of the consonant segments;
setting the vowel-segment MFCC feature vectors and consonant-segment MFCC feature vectors of the standard phonetic features in each time period;
storing the vowel-segment and consonant-segment MFCC feature vectors of each time period in the database.
3. The intelligent spoken language evaluation method according to claim 1, characterized in that step S1 further includes:
S11: segmenting the user voice data in time into n segments, with 20 ms as one time slice, and applying a rectangular or Hamming window to each time period of user voice data to obtain the segmented speech signal Xn, where n is the segment index;
S12: applying a short-time Fourier transform to the segmented speech signal Xn, converting the time-domain signal into the frequency-domain signal Yn, and computing its short-time energy spectrum Qn = |Yn|²;
S13: passing the short-time energy spectrum Qn from the vector space S through the bandpass filters in first-in-first-out order; since the components within each frequency band superpose in their effect on the human ear, the energies within each filter band are summed, giving the output power spectrum x'(k) of the k-th filter;
S14: taking the logarithm of each filter output to obtain the log power spectrum of the band, then applying the inverse discrete cosine transform to obtain M MFCC coefficients, where M typically takes 13-15; the MFCC coefficients are C(m) = Σk log x'(k)·cos(πm(2k−1)/(2K)), m = 1, 2, …, M, where K is the number of filters;
S15: taking the user-speech MFCC features of each time period as static features, then computing their first- and second-order differences to obtain the corresponding dynamic features.
4. The intelligent spoken language evaluation method according to claim 1, characterized in that step S1 further includes:
obtaining the spectral energy (fk) of the frequency range of each speech segment, together with the upper frequency limit k1 and lower frequency limit k2 of the segment, and obtaining the spectral energy ratio PNn within the speech segment.
5. The intelligent spoken language evaluation method according to claim 4, characterized in that step S1 further includes:
if the spectral energy (fk) in a speech segment is ≥ the first threshold and the spectral energy ratio PNn in the segment is ≥ the second threshold, the segment is judged to be a vowel segment; the first threshold takes 0.1-0.5 and the second threshold 60%-85%;
taking the spectral energy of the vowel segment as reference, it is judged whether the zero-crossing rate of the spectral energy before the vowel segment exceeds the third threshold; if so, that spectral energy is concluded to be the consonant segment before the vowel; the third threshold takes 100;
taking the spectral energy of the vowel segment as reference, it is judged whether the zero-crossing rate of the spectral energy after the vowel segment exceeds the third threshold; if so, that spectral energy is judged to be the consonant after the vowel;
if the zero-crossing rate of the spectral energy after the vowel segment exceeds the third threshold and that spectral energy lies in the last frame of the speech segment, it is judged to be a nasal tail consonant.
6. The intelligent spoken language evaluation method according to claim 5, characterized in that step S2 further includes:
setting the vowel-segment MFCC feature vectors and consonant-segment MFCC feature vectors of the user's phonetic features in each time period;
using the DTW algorithm to obtain the minimum-error alignment path and its corresponding DTW distance;
based on the alignment path and the corresponding DTW distance, comparing the vowel-segment MFCC feature vectors of the user's phonetic features with those of the standard phonetic features within the same time period, and likewise comparing the consonant-segment MFCC feature vectors, to obtain the pronunciation difference between the user's phonetic features and the standard phonetic features.
7. The intelligent spoken language evaluation method according to claim 5, characterized in that step S2 further comprises:
setting the vowel-section standard speech feature vector of the standard speech feature in each time period as P1 = [p1(1), p1(2), …, p1(R)], whose first-order difference vector is PΔ1 = [pΔ1(1), pΔ1(2), …, pΔ1(R)] (R is the vowel-section voice length of the standard speech feature), where pΔ1(n) = |p1(n) - p1(n-1)|, n = 1, 2, …, R, and p1(0) = 0;
setting the consonant-section standard speech feature vector of the standard speech feature in each time period as P'1 = [p'1(1), p'1(2), …, p'1(R)], whose first-order difference vector is P'Δ1 = [p'Δ1(1), p'Δ1(2), …, p'Δ1(R)] (R is the voice length of the standard speech feature), where p'Δ1(n) = |p'1(n) - p'1(n-1)|, n = 1, 2, …, R, and p'1(0) = 0.
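The first-order difference vectors defined above reduce to an absolute-difference recurrence with p(0) = 0; a minimal sketch:

```python
def first_order_difference(p):
    """First-order difference vector of claim 7: pΔ(n) = |p(n) - p(n-1)|, with p(0) = 0."""
    prev = 0.0
    out = []
    for x in p:
        out.append(abs(x - prev))
        prev = x
    return out

print(first_order_difference([3.0, 5.0, 2.0]))  # [3.0, 2.0, 3.0]
```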
8. The intelligent spoken language evaluation method according to claim 7, characterized in that step S2 further comprises:
setting the vowel-section feature vector of the user speech feature in each time period as P2 = [p2(1), p2(2), …, p2(T)], whose first-order difference vector is PΔ2 = [pΔ2(1), pΔ2(2), …, pΔ2(T)] (T is the length of the voice to be evaluated), where pΔ2(n) = |p2(n) - p2(n-1)|, n = 1, 2, …, T, and p2(0) = 0;
setting the consonant-section feature vector of the user speech feature in each time period as P'2 = [p'2(1), p'2(2), …, p'2(T)], whose first-order difference vector is P'Δ2 = [p'Δ2(1), p'Δ2(2), …, p'Δ2(T)] (T is the length of the voice to be evaluated), where p'Δ2(n) = |p'2(n) - p'2(n-1)|, n = 1, 2, …, T, and p'2(0) = 0;
using the DTW algorithm to obtain a minimum-error alignment path, and comparing the vowel sections and the consonant sections within each time period along that path;
the comparison yields the gap dp of the vowel section and the gap Δdp of its variation, as well as the gap d'p of the consonant section and the gap Δd'p of its variation, from which the similarity between the user speech feature and the standard speech feature is obtained, namely:
dp = |p1(n) - p2(m)|
d'p = |p'1(n) - p'2(m)|
Δdp = |Δp1(n) - Δp2(m)|
Δd'p = |Δp'1(n) - Δp'2(m)|
where Δpi(n) = |pi(n) - pi(n-1)|,
Δp'i(n) = |p'i(n) - p'i(n-1)|.
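For one aligned index pair (n, m) on the DTW path, the gap and the gap of the variation can be computed directly from the two feature vectors. The helper below is illustrative only; it follows the claim's 1-based indexing convention with p(0) = 0.

```python
def frame_gaps(p1, p2, n, m):
    """Gaps of claim 8 for one aligned pair (n, m).

    p1 -- standard (reference) feature vector for the vowel or consonant section
    p2 -- user feature vector
    Returns (d_p, delta_d_p): the feature gap and the gap of the variation.
    """
    def val(p, k):  # p(k) with the convention p(0) = 0
        return p[k - 1] if k >= 1 else 0.0

    d_p = abs(val(p1, n) - val(p2, m))
    dp1 = abs(val(p1, n) - val(p1, n - 1))   # Δp1(n)
    dp2 = abs(val(p2, m) - val(p2, m - 1))   # Δp2(m)
    delta_d_p = abs(dp1 - dp2)
    return d_p, delta_d_p

print(frame_gaps([1.0, 4.0], [1.0, 3.0], 2, 2))  # (1.0, 1.0)
```

The same helper applies unchanged to the consonant-section vectors P'1 and P'2, yielding d'p and Δd'p.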
9. The intelligent spoken language evaluation method according to claim 1, characterized in that step S3 further comprises: the score s is:
s = ω1(ω11s11 + ω12s12 + … + ω1js1j) + ω2(ω21s21 + ω22s22 + … + ω2js2j)
+ … + ωn(ωn1sn1 + ωn2sn2 + … + ωnjsnj)
where ω1, ω2, …, ωn respectively represent the weights of the voice segments;
j represents the total number of vowel sections plus consonant sections in each voice segment;
ω11, ω12, …, ω1j respectively represent the weights of the syllables in the first voice segment;
s11, s12, …, s1j respectively represent the scores of the syllables in the first voice segment;
ω21, ω22, …, ω2j respectively represent the weights of the syllables in the second voice segment;
s21, s22, …, s2j respectively represent the scores of the syllables in the second voice segment;
ωn1, ωn2, …, ωnj respectively represent the weights of the syllables in the n-th voice segment;
sn1, sn2, …, snj respectively represent the scores of the syllables in the n-th voice segment.
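The doubly weighted sum above is straightforward to compute once per-syllable scores are available. The sketch below uses made-up weights and scores purely for illustration:

```python
def overall_score(segment_weights, syllable_weights, syllable_scores):
    """Weighted score of claim 9: s = Σ_i ωi * Σ_k ωik * sik.

    segment_weights  -- [ω1, …, ωn], one weight per voice segment
    syllable_weights -- [[ωi1, …, ωij], …], per-segment syllable weights
    syllable_scores  -- [[si1, …, sij], …], per-segment syllable scores
    """
    return sum(
        w_seg * sum(w * s for w, s in zip(ws, ss))
        for w_seg, ws, ss in zip(segment_weights, syllable_weights, syllable_scores)
    )

# Two voice segments with two syllables each (illustrative values)
s = overall_score([0.6, 0.4],
                  [[0.5, 0.5], [0.7, 0.3]],
                  [[80, 90], [70, 100]])
print(s)  # ≈ 82.6
```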
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611181451.5A CN106531189A (en) | 2016-12-20 | 2016-12-20 | Intelligent spoken language evaluation method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106531189A true CN106531189A (en) | 2017-03-22 |
Family
ID=58340401
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611181451.5A Pending CN106531189A (en) | 2016-12-20 | 2016-12-20 | Intelligent spoken language evaluation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106531189A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20040073291A (en) * | 2004-01-08 | 2004-08-19 | 정보통신연구진흥원 | appraisal system of foreign language pronunciation and method thereof |
CN101727903A (en) * | 2008-10-29 | 2010-06-09 | 中国科学院自动化研究所 | Pronunciation quality assessment and error detection method based on fusion of multiple characteristics and multiple systems |
CN101996635A (en) * | 2010-08-30 | 2011-03-30 | 清华大学 | English pronunciation quality evaluation method based on accent highlight degree |
Non-Patent Citations (1)
Title |
---|
ZHOU XIAOLAN: "Speech in computer-aided Putonghua proficiency testing", Rural Economy and Science-Technology * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109727608A (en) * | 2017-10-25 | 2019-05-07 | 香港中文大学深圳研究院 | A kind of ill voice appraisal procedure based on Chinese speech |
CN107767862A (en) * | 2017-11-06 | 2018-03-06 | 深圳市领芯者科技有限公司 | Voice data processing method, system and storage medium |
CN108470476A (en) * | 2018-05-15 | 2018-08-31 | 黄淮学院 | A kind of pronunciation of English matching correcting system |
CN108470476B (en) * | 2018-05-15 | 2020-06-30 | 黄淮学院 | English pronunciation matching correction system |
CN109300484A (en) * | 2018-09-13 | 2019-02-01 | 广州酷狗计算机科技有限公司 | Audio alignment schemes, device, computer equipment and readable storage medium storing program for executing |
CN109300484B (en) * | 2018-09-13 | 2021-07-02 | 广州酷狗计算机科技有限公司 | Audio alignment method and device, computer equipment and readable storage medium |
CN109300474A (en) * | 2018-09-14 | 2019-02-01 | 北京网众共创科技有限公司 | A kind of audio signal processing method and device |
CN109300474B (en) * | 2018-09-14 | 2022-04-26 | 北京网众共创科技有限公司 | Voice signal processing method and device |
CN110825244A (en) * | 2019-11-06 | 2020-02-21 | 王一峰 | Modern Shanghai input method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106531189A (en) | Intelligent spoken language evaluation method | |
CN103065626B (en) | Automatic grading method and automatic grading equipment for read questions in test of spoken English | |
CN106782609A (en) | A kind of spoken comparison method | |
CN110457432A (en) | Interview methods of marking, device, equipment and storage medium | |
CN103617799A (en) | Method for detecting English statement pronunciation quality suitable for mobile device | |
US11282511B2 (en) | System and method for automatic speech analysis | |
CN106847260A (en) | A kind of Oral English Practice automatic scoring method of feature based fusion | |
CN103366735B (en) | The mapping method of speech data and device | |
CN102723077B (en) | Method and device for voice synthesis for Chinese teaching | |
CN107886968A (en) | Speech evaluating method and system | |
Hirson et al. | Glottal fry and voice disguise: a case study in forensic phonetics | |
Sabu et al. | Automatic Assessment of Children's L2 Reading for Accuracy and Fluency. | |
CN112767961B (en) | Accent correction method based on cloud computing | |
CN111210845B (en) | Pathological voice detection device based on improved autocorrelation characteristics | |
Dumpala et al. | Analysis of the Effect of Speech-Laugh on Speaker Recognition System. | |
Luo et al. | Investigation of the effects of automatic scoring technology on human raters' performances in L2 speech proficiency assessment | |
Li et al. | English sentence pronunciation evaluation using rhythm and intonation | |
Liu | Application of speech recognition technology in pronunciation correction of college oral English teaching | |
CN103021226B (en) | Voice evaluating method and device based on pronunciation rhythms | |
Duan et al. | An English pronunciation and intonation evaluation method based on the DTW algorithm | |
Jambi et al. | An Empirical Performance Analysis of the Speak Correct Computerized Interface | |
Yu | Evaluation of English Pronunciation Quality Based on Decision Tree Algorithm | |
Pakhomov et al. | Forced-alignment and edit-distance scoring for vocabulary tutoring applications | |
CN101546553A (en) | Objective examination method of flat-tongue sound and cacuminal in standard Chinese | |
Bolanos et al. | Automatic assessment of oral reading fluency for Spanish speaking ELs |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20170322 |