CN106782609A - A spoken-language comparison method - Google Patents

A spoken-language comparison method

Info

Publication number
CN106782609A
CN106782609A (application CN201710003810.6A)
Authority
CN
China
Prior art keywords
feature
section
spectrum energy
user
vowel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710003810.6A
Other languages
Chinese (zh)
Inventor
杨白宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Publication of CN106782609A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G - PHYSICS
    • G09 - EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B - EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B19/00 - Teaching not covered by other main groups of this subclass
    • G09B19/06 - Foreign languages

Abstract

The present invention provides a spoken-language comparison method: a standard text is set, the standard-pronunciation features of the standard text are obtained, and the standard-pronunciation features are stored in a database; the user reads the standard text aloud, the user's speech data is acquired, and the user's speech features are extracted from the speech data; the user's speech features are aligned with the standard-pronunciation features and compared against them; the user's speech features and the comparison result are stored in the database. The user can thus see which words in his or her speech are pronounced inaccurately relative to the standard pronunciation. This makes language study more convenient for the learner, improves the efficiency of foreign-language learning, and increases the learner's interest.

Description

A spoken-language comparison method
Technical field
The present invention relates to the field of language communication, and in particular to a spoken-language comparison method.
Background technology
Speech is the acoustic expression of language and a means by which people communicate information. Enabling people to produce, transmit, store and retrieve language information more efficiently promotes the development of society.
As China's reform and opening-up and its cooperation with other countries deepen, activities such as commercial exchange, cultural exchange and cross-border tourism become increasingly frequent, and more and more people need to learn a foreign language. A common problem for foreign-language learners is inaccurate pronunciation: given a text, a learner cannot tell where his or her own pronunciation differs from the standard pronunciation. This causes considerable difficulty for the learner and reduces the efficiency of foreign-language learning.
Summary of the invention
To overcome the above deficiency of the prior art, an object of the present invention is to provide a spoken-language comparison method. The method includes:
S1: set a standard text, obtain the standard-pronunciation features of the standard text, and store the standard-pronunciation features in a database;
S2: have the user read the standard text aloud, acquire the user's speech data, and extract the user's speech features from the speech data;
S3: align the user's speech features with the standard-pronunciation features and compare the two;
S4: store the user's speech features and the comparison result in the database.
Preferably, step S2 further includes:
S21: segment the user's speech data in time into n segments of 20 ms each, and apply a rectangular or Hamming window to each segment to obtain the windowed speech signal Xn, where n is the segment index;
S22: apply a short-time Fourier transform to each windowed signal Xn, converting the time-domain signal into the frequency-domain signal Yn, and compute its short-time energy spectrum Qn = |Yn|²;
S23: pass the short-time energy spectrum Qn from the vector space S through a bank of band-pass filters in first-in-first-out order; because the components within each band act on the human ear additively, the energies within each filter band are summed, giving the output power spectrum x'(k) of the k-th filter;
S24: take the logarithm of each filter output to obtain the log power spectrum of each band, then apply an inverse discrete cosine transform to obtain M MFCC coefficients, where M is typically 13-15; the MFCC coefficients are given by the formula in claim 2;
S25: take the resulting MFCC features as static features, then compute their first- and second-order differences to obtain the corresponding dynamic features.
Preferably, step S2 further includes:
obtaining the spectral energy h(fk) at each frequency of each speech segment and, given the upper frequency limit k1 and lower frequency limit k2 of the band of interest within the segment, obtaining the spectral energy ratio PNn of the segment as PNn = (Σk=k1..k2 h(fk) / Σk h(fk)) × 100%.
Preferably, step S3 further includes:
if the spectral energy h(fk) of a speech segment is not less than a first threshold and the spectral energy ratio PNn of the segment is not less than a second threshold, the segment is judged to be a vowel segment; the first threshold is 0.1-0.5 and the second threshold is 60%-85%;
taking the spectral energy of the vowel segment as the reference, if the zero-crossing rate of the frame preceding the vowel segment exceeds a third threshold, that frame is judged to be a consonant preceding the vowel; the third threshold is 100;
taking the spectral energy of the vowel segment as the reference, if the zero-crossing rate of the frame following the vowel segment exceeds the third threshold, that frame is judged to be a consonant following the vowel;
if the zero-crossing rate of the frame following the vowel segment exceeds the third threshold and that frame is the last frame of the speech segment, it is judged to be a nasal tail consonant (nasal coda).
Preferably, step S1 further includes:
segmenting the standard-pronunciation features in time into n segments of 20 ms each;
dividing the standard-pronunciation features of each time segment into static features and dynamic features;
decomposing the spectral energy of the standard-pronunciation features of each time segment to obtain the spectral energy distribution of its vowel segments and the spectral energy distribution of its consonant segments;
setting the vowel-segment MFCC feature vectors and the consonant-segment MFCC feature vectors of the standard-pronunciation features within each time segment.
Preferably, step S3 further includes:
setting the vowel-segment MFCC feature vectors and the consonant-segment MFCC feature vectors of the user's speech features within each time segment;
using the DTW algorithm, obtaining the alignment path with the minimum error and the corresponding DTW distance;
based on that alignment path and DTW distance, comparing the vowel-segment MFCC feature vectors of the user's speech features with the vowel-segment MFCC feature vectors of the standard pronunciation within the same time segment, and comparing the consonant-segment MFCC feature vectors of the user's speech features with the consonant-segment MFCC feature vectors of the standard pronunciation within the same time segment, thereby obtaining the pronunciation difference between the user's speech features and the standard-pronunciation features.
Preferably, step S1 further includes:
setting the vowel-segment standard speech feature vector of the standard pronunciation within each time segment as P1 = [p1(1), p1(2), …, p1(R)], with first-order difference vector PΔ1 = [pΔ1(1), pΔ1(2), …, pΔ1(R)], where R is the vowel-segment speech length of the standard-pronunciation features, pΔ1(n) = |p1(n) − p1(n−1)|, n = 1, 2, …, R, and p1(0) = 0;
setting the consonant-segment standard speech feature vector of the standard pronunciation within each time segment as P'1 = [p'1(1), p'1(2), …, p'1(R)], with first-order difference vector P'Δ1 = [p'Δ1(1), p'Δ1(2), …, p'Δ1(R)], where R is the consonant-segment speech length of the standard-pronunciation features, p'Δ1(n) = |p'1(n) − p'1(n−1)|, n = 1, 2, …, R, and p'1(0) = 0.
Preferably, step S3 further includes:
setting the vowel-segment feature vector of the user's speech features within each time segment as P2 = [p2(1), p2(2), …, p2(T)], with first-order difference vector PΔ2 = [pΔ2(1), pΔ2(2), …, pΔ2(T)], where T is the length of the speech to be evaluated, pΔ2(n) = |p2(n) − p2(n−1)|, n = 1, 2, …, T, and p2(0) = 0;
setting the consonant-segment feature vector of the user's speech features within each time segment as P'2 = [p'2(1), p'2(2), …, p'2(T)], with first-order difference vector P'Δ2 = [p'Δ2(1), p'Δ2(2), …, p'Δ2(T)], where T is the length of the speech to be evaluated, p'Δ2(n) = |p'2(n) − p'2(n−1)|, n = 1, 2, …, T, and p'2(0) = 0;
using the DTW algorithm, obtaining the alignment path with the minimum error, and comparing the vowel segments and consonant segments within each time segment along that path;
computing the vowel-segment gap dp and its variation gap Δdp, and the consonant-segment gap d'p and its variation gap Δd'p, to obtain the similarity between the user's speech features and the standard-pronunciation features, namely:
dp = |p1(n) − p2(m)|
d'p = |p'1(n) − p'2(m)|
Δdp = |Δp1(n) − Δp2(m)|
Δd'p = |Δp'1(n) − Δp'2(m)|
where Δpi(n) = |pi(n) − pi(n−1)| and Δp'i(n) = |p'i(n) − p'i(n−1)|.
As can be seen from the above technical solution, the present invention has the following advantages:
The spoken-language comparison method has the user and the computer work from the same text and compares the user's reading against it, so the user can see which words in his or her speech are pronounced inaccurately relative to the standard pronunciation and which words need improvement and further practice. This makes language study more convenient for the learner, improves the efficiency of foreign-language learning, and increases the learner's interest.
Brief description of the drawings
Fig. 1 is a flow chart of the spoken-language comparison method.
Specific embodiment
To make the objects, features and advantages of the invention clearer and easier to understand, the technical solution protected by the present invention is described clearly and completely below with reference to specific embodiments and the accompanying drawing. The embodiments disclosed below are obviously only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments in this patent without creative work fall within the scope of protection of this patent.
The present invention provides a spoken-language comparison method, as shown in Fig. 1. The method uses a standard text: the computer first obtains the content of the standard text and the standard pronunciation of that text. The method of the invention is implemented on computer hardware running a corresponding program. The user and the computer therefore work from the same text, and the user's reading is compared against it, so the user can see which words in his or her speech are pronounced inaccurately relative to the standard pronunciation and which words need improvement and further practice. This makes language study more convenient for the learner, improves the efficiency of foreign-language learning, and increases the learner's interest.
The method includes:
S1: set a standard text, obtain the standard-pronunciation features of the standard text, and store the standard-pronunciation features in a database;
S2: have the user read the standard text aloud, acquire the user's speech data, and extract the user's speech features from the speech data;
S3: align the user's speech features with the standard-pronunciation features and compare the two;
S4: store the user's speech features and the comparison result in the database.
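The overall S1-S4 flow can be pictured with a short Python sketch. Everything in it is hypothetical scaffolding added for illustration (the SQLite table layout, the `extract_features` placeholder and the single difference score are not part of the patent); it only shows the order of operations: store the standard features, capture the user's reading, align and compare, then store the result.

```python
import sqlite3
import numpy as np

def extract_features(audio: np.ndarray) -> np.ndarray:
    """Placeholder for the MFCC-based feature extraction of step S2 (not the real features)."""
    return audio.reshape(-1, 320).mean(axis=1)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE features (role TEXT, vec BLOB)")
db.execute("CREATE TABLE results (score REAL)")

# S1: the standard text's pronunciation features go into the database
standard_audio = np.random.randn(16000)          # stand-in for the reference reading
std = extract_features(standard_audio)
db.execute("INSERT INTO features VALUES ('standard', ?)", (std.tobytes(),))

# S2: the user reads the same text aloud; extract the user's features
user_audio = np.random.randn(16000)
usr = extract_features(user_audio)

# S3: align (trivially, equal length here) and compare
score = float(np.mean(np.abs(std - usr)))

# S4: store the user's features and the comparison result
db.execute("INSERT INTO features VALUES ('user', ?)", (usr.tobytes(),))
db.execute("INSERT INTO results VALUES (?)", (score,))
db.commit()
print("difference score:", score)
```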
Step S2 further includes:
S21: segment the user's speech data in time into n segments of 20 ms each, and apply a rectangular or Hamming window to each segment to obtain the windowed speech signal Xn, where n is the segment index;
S22: apply a short-time Fourier transform to each windowed signal Xn, converting the time-domain signal into the frequency-domain signal Yn, and compute its short-time energy spectrum Qn = |Yn|²;
S23: pass the short-time energy spectrum Qn from the vector space S through a bank of band-pass filters in first-in-first-out order; because the components within each band act on the human ear additively, the energies within each filter band are summed, giving the output power spectrum x'(k) of the k-th filter;
S24: take the logarithm of each filter output to obtain the log power spectrum of each band, then apply an inverse discrete cosine transform to obtain M MFCC coefficients, where M is typically 13-15;
S25: take the resulting MFCC features as static features, then compute their first- and second-order differences to obtain the corresponding dynamic features.
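Steps S21-S25 can be illustrated with a minimal Python sketch. This is only an illustration, not the patented implementation: the 20 ms framing, Hamming window, short-time energy spectrum, mel-style band-pass filters, log compression, DCT and delta features follow the steps above, but the sample rate, FFT size, filter count and frame layout are assumed values.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_features(signal, sr=16000, frame_ms=20, n_filters=26, n_mfcc=13):
    """Rough MFCC sketch following steps S21-S25 (all parameters are assumptions)."""
    # S21: split into 20 ms frames and apply a Hamming window
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    frames = frames * np.hamming(frame_len)

    # S22: short-time Fourier transform and short-time energy spectrum Qn = |Yn|^2
    nfft = 512
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2

    # S23: triangular mel band-pass filters; sum the energy inside each band -> x'(k)
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for k in range(1, n_filters + 1):
        l, c, r = bins[k - 1], bins[k], bins[k + 1]
        fbank[k - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[k - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    band_energy = power @ fbank.T                      # x'(k) for every frame

    # S24: log power spectrum, then an inverse DCT gives the M MFCC coefficients
    log_energy = np.log(band_energy + 1e-10)
    mfcc = dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_mfcc]

    # S25: static features plus first- and second-order differences (dynamic features)
    delta = np.diff(mfcc, axis=0, prepend=mfcc[:1])
    delta2 = np.diff(delta, axis=0, prepend=delta[:1])
    return np.hstack([mfcc, delta, delta2])
```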
In the present embodiment, step S2 further includes:
obtaining the spectral energy h(fk) at each frequency of each speech segment and, given the upper frequency limit k1 and lower frequency limit k2 of the band of interest within the segment, obtaining the spectral energy ratio PNn of the segment as PNn = (Σk=k1..k2 h(fk) / Σk h(fk)) × 100%.
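A small sketch of this band-energy ratio, under the assumption that h(fk) is simply the power-spectrum value at frequency fk; the sample rate, FFT size and band limits in the usage example are made up for illustration.

```python
import numpy as np

def band_energy_ratio(power_spectrum, freqs, f_low, f_high):
    """Spectral energy ratio PNn of one speech segment: the energy between the
    band limits divided by the total energy, expressed as a percentage."""
    band = (freqs >= f_low) & (freqs <= f_high)
    return power_spectrum[band].sum() / power_spectrum.sum() * 100

# Hypothetical usage: power spectrum of one 20 ms frame sampled at 16 kHz
frame = np.random.randn(320)
spectrum = np.abs(np.fft.rfft(frame, 512)) ** 2
freqs = np.fft.rfftfreq(512, d=1 / 16000)
print(band_energy_ratio(spectrum, freqs, 300, 3000))
```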
Step S3 further includes:
if the spectral energy h(fk) of a speech segment is not less than a first threshold and the spectral energy ratio PNn of the segment is not less than a second threshold, the segment is judged to be a vowel segment; the first threshold is 0.1-0.5 and the second threshold is 60%-85%;
taking the spectral energy of the vowel segment as the reference, if the zero-crossing rate of the frame preceding the vowel segment exceeds a third threshold, that frame is judged to be a consonant preceding the vowel; the third threshold is 100;
taking the spectral energy of the vowel segment as the reference, if the zero-crossing rate of the frame following the vowel segment exceeds the third threshold, that frame is judged to be a consonant following the vowel;
if the zero-crossing rate of the frame following the vowel segment exceeds the third threshold and that frame is the last frame of the speech segment, it is judged to be a nasal tail consonant (nasal coda).
Decomposing each of the user's speech segments in this way yields its vowel segments, its consonant segments, and whether the last frame of the segment carries a nasal tail consonant; a nasal tail consonant is a nasal sound.
The vowel segments, consonant segments, and the presence of a nasal tail consonant in the last frame of each speech segment of the standard text are pre-set in the computer. The vowel segments, consonant segments, and the nasal tail consonant of the last frame of each speech segment read aloud by the user are each compared with the corresponding standard-pronunciation features.
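The threshold logic above can be sketched as follows. This is a simplified reading of the rules, with assumed threshold values inside the stated ranges; in particular, the unit of the zero-crossing threshold (a raw crossing count per frame) is an assumption.

```python
import numpy as np

def zero_crossing_count(frame):
    """Number of sign changes in one frame (used here as the zero-crossing rate)."""
    return int(np.sum(np.abs(np.diff(np.sign(frame))) > 0))

def label_frames(frames, energies, ratios, t1=0.3, t2=0.70, t3=100):
    """Label each 20 ms frame as vowel / consonant / nasal coda / other.
    t1, t2, t3 stand for the first, second and third thresholds; ratios are
    given as fractions (0.70 = 70%). Values are assumptions within the ranges
    stated in the description."""
    n = len(frames)
    labels = ['other'] * n
    for i in range(n):
        if energies[i] >= t1 and ratios[i] >= t2:
            labels[i] = 'vowel'
    for i, lab in enumerate(labels):
        if lab != 'vowel':
            continue
        # frame before a vowel with a high zero-crossing rate -> preceding consonant
        if i > 0 and labels[i - 1] == 'other' and zero_crossing_count(frames[i - 1]) > t3:
            labels[i - 1] = 'consonant'
        # frame after a vowel with a high zero-crossing rate -> following consonant,
        # or a nasal coda if it is the last frame of the segment
        if i < n - 1 and labels[i + 1] == 'other' and zero_crossing_count(frames[i + 1]) > t3:
            labels[i + 1] = 'nasal coda' if i + 1 == n - 1 else 'consonant'
    return labels
```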
Step S1 further includes:
segmenting the standard-pronunciation features in time into n segments of 20 ms each;
dividing the standard-pronunciation features of each time segment into static features and dynamic features;
decomposing the spectral energy of the standard-pronunciation features of each time segment to obtain the spectral energy distribution of its vowel segments and the spectral energy distribution of its consonant segments;
setting the vowel-segment MFCC feature vectors and the consonant-segment MFCC feature vectors of the standard-pronunciation features within each time segment.
Step S3 further includes:
setting the vowel-segment MFCC feature vectors and the consonant-segment MFCC feature vectors of the user's speech features within each time segment;
using the DTW algorithm, obtaining the alignment path with the minimum error and the corresponding DTW distance;
based on that alignment path and DTW distance, comparing the vowel-segment MFCC feature vectors of the user's speech features with the vowel-segment MFCC feature vectors of the standard pronunciation within the same time segment, and comparing the consonant-segment MFCC feature vectors of the user's speech features with the consonant-segment MFCC feature vectors of the standard pronunciation within the same time segment, thereby obtaining the pronunciation difference between the user's speech features and the standard-pronunciation features.
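A textbook DTW alignment between two MFCC sequences, for illustration only. The Euclidean frame distance and the unconstrained warping path are assumptions; the patent only states that the minimum-error alignment path and the corresponding DTW distance are used.

```python
import numpy as np

def dtw_align(ref, test):
    """Plain dynamic time warping between two MFCC sequences (frames x coefficients).
    Returns the minimum-error alignment path and the DTW distance."""
    R, T = len(ref), len(test)
    cost = np.full((R + 1, T + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, R + 1):
        for j in range(1, T + 1):
            d = np.linalg.norm(ref[i - 1] - test[j - 1])   # local frame distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # backtrack the minimum-error path
    path, i, j = [], R, T
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1], cost[R, T]
```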
Step S1 further includes:
setting the vowel-segment standard speech feature vector of the standard pronunciation within each time segment as P1 = [p1(1), p1(2), …, p1(R)], with first-order difference vector PΔ1 = [pΔ1(1), pΔ1(2), …, pΔ1(R)], where R is the vowel-segment speech length of the standard-pronunciation features, pΔ1(n) = |p1(n) − p1(n−1)|, n = 1, 2, …, R, and p1(0) = 0;
setting the consonant-segment standard speech feature vector of the standard pronunciation within each time segment as P'1 = [p'1(1), p'1(2), …, p'1(R)], with first-order difference vector P'Δ1 = [p'Δ1(1), p'Δ1(2), …, p'Δ1(R)], where R is the consonant-segment speech length of the standard-pronunciation features, p'Δ1(n) = |p'1(n) − p'1(n−1)|, n = 1, 2, …, R, and p'1(0) = 0.
Step S3 further includes:
setting the vowel-segment feature vector of the user's speech features within each time segment as P2 = [p2(1), p2(2), …, p2(T)], with first-order difference vector PΔ2 = [pΔ2(1), pΔ2(2), …, pΔ2(T)], where T is the length of the speech to be evaluated, pΔ2(n) = |p2(n) − p2(n−1)|, n = 1, 2, …, T, and p2(0) = 0;
setting the consonant-segment feature vector of the user's speech features within each time segment as P'2 = [p'2(1), p'2(2), …, p'2(T)], with first-order difference vector P'Δ2 = [p'Δ2(1), p'Δ2(2), …, p'Δ2(T)], where T is the length of the speech to be evaluated, p'Δ2(n) = |p'2(n) − p'2(n−1)|, n = 1, 2, …, T, and p'2(0) = 0;
using the DTW algorithm, obtaining the alignment path with the minimum error, and comparing the vowel segments and consonant segments within each time segment along that path;
computing the vowel-segment gap dp and its variation gap Δdp, and the consonant-segment gap d'p and its variation gap Δd'p, to obtain the similarity between the user's speech features and the standard-pronunciation features, namely:
dp = |p1(n) − p2(m)|
d'p = |p'1(n) − p'2(m)|
Δdp = |Δp1(n) − Δp2(m)|
Δd'p = |Δp'1(n) − Δp'2(m)|
where Δpi(n) = |pi(n) − pi(n−1)| and Δp'i(n) = |p'i(n) − p'i(n−1)|.
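A sketch of the per-pair gaps above, computed along a DTW path for a single (vowel or consonant) segment. The averaging of the gaps into one similarity score is an assumption; the description only defines the per-pair values dp, d'p, Δdp and Δd'p. The usage example reuses the dtw_align sketch shown earlier.

```python
import numpy as np

def segment_differences(p_ref, p_usr, path):
    """Gaps d_p and variation gaps Δd_p along a DTW path; p_ref and p_usr are
    1-D feature vectors and `path` pairs an index n of the reference with an
    index m of the user's speech."""
    def delta(p, n):                      # Δp(n) = |p(n) - p(n-1)|, with p(0) taken as 0
        return abs(p[n] - (p[n - 1] if n > 0 else 0.0))
    d = [abs(p_ref[n] - p_usr[m]) for n, m in path]                  # d_p
    dd = [abs(delta(p_ref, n) - delta(p_usr, m)) for n, m in path]   # Δd_p
    return float(np.mean(d)), float(np.mean(dd))  # averaged into a score (assumption)

# Hypothetical usage with the dtw_align sketch above (vectors treated as 1-D frames)
ref = np.array([0.1, 0.4, 0.5, 0.3])
usr = np.array([0.2, 0.5, 0.4, 0.2, 0.1])
path, _ = dtw_align(ref.reshape(-1, 1), usr.reshape(-1, 1))
print(segment_differences(ref, usr, path))
```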

Claims (8)

1. A spoken-language comparison method, characterised in that the method includes:
S1: setting a standard text, obtaining the standard-pronunciation features of the standard text, and storing the standard-pronunciation features in a database;
S2: having the user read the standard text aloud, acquiring the user's speech data, and extracting the user's speech features from the speech data;
S3: aligning the user's speech features with the standard-pronunciation features and comparing the two;
S4: storing the user's speech features and the comparison result in the database.
2. The spoken-language comparison method according to claim 1, characterised in that step S2 further includes:
S21: segmenting the user's speech data in time into n segments of 20 ms each, and applying a rectangular or Hamming window to each segment to obtain the windowed speech signal Xn, where n is the segment index;
S22: applying a short-time Fourier transform to each windowed signal Xn, converting the time-domain signal into the frequency-domain signal Yn, and computing its short-time energy spectrum Qn = |Yn|²;
S23: passing the short-time energy spectrum Qn from the vector space S through a bank of band-pass filters in first-in-first-out order; because the components within each band act on the human ear additively, the energies within each filter band are summed, giving the output power spectrum x'(k) of the k-th filter;
S24: taking the logarithm of each filter output to obtain the log power spectrum of each band, then applying an inverse discrete cosine transform to obtain M MFCC coefficients, where M typically takes 13-15; the MFCC coefficients are
Cn = Σk=1..M log x'(k) · cos((2k+1)nπ / (2M));
S25: taking the resulting MFCC features as static features, then computing their first- and second-order differences to obtain the corresponding dynamic features.
3. The spoken-language comparison method according to claim 1, characterised in that step S2 further includes:
obtaining the spectral energy h(fk) at each frequency of each speech segment and, given the upper frequency limit k1 and lower frequency limit k2 of the band of interest within the segment, obtaining the spectral energy ratio PNn of the segment as
PNn = (Σk=k1..k2 h(fk) / Σk h(fk)) × 100%.
4. The spoken-language comparison method according to claim 1, characterised in that step S3 further includes:
if the spectral energy h(fk) of a speech segment is not less than a first threshold and the spectral energy ratio PNn of the segment is not less than a second threshold, judging the segment to be a vowel segment; the first threshold is 0.1-0.5 and the second threshold is 60%-85%;
taking the spectral energy of the vowel segment as the reference, if the zero-crossing rate of the frame preceding the vowel segment exceeds a third threshold, judging that frame to be a consonant preceding the vowel; the third threshold is 100;
taking the spectral energy of the vowel segment as the reference, if the zero-crossing rate of the frame following the vowel segment exceeds the third threshold, judging that frame to be a consonant following the vowel;
if the zero-crossing rate of the frame following the vowel segment exceeds the third threshold and that frame is the last frame of the speech segment, judging it to be a nasal tail consonant.
5. The spoken-language comparison method according to claim 4, characterised in that step S1 further includes:
segmenting the standard-pronunciation features in time into n segments of 20 ms each;
dividing the standard-pronunciation features of each time segment into static features and dynamic features;
decomposing the spectral energy of the standard-pronunciation features of each time segment to obtain the spectral energy distribution of its vowel segments and the spectral energy distribution of its consonant segments;
setting the vowel-segment MFCC feature vectors and the consonant-segment MFCC feature vectors of the standard-pronunciation features within each time segment.
6. The spoken-language comparison method according to claim 5, characterised in that step S3 further includes:
setting the vowel-segment MFCC feature vectors and the consonant-segment MFCC feature vectors of the user's speech features within each time segment;
using the DTW algorithm, obtaining the alignment path with the minimum error and the corresponding DTW distance;
based on that alignment path and DTW distance, comparing the vowel-segment MFCC feature vectors of the user's speech features with the vowel-segment MFCC feature vectors of the standard pronunciation within the same time segment, and comparing the consonant-segment MFCC feature vectors of the user's speech features with the consonant-segment MFCC feature vectors of the standard pronunciation within the same time segment, thereby obtaining the pronunciation difference between the user's speech features and the standard-pronunciation features.
7. The spoken-language comparison method according to claim 5, characterised in that step S1 further includes:
setting the vowel-segment standard speech feature vector of the standard pronunciation within each time segment as P1 = [p1(1), p1(2), …, p1(R)], with first-order difference vector PΔ1 = [pΔ1(1), pΔ1(2), …, pΔ1(R)], where R is the vowel-segment speech length of the standard-pronunciation features, pΔ1(n) = |p1(n) − p1(n−1)|, n = 1, 2, …, R, and p1(0) = 0;
setting the consonant-segment standard speech feature vector of the standard pronunciation within each time segment as P'1 = [p'1(1), p'1(2), …, p'1(R)], with first-order difference vector P'Δ1 = [p'Δ1(1), p'Δ1(2), …, p'Δ1(R)], where R is the consonant-segment speech length of the standard-pronunciation features, p'Δ1(n) = |p'1(n) − p'1(n−1)|, n = 1, 2, …, R, and p'1(0) = 0.
8. The spoken-language comparison method according to claim 7, characterised in that step S3 further includes:
setting the vowel-segment feature vector of the user's speech features within each time segment as P2 = [p2(1), p2(2), …, p2(T)], with first-order difference vector PΔ2 = [pΔ2(1), pΔ2(2), …, pΔ2(T)], where T is the length of the speech to be evaluated, pΔ2(n) = |p2(n) − p2(n−1)|, n = 1, 2, …, T, and p2(0) = 0;
setting the consonant-segment feature vector of the user's speech features within each time segment as P'2 = [p'2(1), p'2(2), …, p'2(T)], with first-order difference vector P'Δ2 = [p'Δ2(1), p'Δ2(2), …, p'Δ2(T)], where T is the length of the speech to be evaluated, p'Δ2(n) = |p'2(n) − p'2(n−1)|, n = 1, 2, …, T, and p'2(0) = 0;
using the DTW algorithm, obtaining the alignment path with the minimum error, and comparing the vowel segments and consonant segments within each time segment along that path;
computing the vowel-segment gap dp and its variation gap Δdp, and the consonant-segment gap d'p and its variation gap Δd'p, to obtain the similarity between the user's speech features and the standard-pronunciation features, namely:
dp = |p1(n) − p2(m)|
d'p = |p'1(n) − p'2(m)|
Δdp = |Δp1(n) − Δp2(m)|
Δd'p = |Δp'1(n) − Δp'2(m)|
where Δpi(n) = |pi(n) − pi(n−1)| and Δp'i(n) = |p'i(n) − p'i(n−1)|.
CN201710003810.6A 2016-12-20 2017-01-03 A kind of spoken comparison method Pending CN106782609A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201611181163 2016-12-20
CN201611181163X 2016-12-20

Publications (1)

Publication Number Publication Date
CN106782609A true CN106782609A (en) 2017-05-31

Family

ID=58950067

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710003810.6A Pending CN106782609A (en) 2016-12-20 2017-01-03 A kind of spoken comparison method

Country Status (1)

Country Link
CN (1) CN106782609A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101740024A (en) * 2008-11-19 2010-06-16 中国科学院自动化研究所 Method for automatic evaluation based on generalized fluent spoken language fluency
CN101782941A (en) * 2009-01-16 2010-07-21 国际商业机器公司 Method and system for evaluating spoken language skill
CN101872616A (en) * 2009-04-22 2010-10-27 索尼株式会社 Endpoint detection method and system using same
CN102568475A (en) * 2011-12-31 2012-07-11 安徽科大讯飞信息科技股份有限公司 System and method for assessing proficiency in Putonghua
CN105609114A (en) * 2014-11-25 2016-05-25 科大讯飞股份有限公司 Method and device for detecting pronunciation
CN104732977A (en) * 2015-03-09 2015-06-24 广东外语外贸大学 On-line spoken language pronunciation quality evaluation method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
庄毅 著: "《面向互联网的多媒体大数据信息高效查询处理》", 30 June 2015, 浙江大学出版社 *
王炳锡 等著: "《实用语音识别基础》", 31 January 2005, 国防工业出版社 *
韩纪庆 等编著: "《音频信息处理技术》", 31 January 2007, 清华大学出版社 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107767862A (en) * 2017-11-06 2018-03-06 深圳市领芯者科技有限公司 Voice data processing method, system and storage medium
CN108470476A (en) * 2018-05-15 2018-08-31 黄淮学院 A kind of pronunciation of English matching correcting system
CN108470476B (en) * 2018-05-15 2020-06-30 黄淮学院 English pronunciation matching correction system
CN109192223A (en) * 2018-09-20 2019-01-11 广州酷狗计算机科技有限公司 The method and apparatus of audio alignment
CN109326162A (en) * 2018-11-16 2019-02-12 深圳信息职业技术学院 A kind of spoken language exercise method for automatically evaluating and device
CN111241308A (en) * 2020-02-27 2020-06-05 曾兴 Self-help learning method and system for spoken language
CN111241308B (en) * 2020-02-27 2024-04-26 曾兴 Self-help learning method and system for spoken language
WO2022169417A1 (en) * 2021-02-07 2022-08-11 脸萌有限公司 Speech similarity determination method, device and program product
CN113436487A (en) * 2021-07-08 2021-09-24 上海松鼠课堂人工智能科技有限公司 Chinese reciting skill training method and system based on virtual reality scene

Similar Documents

Publication Publication Date Title
CN106782609A (en) A kind of spoken comparison method
CN105529028B (en) Speech analysis method and apparatus
US20190266998A1 (en) Speech recognition method and device, computer device and storage medium
CN103345923B (en) A kind of phrase sound method for distinguishing speek person based on rarefaction representation
CN106486131A (en) A kind of method and device of speech de-noising
CN106531189A (en) Intelligent spoken language evaluation method
CN110457432A (en) Interview methods of marking, device, equipment and storage medium
CN101887725A (en) Phoneme confusion network-based phoneme posterior probability calculation method
WO2020034628A1 (en) Accent identification method and device, computer device, and storage medium
CN104078039A (en) Voice recognition system of domestic service robot on basis of hidden Markov model
Marković et al. Whispered speech database: Design, processing and application
CN103456302A (en) Emotion speaker recognition method based on emotion GMM model weight synthesis
CN112509568A (en) Voice awakening method and device
Yang et al. Intrinsic spectral analysis based on temporal context features for query-by-example spoken term detection
Dua et al. Discriminative training using heterogeneous feature vector for Hindi automatic speech recognition system
Yadav et al. Non-Uniform Spectral Smoothing for Robust Children's Speech Recognition.
Hou et al. Intelligent model for speech recognition based on svm: a case study on English language
Shah et al. Speech emotion recognition based on SVM using MATLAB
Akila et al. Isolated Tamil word speech recognition system using HTK
CN112767961B (en) Accent correction method based on cloud computing
Shufang Design of an automatic english pronunciation error correction system based on radio magnetic pronunciation recording devices
Khetri et al. Automatic speech recognition for marathi isolated words
Ghonem et al. Classification of stuttering events using i-vector
Kathania et al. Spectral modification for recognition of children’s speech under mismatched conditions
Ma et al. Statistical formant descriptors with linear predictive coefficients for accent classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2017-05-31)