CN109727608A - A kind of ill voice appraisal procedure based on Chinese speech - Google Patents
- Publication number
- CN109727608A (application CN201711010103.6A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Electrically Operated Instructional Devices (AREA)
Abstract
The present invention provides a pathological voice assessment method based on Chinese speech, comprising the following steps: using a Chinese automatic speech recognition system, automatically align the input continuous speech signal with its corresponding text and complete the phone segmentation; according to the time sequence of the phone segmentation, cut the continuous speech signal into vowels, voiceless consonants and voiced consonants, and then extract features; the extracted features form a feature parameter set, which is screened by a feature selector and then input to a classifier to obtain fine-grained scores for voice problems. The method trains the automatic speech recognition system on a large-scale normal-voice database, classifies the segmented phone sequence according to the articulation characteristics of the different phonemes, designs feature parameters separately for the different phoneme types, and finally obtains both an objective overall score and fine-grained scores for each voice problem, providing an important reference for clinical diagnosis and rehabilitation.
Description
Technical field
The present invention relates to the field of voice detection technology, and in particular to a pathological voice assessment method based on Chinese speech.
Background art
Voice is the carrier of human verbal communication. The phonation system produces sound under the driving of vocal-fold vibration, and the sound propagates outward through the channel formed by the pharynx, larynx and oral cavity. Sound carries different kinds of information and is an essential means of communication between people. The sound produced by the phonation system can be described by a waveform signal, called the voice signal. When the vocal organs are in a normal state, the vibration of the vocal folds has an obviously periodic pattern and the vocal tract formed by the throat and oral cavity also varies regularly, so the resulting voice signal is likewise very regular.
In daily life, voice is used not only to exchange information but also for singing, performing and expressing emotion. A voice disorder makes everyday communication difficult and reduces its efficiency. The main causes of voice disorders include overuse of the throat, improper phonation, and infections of the vocal organs. A voice disorder shows up clearly in the voice signal: for example, the pitch, volume, formants or other features become irregular, or the sound quality and timbre of the voice change abnormally in ways inconsistent with the speaker's age, gender, mother tongue and cultural background.
In the clinical diagnosis and treatment of voice disorders, human perceptual evaluation of the voice signal is a standard and important non-invasive method. Although its evaluation criteria and operating procedures have been developed over many years, human perceptual evaluation cannot satisfy the demand of modern society because sufficiently experienced speech therapists are scarce. Automatic voice assessment based on digital signal processing therefore has important practical value for the clinical diagnosis and treatment of voice disorders; its advantages include non-invasiveness, repeatability, objectivity and reliability. Traditional voice analysis methods typically analyze only sustained long vowels produced at a fixed articulation position, which simplifies the acoustic analysis, but compared with connected speech, long vowels have the following limitations:
(1) Studies have shown that voice signal quality is influenced by linguistic factors such as consonant-vowel transitions; since these factors are absent in a sustained vowel, the voice signal produced during a long vowel cannot represent the voice signal of complex natural-language communication;
(2) Producing a long vowel is subject to many constraints, such as keeping the volume, pitch and articulator position constant, so recording long vowels is less easy and flexible than recording natural continuous speech;
(3) In addition, some studies have shown that speech therapists' perceptual evaluations are less reliable on long vowels than on continuous speech.
At present, research on the objective evaluation of pathological voice mostly focuses on evaluating the overall quality of the voice signal, using regression analysis or classification. In reality, however, pathological voice has diverse clinical manifestations, including hoarseness, strained phonation, glottal dryness, mispronunciation and tremor; evaluating only the overall severity of a voice problem cannot distinguish the specific problems and so offers little help to the clinic. Little research to date can quantify individual fine-grained aspects of a voice, such as roughness or breathiness. We believe that studying these fine-grained characteristics is of key importance for clinical treatment.
Summary of the invention
To solve the problem that none of the existing methods can evaluate pathological voice at a fine-grained level, the present invention provides a pathological voice assessment method based on Chinese speech.
To solve the above-mentioned problems, the technical solution adopted by the present invention is as described below:
A pathological voice assessment method based on Chinese speech comprises the following steps: S1, using a Chinese automatic speech recognition system, automatically align the input continuous speech signal with its corresponding text and complete the phone segmentation; S2, according to the time sequence of the phone segmentation, cut the continuous speech signal into vowels, voiceless consonants and voiced consonants, and then extract features; S3, the extracted features form a feature parameter set, which is screened by a feature selector and then input to a classifier to obtain fine-grained scores for voice problems.
Preferably, the Chinese automatic speech recognition system is a deep neural network-hidden Markov model (DNN-HMM) Chinese speech recognition system trained on a normal-voice database.
Preferably, the Chinese automatic speech recognition system uses speaker adaptation technology.
Preferably, the Chinese automatic speech recognition system uses the Viterbi algorithm to force-align the continuous speech signal with the phone sequence of its corresponding text, completing the phone segmentation.
Preferably, the features in step S2 include a first class of features: fundamental frequency jitter, amplitude shimmer, harmonics-to-noise ratio, pitch, energy intensity; a second class: voice breaks, unvoiced-signal fraction; a third class: Mel-frequency cepstral coefficients; and a fourth class: high/low-frequency energy ratio, pitch variation, HNR variation, continuous energy difference, continuous-energy-difference variation, normalized energy.
Preferably, the parameters of the first, second and fourth feature classes are extracted from the vowels; the parameters of the second, third and fourth classes from the voiced consonants; and the parameters of the third and fourth classes from the voiceless consonants.
Preferably, in the fourth class of features: the high/low-frequency energy ratio HiLo is computed from Ehi, the energy of the band above 1200 Hz, Elo, the energy of the band below 1200 Hz, and E0, the background-noise energy; the pitch variation var_F0 is the variance of the per-frame fundamental frequency F0 over all n frames of a segment of continuous speech, where var is the var function of MATLAB; the HNR variation var_HNR is the variance of the per-frame harmonics-to-noise ratio HNR over the n frames; the continuous energy difference D3_Energy is computed from E(t), the time-domain energy of frame t, and describes the energy variation over the 3 frames before and after frame t; the continuous-energy-difference variation var_D3 is the variance of D3_Energy over the n frames of the segment; and the normalized energy E_nor is computed from Evoice, the energy of the speech signal, and Enoise, the noise-floor energy of the silent portions.
Preferably, each fine-grained score in step S3 addresses one voice problem to be assessed, and the voice problems include the following: Q1: the overall perceived severity of deviance in a passage of voice; Q2: the voice sounds abnormal, irregular or unclear; Q3: audible breathiness or air leakage in the voice; Q4: phonation sounds excessively effortful; Q5: sudden uncontrolled pauses during phonation; Q6: sudden uncontrolled changes of fundamental frequency or pitch; Q7: rapid successive crackles during phonation; Q8: regular rhythmic changes of pitch or loudness; Q9: an obvious fundamental frequency or pitch inconsistent with the speaker's gender, age and educational background; Q10: a speaking volume obviously too loud or too soft for the speech content.
Preferably, the fine-grained score gives the severity of its voice problem, the severity being: normal, slight, moderate or serious.
The present invention also provides a pathological voice assessment system based on Chinese speech, comprising a voice input module, a voice assessment module using any of the methods above, and an assessment result output module. The voice input module receives real-time recordings or audio and transmits them to the voice assessment module; the voice assessment module assesses the real-time recording or audio and transmits the assessment result to the result output module; the result output module outputs the assessment result.
The beneficial effects of the invention are as follows: a pathological voice assessment method based on Chinese speech is provided which trains on a large-scale normal-voice database, classifies the segmented phone sequence according to the articulation characteristics of the different phonemes, designs different feature parameters for the different phone types, and finally obtains fine-grained scores for the voice problems as well as an objective overall score, providing an important reference for clinical diagnosis and rehabilitation.
Description of the drawings
Fig. 1 is a flow chart of the steps of the pathological voice assessment method based on Chinese speech in Embodiment 1 of the present invention.
Fig. 2 is a schematic demonstration of the pathological voice assessment method based on Chinese speech in Embodiment 1.
Fig. 3 shows the deep neural network-hidden Markov model Chinese speech recognition system in Embodiment 1.
Fig. 4 is a schematic diagram of the score-classifier method in Embodiment 1.
Fig. 5 is a histogram of the HiLo values of vowels in voices without roughness in Embodiment 1.
Fig. 6 is a histogram of the HiLo values of vowels in voices with serious roughness in Embodiment 1.
Fig. 7 is a histogram of the HiLo values of vowels in voices without breathiness in Embodiment 1.
Fig. 8 is a histogram of the HiLo values of vowels in voices with serious breathiness in Embodiment 1.
Fig. 9 is a histogram of the E_nor values of vowels in voices with normal volume in Embodiment 1.
Fig. 10 is a histogram of the E_nor values of vowels in voices with moderately or seriously low volume in Embodiment 1.
Fig. 11 is a schematic diagram of the pathological voice assessment system based on Chinese speech in Embodiment 3 of the present invention.
Specific embodiment
The present invention is described in detail below through specific embodiments with reference to the accompanying drawings, for a better understanding of the invention; the following embodiments, however, do not limit the scope of the invention. In addition, it should be noted that the figures provided with the following embodiments only illustrate the basic conception of the invention in a schematic way: they show only the components related to the invention rather than the actual number, shape and size of the components in implementation; in an actual implementation the shape, quantity and proportion of each component may vary arbitrarily, and the component layout may be more complex.
Embodiment 1
In the fields of voice signal processing and speech therapy, the objective analysis of pathological voice has attracted wide attention. For example, MDVP is a voice-signal analysis software system commonly used by speech therapists; it provides 22 acoustic signal parameters to describe voice quality, which therapists use as a basis for voice assessment. LingWAVES is another commercial software system that assists doctors in making clinical diagnoses of voice problems; it can perform basic acoustic analyses such as spectrography, pitch analysis and volume analysis. In recent years many researchers have performed acoustic analysis with the free software Praat, but Praat is not a tool designed specifically for pathological voice analysis and cannot judge the type or severity of a disorder.
From the signal-processing point of view, related research is mainly concerned with extracting effective feature parameters to describe the irregular characteristics of the pathological voice signal. The most common parameters are fundamental frequency jitter, amplitude shimmer and the harmonics-to-noise ratio (HNR). Jitter, shimmer and related parameters describe the time-domain variation of the voice signal in pitch and volume; because pathological voice is unstable, its jitter and shimmer values are usually larger. HNR calculates the energy ratio between the periodic component and the additive-noise component of the voice signal, and is usually lower for pathological voice. Other related acoustic parameters include the pitch period entropy (PPE), glottal quotient (GQ), voice turbulence index (VTI) and soft phonation index (SPI). On the other hand, Mel-frequency cepstral coefficients (MFCC), common features in speech recognition and speaker identification, have also been tried for the detection and classification of pathological voice, with artificial neural networks (ANN) and Gaussian mixture models (GMM) used as classifiers. As mentioned above, the analyzed voice signals have mainly been sustained long vowels, but with the development of machine learning in recent years, research on pathological voice signals has also begun to turn to continuous speech. These studies use connected-speech signals, but their acoustic analysis does not carefully distinguish the different linguistic content within the signal. Given a segment of continuous speech, the above feature parameters are usually extracted frame by frame first; the frame-level feature vectors are then analyzed statistically, for example by taking the mean, standard deviation or median of a feature over all frames, and the resulting statistic is regarded as a feature of the whole continuous speech signal and used as the input of a subsequent regression or classification problem.
Summarizing the existing research work, we believe it has the following shortcomings:
Common feature parameters such as jitter and shimmer mainly describe a relatively stable phonation state. They can be used to distinguish normal voice from pathological voice, but cannot further detect the type and degree of the disorder at a fine-grained level. Parameters like jitter and shimmer are better suited to extraction from stable long vowels, whereas natural continuous speech switches rapidly between phonemes and its acoustic features change constantly, for example the fundamental frequency describing vocal-fold vibration and the cepstral coefficients describing vocal-tract changes. These variations in continuous speech not only invalidate some traditional feature parameters (such as jitter and shimmer); statistics over all the frame-level features of continuous speech also cannot eliminate the influence of the varying linguistic content.
Most of this work uses statistical models, such as Gaussian mixture models, for detection and classification. When the amount of training data is sufficient and the training data match the actual test data, statistical methods can effectively and accurately describe the feature distributions of the different voice classes. But pathological voice has many causes and types, its acoustic feature distribution is irregular, and the training data of many classes are insufficient, so these statistical methods cannot obtain effective results.
Little research to date can quantify individual fine-grained aspects of a voice, such as roughness or breathiness. We believe that studying these fine-grained characteristics is of key importance for clinical treatment.
In view of the deficiencies of the prior art, this auxiliary voice assessment tool cuts the voice signal according to its linguistic content and designs or finds suitable feature parameters for the different classes of speech signal. Besides an overall evaluation of voice problems, the design also trains different evaluation models for different voice problems, such as roughness, glottal fry and strain, so as to score natural continuous speech on multiple voice-problem dimensions, both overall and fine-grained, providing an important reference for the diagnosis and treatment of voice problems.
As shown in Fig. 1, a pathological voice assessment method based on Chinese speech comprises steps 1-3, detailed as follows:
1. Use the Chinese automatic speech recognition system to automatically align the input continuous speech signal with its corresponding text and complete the phone segmentation.
As shown in Fig. 2, the Chinese automatic speech recognition system aligns the input continuous speech with its corresponding text in order to segment the continuous speech into phones. Although pathological voice must be recognized in actual use, no large-scale database of pathological continuous speech exists.
As shown in Fig. 3, the method therefore trains a deep neural network-hidden Markov model (DNN-HMM) Chinese speech recognition system on a large-scale normal-voice database. In Fig. 3, 1 is the acoustic feature vector Ot of the continuous speech signal and serves as the input of 2, the deep neural network (DNN); 3 is the posterior probability vector P(aj|Ot) output by the DNN, where aj denotes an HMM state of a phoneme; 4 is the HMM chain. To improve the recognition accuracy of the Chinese speech recognition system on pathological voice, speaker adaptive training (SAT) is used when training the acoustic model.
In clinical voice assessment and diagnosis, the speech therapist instructs the patient to produce different types of utterances, generally including sustained long vowels, reading a short passage, and an open-ended interview. Because the content of the read passage is known, for this part of the continuous speech the text can be used as input to the speech recognition system. Inside the recognizer, the Viterbi algorithm force-aligns the phone sequence of the text with the speech signal to complete the phone segmentation.
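Assuming the forced aligner emits (phone, start time, end time) triples, the subsequent phone-level cutting of the waveform can be sketched as follows; the data format and values are illustrative, not the recognizer's actual output format:

```python
def slice_by_alignment(samples, sample_rate, alignment):
    """Cut a waveform into per-phone segments using forced-alignment times.

    `alignment` is a list of (phone, start_sec, end_sec) tuples such as a
    Viterbi forced aligner might emit.
    """
    segments = []
    for phone, start, end in alignment:
        i, j = round(start * sample_rate), round(end * sample_rate)
        segments.append((phone, samples[i:j]))
    return segments

# 1-second toy signal at a 100 Hz "sample rate" with a 3-phone alignment.
samples = list(range(100))
alignment = [("sil", 0.0, 0.2), ("a", 0.2, 0.7), ("sh", 0.7, 1.0)]
segments = slice_by_alignment(samples, 100, alignment)
print([(p, len(s)) for p, s in segments])   # [('sil', 20), ('a', 50), ('sh', 30)]
```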
2. According to the time sequence of the phone segmentation, cut the continuous speech signal into vowels, voiceless consonants and voiced consonants, and then extract features.
Features of 4 sub-classes are extracted. First class: fundamental frequency jitter, amplitude shimmer, harmonics-to-noise ratio (HNR), pitch, energy intensity. Second class: voice breaks, unvoiced fraction. Third class: Mel-frequency cepstral coefficients (MFCC). Fourth class, self-designed parameters: high/low-frequency energy ratio HiLo; pitch variation var_F0; HNR variation var_HNR; continuous energy difference D3_Energy; its variation var_D3; and normalized energy E_nor. According to the phone segmentation time sequence, the continuous speech signal is cut into vowels, voiceless consonants and voiced consonants, and different feature parameters are extracted for these phone classes: the first, second and fourth classes from the vowels; the second, third and fourth classes from the voiced consonants; and the third and fourth classes from the voiceless consonants. The self-designed fourth-class parameters proposed by this method are described below:
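The per-phone-class feature routing described above can be expressed as a small lookup. The phone inventory below is a hypothetical simplification of a real Mandarin phone set, and the feature classes are numbered 1-4 as in the text:

```python
# Hypothetical, simplified phone inventory for illustration only;
# a real system would use the recognizer's Mandarin phone set.
VOWELS = {"a", "o", "e", "i", "u", "v"}
VOICELESS_CONSONANTS = {"p", "t", "k", "f", "s", "sh", "x", "h", "c", "ch", "q"}

def phone_class(phone):
    if phone in VOWELS:
        return "vowel"
    if phone in VOICELESS_CONSONANTS:
        return "voiceless_consonant"
    return "voiced_consonant"

# Feature classes (1-4) extracted per phone type, as listed in the text.
FEATURES_BY_CLASS = {
    "vowel": {1, 2, 4},
    "voiced_consonant": {2, 3, 4},
    "voiceless_consonant": {3, 4},
}

print(FEATURES_BY_CLASS[phone_class("a")])    # {1, 2, 4}
print(FEATURES_BY_CLASS[phone_class("sh")])   # {3, 4}
```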
(1) HiLo, the high/low-frequency energy ratio, where Ehi is the energy of the high band (above 6000 Hz), Elo is the energy of the low band (below 6000 Hz), and E0 is the background-noise energy. Experimental data show that for voice signals with roughness and breathiness, the HiLo value is higher.
(2) var_F0, the pitch variation, where F0 is the fundamental frequency of one frame, n is the total number of frames of a segment of continuous speech, and var is the var function of MATLAB.
(3) var_HNR, the HNR variation, where HNR is the harmonics-to-noise ratio and n is the total number of frames of a segment of continuous speech. For some pathological voice signals, not only is HNR lower, its variation is also larger, which is reflected in var_HNR.
(4) D3_Energy, the continuous energy difference, where E(t) is the time-domain energy of frame t. D3_Energy describes the variation of the speech-signal energy over the 3 frames before and after frame t.
(5) var_D3, the variation of D3_Energy, where n is the total number of frames of a segment of continuous speech.
(6) E_nor, the normalized energy, where Evoice is the energy of the speech signal and Enoise is the noise-floor energy of the silent parts. E_nor measures the speech-signal energy relative to the ambient noise and helps judge whether the speaking volume is too loud or too soft (Loudness High/Low).
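A minimal sketch of some of these self-designed parameters, under stated assumptions: var_F0 and var_HNR are taken as the sample variance over frames (mirroring MATLAB's var), D3_Energy is assumed to compare the summed energy of the 3 frames after frame t with the 3 frames before it (the source does not reproduce the exact formula), and E_nor is assumed to be a simple energy ratio:

```python
from statistics import variance

def var_of(values):
    """MATLAB-style var: sample variance over the n frames of a segment."""
    return variance(values)

def d3_energy(E, t):
    """Assumed form of the continuous energy difference at frame t:
    summed energy of the 3 frames after t minus the 3 frames before it."""
    return abs(sum(E[t + 1:t + 4]) - sum(E[t - 3:t]))

def e_nor(e_voice, e_noise):
    """Assumed form of the normalized energy: speech energy over the
    noise-floor energy of the silent parts."""
    return e_voice / e_noise

f0 = [200.0, 210.0, 190.0, 205.0]               # per-frame F0 of a toy segment
hnr = [12.0, 11.5, 12.5, 12.0]                  # per-frame harmonics-to-noise ratio
E = [1.0, 1.1, 0.9, 1.0, 3.0, 3.1, 2.9, 3.0]    # per-frame time-domain energy

print(var_of(f0))     # var_F0
print(var_of(hnr))    # var_HNR
d3 = [d3_energy(E, t) for t in range(3, len(E) - 3)]
print(var_of(d3))     # var_D3
print(e_nor(3.0, 0.5))  # E_nor
```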
3. The extracted features form a feature parameter set, which is screened by a feature selector and then input to a classifier to obtain the fine-grained scores for voice problems.
As shown in Fig. 2, there are 10 feature selectors, one for each of the 10 classes of voice problems to be assessed; see Table 1.
Table 1 List of voice problems
Correspondingly, there are 10 classifiers, each trained for one of the 10 classes of voice problems to be assessed. Each pair of feature selector and classifier constitutes one scoring model, obtained by training for one specific problem among the 10. The large feature parameter set output by the feature extraction module first passes through the feature selector; the screened parameters are input to the classifier, which finally produces the score for the corresponding voice problem. For feature selection, besides combining common algorithms such as LASSO (least absolute shrinkage and selection operator), mRMR (minimum redundancy maximum relevance) and LLBFS (local learning-based feature selection), the designer's experience in signal analysis of the various voice problems can also be consulted. Experiments show that specific voice problems manifest in specific acoustic parameters: for example, hoarseness increases the HiLo value of the voice signal, and tremor gives larger var_F0 and var_HNR. With reference to clinical speech therapists' evaluations of voice problems, the design uses 4-class classifiers whose output indicates that the voice is normal or has a slight, moderate or serious voice problem. Since this is a multi-class classification problem, the method chooses a multilayer perceptron (MLP, Multi-layer Perceptron) as the classifier. The voice data used to train the classifiers carry professional speech therapists' subjective scores for every voice problem, and these subjective scores serve as the reference standard for training the MLP.
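The selector-plus-classifier pairing can be illustrated with a deliberately simple selector: ranking features by their absolute correlation with the therapists' subjective score. This is only a stand-in for the LASSO/mRMR/LLBFS methods named above, and the data are toy values:

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def select_features(X, y, names, k):
    """Keep the k features most correlated (in absolute value) with the
    subjective score y -- a simple stand-in for LASSO/mRMR/LLBFS."""
    scores = []
    for j, name in enumerate(names):
        col = [row[j] for row in X]
        scores.append((abs(pearson(col, y)), name))
    return [name for _, name in sorted(scores, reverse=True)[:k]]

# Toy data: HiLo tracks the severity score, var_F0 here is just noise.
X = [[0.1, 5.0], [0.4, 4.0], [0.7, 6.0], [0.9, 5.0]]
y = [0, 1, 2, 3]
print(select_features(X, y, ["HiLo", "var_F0"], k=1))   # ['HiLo']
```

The retained parameters would then be fed to the MLP classifier for that voice problem.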
As shown in Fig. 4, the principle of each scoring classifier is illustrated with roughness as the example: for each input feature parameter vector, the MLP outputs a 4-dimensional one-hot vector indicating the severity of the roughness; see Table 2. The other 9 voice problems are assessed in the same way.
Table 2 Severity of a voice problem
0001 | Normal (no roughness problem) |
0010 | Slight (slight roughness) |
0100 | Moderate (moderate roughness) |
1000 | Serious (serious roughness) |
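Decoding the 4-dimensional one-hot output into a severity label, per Table 2, can be sketched as follows (labels shown for the roughness problem; the mapping is the same for the other problems):

```python
# One-hot codes from Table 2, highest-order bit = most severe.
SEVERITY = {
    (0, 0, 0, 1): "normal",
    (0, 0, 1, 0): "slight",
    (0, 1, 0, 0): "moderate",
    (1, 0, 0, 0): "serious",
}

def decode(one_hot):
    """Map the MLP's 4-dim one-hot output to a severity label."""
    return SEVERITY[tuple(one_hot)]

print(decode([0, 1, 0, 0]))   # moderate
```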
The effect of the self-designed parameters in experiments is illustrated below:
(1) About the HiLo value
Fig. 5 shows the histogram of the HiLo values of vowels in voices without roughness, and Fig. 6 the histogram for voices with serious roughness; in both, the abscissa is the HiLo value and the ordinate is the sample count. Likewise, Fig. 7 shows the histogram of the HiLo values of vowels in voices without breathiness, and Fig. 8 the histogram for voices with serious breathiness, with the same axes. Comparing Fig. 5 with Fig. 6 (no roughness versus serious roughness) and Fig. 7 with Fig. 8 (no breathiness versus serious breathiness) shows that the HiLo values of the pathological voices are higher.
(2) About the var_F0 value: experimental data show that voices with serious voice problems have higher var_F0 values; Tables 3 and 4 compare the var_F0 distributions of voices without tremor and with serious tremor.
Table 3 var_F0 values of vowels in voices without tremor
Voice serial number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
var_F0 (×10⁻³) | 0.91 | 31.11 | 22.32 | 33.34 | 0.16 | 6.93 | 38.89 | 34.83 | 12.34 |
Table 4 var_F0 values of vowels in voices with serious tremor
Voice serial number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
var_F0 (×10⁻³) | 91.47 | 150.47 | 241.33 | 49.23 | 20.37 | 39.53 | 69.50 | 119.30 | 63.11 |
(3) About var_HNR: the var_HNR value of some pathological voice signals is larger; Tables 5 and 6 compare the var_HNR distributions of voice signals without tremor and with serious tremor.
Table 5 var_HNR values of vowels in voices without tremor
Voice serial number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
var_HNR (×10⁻³) | 4.07 | 6.07 | 7.47 | 3.67 | 2.00 | 4.38 | 8.16 | 5.65 | 6.73 |
Table 6 var_HNR values of vowels in voices with serious tremor
Voice serial number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
var_HNR (×10⁻³) | 198.30 | 215.85 | 718.98 | 140.82 | 90.25 | 116.21 | 134.84 | 149.75 | 215.21 |
(4) About var_D3: pathological voice generally shows larger energy variation. D3_Energy is used to compute var_D3, which measures the energy stability of a segment of voice signal; the var_D3 value of pathological voice is usually higher, as shown by Tables 7 and 8, which compare the var_D3 distributions of voice signals without tremor and with serious tremor.
Table 7 var_D3 values of vowels in voices without tremor
Voice serial number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
var_D3 (×10⁻⁵) | 0.06 | 0.004 | 0.003 | 0.004 | 0.006 | 0.003 | 0.005 | 0.007 | 0.003 |
Table 8 var_D3 values of vowels in voices with serious tremor
Voice serial number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
var_D3 (×10⁻⁵) | 15 | 723 | 2218 | 5730 | 1217 | 825 | 62 | 840 | 306 |
(5) About E_nor: comparing Fig. 9 with Fig. 10 (normal volume versus low volume) shows that the E_nor value of low-volume voice is relatively low.
Fig. 9 shows the histogram of the E_nor values of vowels in voices with normal volume, and Fig. 10 the histogram for voices with moderately or seriously low volume; in both, the abscissa is the E_nor value and the ordinate is the sample count.
Embodiment 2
A specific embodiment of pathological voice assessment using the method of the present invention is described below:
(1) Establish a pathological voice database. The database was built jointly by hospital speech therapists, researchers and volunteer patients. The patient cohort comprises 230 native Chinese speakers, balanced in gender and age. In a relatively controlled environment, each patient's recording includes sustained vowels, a read-aloud short passage and question answering. The recordings are two-channel, 16-bit wav files with a 44.1 kHz sampling rate. After simple pre-processing (deleting the speech therapist's speech using speaker diarisation, fixed-gain amplification, and noise reduction), the collected voice signals were given to 48 professional speech therapists in total for subjective scoring, each patient's recording receiving subjective scores on the 10 voice problems. To improve the reliability of the subjective scores, the scores of all therapists were finally analyzed statistically: implausibly high or low scores were removed, and the median, variance and so on were computed, finally yielding each patient's subjective scores on the 10 aspects of his or her voice.
(2) Chinese speech recognition training: a Chinese automatic speech recognition system is trained on a large-scale speech database of normal voice.
(3) The read-passage recording of each patient in the pathological voice database, together with the corresponding text, is input to the Chinese automatic speech recognition system to obtain phoneme time-sequence alignment data.
(4) According to the phoneme time-sequence alignment data of step (3), characteristic parameters are extracted separately for the vowels, voiceless consonants, and voiced consonants in the read-passage speech signal.
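Step (4) can be sketched as a dispatch over the three phoneme classes, each receiving its own parameter groups (per the grouping of claim 6). The group names and the `extract_features` interface are illustrative assumptions, not the patent's literal code.

```python
# Hypothetical feature-group dispatch: each phoneme class gets its
# own parameter groups, mirroring the vowel/voiced/voiceless split.
FEATURE_GROUPS = {
    "vowel":               ["jitter_group", "interruption_group", "energy_group"],
    "voiced_consonant":    ["interruption_group", "mfcc_group", "energy_group"],
    "voiceless_consonant": ["mfcc_group", "energy_group"],
}

def extract_features(segments, extractors):
    """segments: list of (phoneme_class, frames) pairs from the
    phoneme segmentation; extractors maps a group name to a callable
    returning a list of parameter values for those frames."""
    feats = []
    for phone_class, frames in segments:
        vec = []
        for group in FEATURE_GROUPS[phone_class]:
            vec.extend(extractors[group](frames))
        feats.append(vec)
    return feats
```

The point of the dispatch is that a vowel segment never receives consonant-specific parameters and vice versa, which is the distinction the method draws between phoneme types.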
(5) The parameters extracted in step (4) are fed to 10 automatic scoring modules (10 feature selectors + 10 four-class classifiers), one module per voice problem. In the scoring-model training stage, the subjective scores obtained in step (1) serve as target values, and a least-mean-square (LMS) algorithm optimizes each scoring model.
(6) Once the automatic scoring models are optimized, the method is deployed on a computer system for clinical trials or scientific research. Besides the computing unit required by the models (an NVIDIA GPU), the test system needs a front-end recording device (e.g. a Shure omnidirectional microphone with a TASCAM 2x2 audio interface), an audio storage unit (computer memory), and a result display unit (LCD screen). At the start of a test, the subject, with the mouth a fixed distance (30 cm) from the microphone, first produces the sustained vowels /a/, /i/, /u/, then reads a short passage aloud and answers two questions. The recorded signal is first pre-processed; the read-passage portion then enters the computing system for phoneme segmentation, feature extraction, and automatic scoring, and the display unit finally outputs the assessment results for all 10 voice problems.
Embodiment 3
As shown in Fig. 5, the present invention also provides a pathological voice assessment system based on Chinese speech, comprising a voice input module, a voice assessment module using the method of Embodiment 1, and an assessment-result output module.
The voice input module receives a real-time recording or audio and transmits it to the voice assessment module; the voice assessment module assesses the real-time recording or audio and transmits the assessment result to the assessment-result output module; the result output module outputs the assessment result.
The method and system provided by the present invention have the following beneficial effects:
(1) phoneme segmentation of the input pathological continuous speech signal is based on automatic speech recognition technology, and the segmented voice sequence is classified by the articulatory character of the phonemes, e.g. vowel vs. consonant, voiced vs. voiceless;
(2) different characteristic parameters are designed for different voice sequences: the traditional jitter and shimmer parameters are used on vowels to reflect the regularity of vocal-fold vibration, while MFCC and HiLo parameters are used on voiceless sounds to reflect changes of the vocal-tract filter and the distribution of energy across frequency bands;
(3) a separate MLP-based scoring model is trained for each voice problem, realizing objective scoring of the pathological voice as a whole and of 10 specific voice problems such as rough voice, providing an important reference for clinical diagnosis and rehabilitation.
This method/system can perform automatic voice assessment on continuous speech of fixed text content. If a large-scale database of pathological continuous speech is built in the future, the method/system can be extended to natural continuous speech of arbitrary content. The prior art does not consider the variation of linguistic content within continuous speech and extracts the same characteristic parameters from all voice sequences; this method/system processes voice sequences of different content types separately, enabling a finer analysis of how a voice problem affects the production of different phonetic content and thereby offering guidance and suggestions for rehabilitation. Unlike the prior art's single evaluation of voice, this method/system assesses the voice along multiple dimensions, realizing judgment of both the type and the severity of the voice problem.
Where manpower permits, the phoneme segmentation module of this method/system may also use manual annotation, yielding phoneme time sequences of higher precision. If a large-scale database of pathological continuous speech is available, it can be used to train the automatic Chinese speech recognition system, producing a more reliable phoneme segmentation time sequence than a recognizer trained on normal voice. In the scoring module, sparse-estimation methods may also be used to quantitatively distinguish the category and severity of the pathological voice.
The above content is a further detailed description of the present invention in conjunction with specific preferred embodiments, and the specific implementation of the invention shall not be regarded as limited to these descriptions. For those of ordinary skill in the art to which the present invention belongs, several equivalent substitutions or obvious modifications of identical performance or use may be made without departing from the inventive concept, and all of them shall be regarded as falling within the protection scope of the present invention.
Claims (10)
1. A pathological voice assessment method based on Chinese speech, characterized by comprising the steps of:
S1: using a Chinese automatic speech recognition system to complete the automatic alignment of the input continuous voice signal with its corresponding text, thereby completing phoneme segmentation;
S2: according to the time series of the phoneme segmentation, cutting the continuous speech signal into vowels, voiceless consonants and voiced consonants, and then extracting features;
S3: forming a characteristic parameter set from the extracted features, screening the set with a feature selector, and then inputting it to a classifier to obtain a detailed score for each voice problem.
2. The pathological voice assessment method based on Chinese speech of claim 1, characterized in that the Chinese automatic speech recognition system is a Chinese speech recognition system based on a deep neural network-hidden Markov model trained on a normal voice database.
3. The pathological voice assessment method based on Chinese speech of claim 1, characterized in that the Chinese automatic speech recognition system uses speaker adaptation technology.
4. The pathological voice assessment method based on Chinese speech of claim 1, characterized in that the Chinese automatic speech recognition system uses the Viterbi algorithm to force-align the continuous voice signal with the phoneme sequence of its corresponding text, thereby completing the phoneme segmentation.
5. The pathological voice assessment method based on Chinese speech of claim 1, characterized in that the features in step S2 include:
a first class of features: fundamental-frequency jitter, amplitude shimmer, signal-to-noise ratio, pitch, energy intensity;
a second class of features: voice interruption, unvoiced-signal ratio;
a third class of features: Mel-frequency cepstral coefficients;
a fourth class of features: high-low frequency energy ratio, pitch variation, signal-to-noise-ratio variation, continuous energy difference, variation of the continuous energy difference, normalized energy.
6. The pathological voice assessment method based on Chinese speech of claim 5, characterized in that the parameters of the first, second and fourth classes of features are extracted for the vowels; the parameters of the second, third and fourth classes of features are extracted for the voiced consonants; and the parameters of the third and fourth classes of features are extracted for the voiceless consonants.
7. The pathological voice assessment method based on Chinese speech of claim 5, characterized in that, among the fourth class of features:
the high-low frequency energy ratio HiLo is calculated as HiLo = (E_hi − E_0) / (E_lo − E_0), where E_hi is the energy of the frequency band above 6000 Hz, E_lo is the energy of the frequency band below 6000 Hz, and E_0 is the background-noise energy;
the pitch variation var_F0 is calculated as var_F0 = var(F0(1), …, F0(n)), where F0(t) is the fundamental frequency of frame t, n is the total number of frames of a continuous speech segment, and var is the variance function (the var function in MATLAB);
the signal-to-noise-ratio variation var_HNR is calculated as var_HNR = var(HNR(1), …, HNR(n)), where HNR is the signal-to-noise ratio of a frame and n is the total number of frames of a continuous speech segment;
the continuous energy difference D3_Energy is calculated as D3(t) = E(t−1) − 2E(t) + E(t+1), where E(t) is the time-domain energy of frame t;
the variation of the continuous energy difference var_D3 is calculated as var_D3 = var(D3(2), …, D3(n−1)), where n is the total number of frames of a continuous speech segment;
the normalized energy E_nor is calculated as E_nor = E_voice / E_noise, where E_voice is the energy of the voice signal and E_noise is the noise-floor energy of the silent part.
8. The pathological voice assessment method based on Chinese speech of claim 1, characterized in that the detailed score in step S3 is directed at the voice problems to be assessed, which include the following:
Q1: the overall severity of the pathology perceived in a passage of voice;
Q2: the voice sounds rough, irregular or unclear;
Q3: breathiness or air leakage is perceived in the voice;
Q4: phonation sounds overly strained;
Q5: sudden uncontrolled pauses occur during phonation;
Q6: sudden uncontrolled changes of fundamental frequency or tone are perceived;
Q7: rapid consecutive crackles occur during phonation;
Q8: regular rhythmic changes of tone or loudness are perceived;
Q9: the fundamental frequency or tone of the voice is obviously inconsistent with the speaker's gender, age, or educational background;
Q10: the speaking volume is obviously inconsistent with the speech content, being too loud or too soft.
9. The pathological voice assessment method based on Chinese speech of claim 8, characterized in that the detailed score gives the severity of each voice problem, the severity being: normal or slight, moderate, or serious.
10. A pathological voice assessment system based on Chinese speech, characterized by comprising a voice input module, a voice assessment module using the method of any one of claims 1-4, and an assessment-result output module, wherein:
the voice input module receives a real-time recording or audio and transmits it to the voice assessment module;
the voice assessment module assesses the real-time recording or audio and transmits the assessment result to the assessment-result output module;
the assessment-result output module outputs the assessment result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711010103.6A CN109727608B (en) | 2017-10-25 | 2017-10-25 | Chinese speech-based ill voice evaluation system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109727608A true CN109727608A (en) | 2019-05-07 |
CN109727608B CN109727608B (en) | 2020-07-24 |
Family
ID=66290570
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711010103.6A Active CN109727608B (en) | 2017-10-25 | 2017-10-25 | Chinese speech-based ill voice evaluation system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109727608B (en) |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5185841B2 (en) * | 2008-01-10 | 2013-04-17 | 旭化成株式会社 | User voice diagnosis apparatus and user voice diagnosis method |
CN101740025A (en) * | 2008-11-21 | 2010-06-16 | 三星电子株式会社 | Singing score evaluation method and karaoke apparatus using the same |
CN102663925A (en) * | 2012-05-09 | 2012-09-12 | 长春大学 | Method and system for tongue training for language training of hearing-impaired children |
CN103093759A (en) * | 2013-01-16 | 2013-05-08 | 东北大学 | Device and method of voice detection and evaluation based on mobile terminal |
CN103093759B (en) * | 2013-01-16 | 2014-12-10 | 东北大学 | Device and method of voice detection and evaluation based on mobile terminal |
CN103730130A (en) * | 2013-12-20 | 2014-04-16 | 中国科学院深圳先进技术研究院 | Detection method and system for pathological voice |
CN103928023A (en) * | 2014-04-29 | 2014-07-16 | 广东外语外贸大学 | Voice scoring method and system |
CN106531189A (en) * | 2016-12-20 | 2017-03-22 | 潘奕君 | Intelligent spoken language evaluation method |
Non-Patent Citations (1)
Title |
---|
TAN LEE ET AL.: "Automatic speech recognition for acoustical analysis and assessment of cantonese pathological voice and speech", 《 2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)》 * |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110876609A (en) * | 2019-07-01 | 2020-03-13 | 上海慧敏医疗器械有限公司 | Voice treatment instrument and method for frequency band energy concentration rate measurement and audio-visual feedback |
CN111358428A (en) * | 2020-01-20 | 2020-07-03 | 书丸子(北京)科技有限公司 | Observation capability test evaluation method and device |
CN114373452A (en) * | 2020-10-15 | 2022-04-19 | 华东师范大学 | Voice abnormity identification and evaluation method and system based on deep learning |
CN112687297A (en) * | 2020-12-21 | 2021-04-20 | 安徽科大讯飞医疗信息技术有限公司 | Voice state classification method and device, electronic equipment and storage medium |
CN112687297B (en) * | 2020-12-21 | 2023-03-28 | 安徽讯飞医疗股份有限公司 | Voice state classification method and device, electronic equipment and storage medium |
CN112735480A (en) * | 2021-01-25 | 2021-04-30 | 杭州电子科技大学 | Vocal cord pathological change detection device based on neural network |
CN113143217A (en) * | 2021-05-07 | 2021-07-23 | 张育青 | Method and system for predicting human body sound production effect |
CN113143217B (en) * | 2021-05-07 | 2024-01-26 | 张育青 | Method and system for predicting sounding effect of human body |
CN113571088A (en) * | 2021-07-27 | 2021-10-29 | 上海交通大学医学院附属第九人民医院 | Difficult airway assessment method and device based on deep learning voiceprint recognition |
CN113571088B (en) * | 2021-07-27 | 2023-10-03 | 上海交通大学医学院附属第九人民医院 | Difficult airway assessment method and device based on deep learning voiceprint recognition |
CN116110437A (en) * | 2023-04-14 | 2023-05-12 | 天津大学 | Pathological voice quality evaluation method based on fusion of voice characteristics and speaker characteristics |
CN116110437B (en) * | 2023-04-14 | 2023-06-13 | 天津大学 | Pathological voice quality evaluation method based on fusion of voice characteristics and speaker characteristics |
CN117976141A (en) * | 2024-04-01 | 2024-05-03 | 四川大学华西医院 | Voice rehabilitation analysis method and system based on acoustic analysis algorithm |
Also Published As
Publication number | Publication date |
---|---|
CN109727608B (en) | 2020-07-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109727608A (en) | A kind of ill voice appraisal procedure based on Chinese speech | |
Cernak et al. | Characterisation of voice quality of Parkinson’s disease using differential phonological posterior features | |
CN106073706B (en) | A kind of customized information and audio data analysis method and system towards Mini-mental Status Examination | |
US20160351074A1 (en) | Systems and methods for expressive language, developmental disorder, and emotion assessment, and contextual feedback | |
Muhammad et al. | Convergence of artificial intelligence and internet of things in smart healthcare: a case study of voice pathology detection | |
CN103996155A (en) | Intelligent interaction and psychological comfort robot service system | |
Borsky et al. | Modal and nonmodal voice quality classification using acoustic and electroglottographic features | |
CN110223688A (en) | A kind of self-evaluating system of compressed sensing based hepatolenticular degeneration disfluency | |
CN114916921A (en) | Rapid speech cognition assessment method and device | |
Zhang et al. | Adjustment of cue weighting in speech by speakers and listeners: Evidence from amplitude and duration modifications of Mandarin Chinese tone | |
Alshehri et al. | Detection and Diagnosis of Learning Disabilities in Children of Saudi Arabia with Artificial Intelligence | |
Gu et al. | Disordered speech assessment using automatic methods based on quantitative measures | |
Deepa et al. | Speech technology in healthcare | |
Padmini et al. | Age-Based Automatic Voice Conversion Using Blood Relation for Voice Impaired. | |
He | Stress and emotion recognition in natural speech in the work and family environments | |
Sahoo et al. | Detection of speech-based physical load using transfer learning approach | |
CN111210845A (en) | Pathological voice detection device based on improved autocorrelation characteristics | |
CN110211566A (en) | A kind of classification method of compressed sensing based hepatolenticular degeneration disfluency | |
Ferrat et al. | A study of sounds produced by Algerian esophageal speakers | |
Liao | Analysis of Rehabilitation Occupational Therapy Techniques Based on Instrumental Music Chinese Tonal Language Spectrogram Analysis | |
Duan et al. | An English pronunciation and intonation evaluation method based on the DTW algorithm | |
Li | Modular design of English pronunciation proficiency evaluation system based on Speech Recognition Technology | |
Hanani et al. | English speaking proficiency assessment using speech and electroencephalography signals | |
Xu et al. | Application of Multimodal NLP Instruction Combined with Speech Recognition in Oral English Practice | |
Zainon et al. | Speech quality based on Arabic pronunciation using MFCC and LDA: Investigating the emphatic consonants |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||