CN100370519C

CN100370519C - Method and system for reinforcing electronic guttural sound

Info

Publication number: CN100370519C
Application number: CNB2005100961870A
Authority: CN
Inventors: 万明习; 赵钦; 刘汉军; 王素品; 王卫波
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2005-10-17
Filing date: 2005-10-17
Publication date: 2008-02-20
Anticipated expiration: 2025-10-17
Also published as: CN1776809A

Abstract

The present invention discloses a method and system for enhancing electronic larynx speech. The system comprises a speech data acquisition module, an A/D conversion module, a DSP chip, a power-on reset control circuit, a D/A conversion module, a filtering and shaping device, a power amplification device, a speech output module, an extended program memory module, and an extended data memory module. Electronic larynx speech is denoised and enhanced by power spectral subtraction, which eliminates the background noise and random noise in the speech and improves its signal-to-noise ratio, subjective intelligibility, and pleasantness. The signal processing algorithm has been verified by software simulation and effectively improves both the objective and the subjective quality of electronic larynx speech. After processing by the method, the signal-to-noise ratio of electronic larynx speech improves by 5 to 10 dB and the MOS score improves by 1 to 2 points. Combined with the hardware system, the algorithm can be applied to fields such as speech communication and speech recognition with good results.

Description

A method and system for enhancing electronic larynx speech

Technical field

The invention belongs to the technical field of pathological speech reconstruction and speech enhancement, and in particular relates to a method and system for enhancing electronic larynx speech.

Background technology

The electronic larynx is one of the most commonly used speech aids for patients with laryngeal disease, such as those who have undergone laryngectomy. It is simple to use and easy to master, but because of the characteristics of the device itself and differences in user skill, electronic larynx speech often contains a large amount of background noise. The problem is worse in noisy environments and seriously degrades the intelligibility and pleasantness of the speech.

Current research abroad concentrates mainly on removing the periodic noise caused by energy leakage from the electronic larynx, for example by adaptive filtering. These algorithms, however, consider only a user in a quiet environment and ignore the background noise that may be introduced in a real communication environment. The present design performs more comprehensive noise removal and speech enhancement, addressing both this periodic noise and the background noise of the environment.

Summary of the invention

In view of the defects and deficiencies of the prior art described above, the object of the invention is to provide a method and system for enhancing electronic larynx speech. The method targets the stationary periodic background noise and the high-energy random noise characteristic of electronic larynx speech, and denoises and enhances the speech at the same time.

The invention is aimed at a system for practical use: in certain specific applications it improves the quality of the speech reconstructed by electronic larynx users, using suitable signal processing algorithms to filter out the radiated noise and background noise in electronic larynx speech and thereby enhance it.

To achieve the above object, the invention adopts a DSP-based hardware system and uses power spectral subtraction to denoise and enhance electronic larynx speech, eliminating the periodic background noise and random noise in the speech and improving its signal-to-noise ratio, subjective intelligibility, and pleasantness. The steps for enhancing electronic larynx speech by power spectral subtraction are as follows:

1) Acquire the electronic larynx speech signal and digitize it;

2) Perform noise estimation to obtain a noise estimate one frame long (about 20 to 40 ms);

3) Apply an FFT to this noise estimate to obtain its spectrum, and from it the noise power spectrum estimate;

4) Divide the digitized speech into frames, with an inter-frame overlap of 0% to 75%;

5) Apply an FFT to each speech frame to obtain its spectrum, and from it the power spectrum of the noisy speech;

6) Subtract the noise power spectrum estimate from the noisy speech power spectrum to obtain the clean speech power spectrum estimate;

7) Use the phase of the noisy speech spectrum as the estimate of the clean speech spectral phase, recover the spectrum from the power spectrum, and apply an IFFT to return to the time domain, obtaining the clean speech estimate.
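The framing in step 4 can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function name `split_frames`, the default frame length of 256 samples, and the default 50% overlap are choices made here, and no analysis window is applied because the text does not specify one.

```python
import numpy as np

def split_frames(signal, frame_len=256, overlap=0.5):
    """Split a 1-D signal into frames with a fractional overlap (0 to 0.75)."""
    if not 0.0 <= overlap <= 0.75:
        raise ValueError("overlap must lie between 0 and 0.75")
    # Hop size: number of samples between the starts of consecutive frames.
    hop = int(round(frame_len * (1.0 - overlap)))
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])

# Example: 1024 samples, 256-sample frames, 50% overlap (hop of 128 samples).
x = np.arange(1024, dtype=float)
frames = split_frames(x, frame_len=256, overlap=0.5)
```

With 0% overlap the frames simply tile the signal; larger overlaps produce more frames per second and therefore more FFTs per second of speech.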

The complete data processing procedure can be expressed as follows:

Let y(t) = s(t) + n(t), where y(t) is the noisy speech, s(t) is the clean speech, and n(t) = n_1(t) + n_2(t), with n_1(t) the periodic background noise and n_2(t) the random noise. This model rests on the assumption that the speech and the noise are uncorrelated and therefore additive.

The spectrum of y(t) is then:

Y (ω) = Y_{R} (ω) + i Y_{I} (ω) = FFT [y (t)]

where Y(ω) is the spectrum of y(t), and Y_R(ω) and Y_I(ω) are the real and imaginary parts of Y(ω), respectively.

Correspondingly, the spectrum estimate of the periodic background noise and random noise is:

N (ω) = N_{R} (ω) + i N_{I} (ω) = FFT [n (t)]

The power spectrum of y(t) is then:

P_{Y} (ω) = Y_{R}^{2} (ω) + Y_{I}^{2} (ω)

The power spectrum estimate of the periodic background noise and random noise is:

P_{N} (ω) = N_{R}^{2} (ω) + N_{I}^{2} (ω)

The clean speech power spectrum estimate is:

P_{S} (ω) = P_{Y} (ω) - P_{N} (ω)

The clean speech spectrum estimate is:

\hat{S} (ω) = P_{S} (ω) \times (\frac{Y_{R} (ω)}{\sqrt{Y_{R}^{2} (ω) + Y_{I}^{2} (ω)}} + \frac{i Y_{I} (ω)}{\sqrt{Y_{R}^{2} (ω) + Y_{I}^{2} (ω)}})

The clean speech estimate is:

\hat{s} (t) = IFFT [\hat{S} (ω)]
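The processing of a single frame can be sketched in NumPy as follows. This is a minimal sketch, not the patented DSP implementation. Two points are assumptions made here: negative values of P_Y - P_N are floored at zero, which the text does not discuss, and the square root of the power spectrum estimate is taken before the noisy phase is applied, as in the data-flow description of the embodiment. The function name `spectral_subtract_frame` and the synthetic test signal are hypothetical.

```python
import numpy as np

def spectral_subtract_frame(noisy, noise_power):
    """Power spectral subtraction on one frame, reusing the noisy phase."""
    Y = np.fft.fft(noisy)
    P_Y = Y.real ** 2 + Y.imag ** 2              # power spectrum of noisy speech
    # Assumption: clamp negative differences to zero before the square root.
    P_S = np.maximum(P_Y - noise_power, 0.0)     # clean speech power estimate
    # Unit-modulus phase term Y / |Y|, guarded against division by zero.
    phase = Y / np.maximum(np.abs(Y), 1e-12)
    S_hat = np.sqrt(P_S) * phase                 # clean speech spectrum estimate
    # Keep the real part; the imaginary part is zero when noise_power
    # comes from a real-valued noise frame.
    return np.fft.ifft(S_hat).real

# Synthetic example: a 500 Hz tone at 8 kHz plus white noise (illustrative data).
rng = np.random.default_rng(0)
t = np.arange(256) / 8000.0
clean = np.sin(2 * np.pi * 500 * t)
noise = 0.3 * rng.standard_normal(256)
noise_power = np.abs(np.fft.fft(noise)) ** 2
enhanced = spectral_subtract_frame(clean + noise, noise_power)
```

With a noise power estimate of zero the function returns the input frame unchanged, which is a convenient sanity check.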

The electronic larynx speech enhancement system implementing the above method is characterized in that it comprises a speech data acquisition module, an A/D conversion module, a DSP chip, a power-on reset control circuit, a D/A conversion module, a filtering and shaping, power amplification, and speech output module, an extended program memory module, and an extended data memory module. The speech data acquisition module is connected to the DSP chip through the A/D conversion module; the DSP chip is also connected to the D/A conversion module, the extended program memory module, and the extended data memory module; the power-on reset control circuit is connected to the DSP chip and is responsible for supplying power to the whole system; and the D/A conversion module is also connected to the filtering and shaping, power amplification, and speech output module.

Software simulation shows that the signal processing algorithm used in the method effectively improves the objective and subjective quality of electronic larynx speech. In general, after electronic larynx speech is processed by this method, its signal-to-noise ratio improves by 5 to 10 dB and its MOS score by 1 to 2 points. Combined with the hardware system, the algorithm can be applied to the communication and recognition of pathological and reconstructed speech with good results.
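For reference, a signal-to-noise ratio in decibels can be computed as below when a clean reference signal is available. The text does not state how the reported figures were measured, so this is only the textbook definition, and the function name `snr_db` is a hypothetical choice.

```python
import numpy as np

def snr_db(clean, observed):
    """SNR in dB of `observed` relative to a known clean reference."""
    residual = observed - clean              # everything that is not the reference
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(residual ** 2))

# Illustrative data: a constant reference with a small constant error.
reference = np.ones(100)
value = snr_db(reference, reference + 0.1)
```

An improvement of 5 to 10 dB corresponds to the residual noise power falling by a factor of roughly 3 to 10.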

Description of drawings

Fig. 1 is a structural diagram of the whole electronic larynx speech enhancement system, in which the reference numerals denote: 1, speech data acquisition module; 2, A/D conversion module (TLC320AD50C); 3, DSP program control and data processing module (TMS320C5410); 4, power-on reset control circuit; 5, D/A conversion module (TLC320AD50C); 6, filtering and shaping, power amplification, and speech output module; 7, emulator; 8, extended program memory module; 9, extended data memory module.

Fig. 2 is the overall program flowchart of the system software.

Fig. 3 is a block diagram of the algorithm used by the system.

Fig. 4 compares electronic larynx speech waveforms before and after the system: (a) is the periodic background noise and random noise acquired by the system, (b) is the electronic larynx speech signal before processing, and (c) is the electronic larynx speech signal after processing.

The invention is described in further detail below with reference to the drawings.

Embodiment

The invention adopts a DSP-based hardware system and uses power spectral subtraction to denoise and enhance electronic larynx speech, eliminating the background noise and random noise in the speech and improving its signal-to-noise ratio, subjective intelligibility, and pleasantness.

Power spectral subtraction for electronic larynx speech enhancement rests on two assumptions: first, the periodic background noise, the random noise, and the speech are all short-time stationary; second, the periodic background noise and the random noise are uncorrelated with the speech. When both assumptions hold, computing the power spectrum of the noisy electronic larynx speech and subtracting the estimated noise power spectrum gives an estimate of the clean speech power spectrum, and transforming the recovered spectrum back to the time domain gives an estimate of the clean speech. Because the human ear is insensitive to speech phase, the spectral phase of the noisy speech can be used as the estimate of the clean speech spectral phase.
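The role of the second assumption can be made explicit with a short derivation. The symbol S(ω) for the clean speech spectrum and the expectation operator E are introduced here for illustration; the derivation is the standard argument for spectral subtraction rather than text from the patent.

```latex
\begin{aligned}
Y(\omega) &= S(\omega) + N(\omega) \\
|Y(\omega)|^{2} &= |S(\omega)|^{2} + |N(\omega)|^{2}
  + 2\,\mathrm{Re}\{S(\omega)\,N^{*}(\omega)\} \\
E\{|Y(\omega)|^{2}\} &= E\{|S(\omega)|^{2}\} + E\{|N(\omega)|^{2}\}
  \quad \text{when } E\{S(\omega)\,N^{*}(\omega)\} = 0 \\
P_{S}(\omega) &\approx P_{Y}(\omega) - P_{N}(\omega)
\end{aligned}
```

The cross term vanishes only on average, so within a single frame the subtraction is approximate. Short-time stationarity is what allows a noise power spectrum measured earlier to stand in for the noise in later frames.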

The steps for enhancing electronic larynx speech by power spectral subtraction are as follows:

4) Divide the digitized speech into frames, with an inter-frame overlap of 0% to 75%;

The complete data processing procedure can be expressed as follows:

The spectrum of y(t) is then:

Y (ω) = Y_{R} (ω) + i Y_{I} (ω) = FFT [y (t)]

N (ω) = N_{R} (ω) + i N_{I} (ω) = FFT [n (t)]

The power spectrum of y(t) is then:

P_{Y} (ω) = Y_{R}^{2} (ω) + Y_{I}^{2} (ω)

The power spectrum estimate of the periodic background noise and random noise is:

P_{N} (ω) = N_{R}^{2} (ω) + N_{I}^{2} (ω)

The clean speech power spectrum estimate is:

P_{S} (ω) = P_{Y} (ω) - P_{N} (ω)

The clean speech spectrum estimate is:

\hat{S} (ω) = P_{S} (ω) \times (\frac{Y_{R} (ω)}{\sqrt{Y_{R}^{2} (ω) + Y_{I}^{2} (ω)}} + \frac{i Y_{I} (ω)}{\sqrt{Y_{R}^{2} (ω) + Y_{I}^{2} (ω)}})

The clean speech estimate is:

\hat{s} (t) = IFFT [\hat{S} (ω)]

Fig. 1 is a schematic diagram of a DSP-based electronic larynx speech enhancement system implemented according to the above method. The whole system consists of speech data acquisition module 1, A/D conversion module 2, DSP chip 3, power-on reset control circuit 4, D/A conversion module 5, filtering and shaping, power amplification, and speech output module 6, extended program memory module 8, and extended data memory module 9.

Speech data acquisition module 1 is connected to DSP chip 3 through A/D conversion module 2; DSP chip 3 communicates with D/A conversion module 5, extended program memory module 8, and extended data memory module 9; power-on reset control circuit 4 is connected to DSP chip 3, A/D conversion module 2, and D/A conversion module 5; and D/A conversion module 5 is also connected to filtering and shaping, power amplification, and speech output module 6.

The DSP chip, A/D conversion module, D/A conversion module, power-on reset control circuit, and the required external interfaces are integrated on a single circuit board. The DSP chip is the TMS320C5410 from TI, which has a maximum clock frequency of 100 MHz and 64k × 16-bit internal program RAM.

The peripheral circuits of DSP chip 3 comprise:

1. External program RAM of 64k × 16 bit; the off-chip program RAM needs only one wait state when running at 100 MHz;

2. Extended program memory module 8 and extended data memory module 9, implemented as 128k × 8-bit off-chip Flash memory;

3. A/D conversion module 2 and D/A conversion module 5. Both use the TLC320AD50C, with a dynamic range of 88 dB, a signal-to-noise ratio of 89 dB, a maximum sampling rate of 22.05 kHz, 16-bit sampling precision, and RCA connectors for analog signal input and output;

4. Power-on reset control circuit 4. The whole system runs from a single +5 V supply; it is powered over USB when connected to a host computer for debugging and by an external DC adapter in normal use;

5. An external general-purpose audio interface, which can connect to general speech equipment such as microphones; communication devices such as telephones or intercoms can also be integrated directly with the system for use by electronic larynx users;

6. DSP chip 3 can also be connected to emulator 7 through the JTAG interface, so that programs and data can be updated in real time during system debugging and function expansion.

The acquisition rate for the electronic larynx speech signal is set to 8 kHz, far below the DSP operating frequency. The serial interface MCBSP0 of the DSP chip is therefore used to connect to the A/D and D/A conversion modules, which simplifies the circuit and avoids the interference caused by high-frequency signal lines running too close together.
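The gap between the two rates can be put into numbers with a small calculation. It uses only figures given in the text (100 MHz maximum clock, 8 kHz sampling, 256-sample frames); the variable names are illustrative, and clock cycles are not the same as executed instructions, so the result is an upper bound on the available budget rather than a measured figure.

```python
DSP_CLOCK_HZ = 100_000_000   # maximum clock frequency of the TMS320C5410
SAMPLE_RATE_HZ = 8_000       # speech acquisition rate
FRAME_LEN = 256              # samples per frame

# Clock cycles that elapse between two consecutive samples.
cycles_per_sample = DSP_CLOCK_HZ // SAMPLE_RATE_HZ

# Clock cycles that elapse while one full frame is being acquired.
cycles_per_frame = cycles_per_sample * FRAME_LEN
```

At these rates, 12,500 clock cycles pass between samples and 3.2 million pass per frame, which is why a serial link to the converters is sufficient.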

While guaranteeing normal use and extensibility, the system integrates and miniaturizes its modules as far as possible and avoids communicating with a host computer during standalone use, so that the portability and ease of operation of a DSP-based system are fully exploited.

Spectral subtraction is a mature speech enhancement algorithm, but adapting and refining it for the particular characteristics of electronic larynx speech, and applying it systematically in practice, is a new problem.

When an electronic larynx user talks with others face to face, the listener can draw on the user's lip movements, facial expression, and other cues to understand the speech, even though its quality is low. When the user transmits speech through electronic equipment such as a telephone, intercom, or microphone, however, a speech signal heavily corrupted by noise is likely to cause misunderstanding. The invention is therefore aimed mainly at these applications: the system is connected to a telephone (or other electronic equipment) to improve the quality of the transmitted speech signal and so achieve higher intelligibility and pleasantness.

To process speech digitally, the speech must first be digitized, which is why the system uses an A/D conversion module. Because the invention requires real-time processing, signal processing and program control are carried out by a DSP chip to achieve the required processing speed and precision. Because speech is sampled at a fairly high rate to preserve its precision, the data volume is considerable, and external memory must be added to store the acquired and processed speech data. Finally, the processed speech signal passes through D/A conversion, filtering and shaping, and power amplification to yield the final output speech. The entire program is stored in the system and runs automatically after power-on.

The system has demanding real-time requirements. In general, the interval between speech input and output should not exceed 0.5 s, or the system becomes inconvenient to use. This places high demands on the processing speed of each component and on the efficiency of the algorithm. To guarantee real-time operation, the speech data acquisition module samples at 8 kHz and each speech frame contains 256 samples. An FFT and an IFFT are applied to these 256 samples and the result is output through the D/A converter; the total system delay and processing time must be strictly controlled.

The data flow of the whole system is as follows. The speech data acquisition module samples the speech at 8 kHz with 16-bit precision. Once 256 samples have been obtained, they are passed through the A/D conversion module to the DSP, which computes the FFT, records the phase angle of the result, and sums the squares of the real and imaginary parts to obtain the power spectrum of the 256-point frame. The stored noise power spectrum is then subtracted, and the square root of the resulting clean speech power spectrum estimate is combined with the recorded noisy speech phase angle to give the clean speech spectrum estimate. An IFFT then yields the clean speech estimate, which is output through filtering and shaping and D/A conversion, with power amplification applied as required.

According to this principle, the system must first acquire a segment of periodic background noise and random noise, process it, store it, and use it in later computation. Specifically, the first 0.5 s or so after power-on is the noise acquisition period. During this time the user should hold the telephone handset in its normal position of use and keep the electronic larynx switched on without speaking. Electronic larynx users generally do not speak immediately after switching the device on, so this noise acquisition period is acceptable and does not interfere with normal use. The user need only make sure that the electronic larynx is switched on when the system starts.
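One way to turn the start-up interval into a stored noise power spectrum is sketched below. Averaging the power spectra of all complete frames in the interval is an assumption made here for robustness; the method as described works from a noise estimate one frame long. The function name `estimate_noise_power` is hypothetical.

```python
import numpy as np

def estimate_noise_power(noise, frame_len=256):
    """Average the power spectra of all complete frames in a noise-only segment."""
    n_frames = len(noise) // frame_len
    if n_frames == 0:
        raise ValueError("noise segment is shorter than one frame")
    # Discard the incomplete tail and arrange the rest as (n_frames, frame_len).
    frames = np.reshape(noise[: n_frames * frame_len], (n_frames, frame_len))
    spectra = np.fft.fft(frames, axis=1)
    # Assumption: average across frames; a single frame would also work.
    return np.mean(spectra.real ** 2 + spectra.imag ** 2, axis=0)

# About 0.5 s at 8 kHz is 4000 samples, i.e. 15 complete 256-sample frames.
rng = np.random.default_rng(0)
startup_noise = 0.1 * rng.standard_normal(4000)   # illustrative data
noise_power = estimate_noise_power(startup_noise)
```

The resulting array has one value per FFT bin and can be stored once, then subtracted from the power spectrum of every subsequent speech frame.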

The working process of the system follows the program flow shown in Fig. 2 and is as follows. After power-on, the 128k × 8-bit Flash ROM is mapped into data space and used to load the program automatically at startup (boot loading). After the program reset completes, the main program starts and begins acquiring noise data, which takes about 0.5 s. The user should therefore place the electronic larynx in the correct position and switch it on before the system starts, but should not speak until about 0.5 s after startup, and then speak with the electronic larynx normally. During this process, the channel of the communication device (telephone, microphone, intercom, and so on) should be kept clear as far as possible, and the relative position of the microphone and the electronic larynx should be kept as fixed as possible, so that the acquired electronic larynx noise is stable and suitable for enhancing the speech that follows. The acquired noise undergoes A/D conversion, is transferred to the DSP through the serial port, and is stored as retained data for use in the subsequent speech enhancement steps.

After the noise acquisition step ends, the system automatically enters the speech enhancement process. The program now repeatedly executes the speech acquisition part of the main program: the speech acquisition module samples continuously at 8 kHz, the A/D module converts the acquired data into digital signals, and these are written in sequence into the signal acquisition buffer. When 256 samples have been collected and the signal acquisition buffer is full, an interrupt is generated and the system enters the interrupt handler automatically. The frame of speech in the signal acquisition buffer is sent to the DSP through the general-purpose serial port MCBSP0 and the spectral subtraction algorithm is executed: the acquired frame of noisy electronic larynx speech (256 points) and the previously acquired noise signal are fed into the algorithm, and after spectral transformation and the other processing steps a 256-point clean electronic larynx speech estimate is obtained, at which point the interrupt routine ends. This signal is then sent through MCBSP0 to the data transmit buffer and on to the D/A module, where it is converted into an analog speech signal and output at 8 kHz. The data processing follows the signal processing flow shown in Fig. 3.
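The buffer-then-process behaviour can be mimicked in software. The class below is a hypothetical simulation, not DSP firmware: a Python method call stands in for the buffer-full interrupt, lists stand in for the acquisition and transmit buffers, and negative power differences are floored at zero as an added assumption.

```python
import numpy as np

class FrameProcessor:
    """Collects samples and runs spectral subtraction each time a frame fills."""

    def __init__(self, noise_power, frame_len=256):
        self.noise_power = noise_power
        self.frame_len = frame_len
        self.buffer = []    # stands in for the signal acquisition buffer
        self.output = []    # stands in for the data transmit buffer

    def push_sample(self, sample):
        self.buffer.append(sample)
        if len(self.buffer) == self.frame_len:
            self._on_frame_ready()   # plays the role of the buffer-full interrupt

    def _on_frame_ready(self):
        frame = np.asarray(self.buffer, dtype=float)
        self.buffer.clear()
        Y = np.fft.fft(frame)
        # Assumption: floor negative power differences at zero.
        P_S = np.maximum(np.abs(Y) ** 2 - self.noise_power, 0.0)
        phase = np.exp(1j * np.angle(Y))          # noisy speech phase
        self.output.extend(np.fft.ifft(np.sqrt(P_S) * phase).real)

# Feed 600 samples: two full frames are processed and 88 samples stay buffered.
processor = FrameProcessor(noise_power=np.zeros(256))
samples = np.sin(np.arange(600) * 0.1)
for value in samples:
    processor.push_sample(value)
```

Output appears only in whole frames, so the first processed sample becomes available one frame length after the first input sample.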

A frame can be processed as soon as it has been transferred completely into the data acquisition buffer, and the signal processing is carried out between two successive speech sample acquisitions. The delay of the whole acquisition, processing, and transmission chain is therefore about 256/8000 = 0.032 s. This delay has no effect on normal conversation and meets the system's real-time requirement.
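The timing figures quoted in the description can be checked with a few lines of arithmetic. All numbers come from the text; the variable names are illustrative, and the calculation covers buffering delay only, not the processing time itself.

```python
FRAME_LEN = 256           # samples per frame
SAMPLE_RATE_HZ = 8000     # sampling rate
MAX_DELAY_S = 0.5         # stated upper limit on input-to-output delay
NOISE_WINDOW_S = 0.5      # approximate noise acquisition time after power-on

# Time needed to fill one frame buffer.
frame_delay_s = FRAME_LEN / SAMPLE_RATE_HZ

# Fraction of the allowed delay consumed by buffering one frame.
delay_fraction = frame_delay_s / MAX_DELAY_S

# Complete frames that fit in the start-up noise acquisition window.
noise_frames = int(NOISE_WINDOW_S * SAMPLE_RATE_HZ) // FRAME_LEN
```

Buffering one frame takes 32 ms, which is under a tenth of the 0.5 s limit, and the start-up window holds 15 complete frames of noise.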

Control of the whole system is also handled by the DSP chip. Because the TMS320C5410 DSP chip used in the system is very fast relative to the data acquisition rate, the system control functions can be implemented in the DSP program as well, which saves hardware resources.

Although the system was designed for the particular characteristics of electronic larynx speech, its theoretical algorithm and hardware are fairly general, and it also gives good enhancement for other types of pathological reconstructed speech such as esophageal speech and tracheoesophageal speech.

Claims

1. A method for enhancing electronic larynx speech, characterized in that the method uses a DSP-based hardware system and power spectral subtraction to denoise and enhance electronic larynx speech, so as to eliminate the background noise and random noise in the speech and improve the speech signal-to-noise ratio, subjective intelligibility, and pleasantness; the method specifically comprises the following steps:

2) Estimate the periodic background noise and random noise to obtain a one-frame noise estimate with a length of 20 to 40 ms;

4) Divide the digitized speech into frames, with an inter-frame overlap of 0% to 75%;

7) Use the phase of the noisy speech spectrum as the estimate of the clean speech spectral phase, recover the spectrum from the power spectrum, and apply an IFFT to return to the time domain, obtaining the clean speech estimate.

2. The method of claim 1, characterized in that the specific signal processing steps are as follows:

Let y(t) = s(t) + n(t), where y(t) is the noisy speech, s(t) is the clean speech, and n(t) = n_1(t) + n_2(t), with n_1(t) the periodic background noise and n_2(t) the random noise.

The spectrum of y(t) is then:

Y (ω) = Y_{R} (ω) + i Y_{I} (ω) = FFT [y (t)]

where Y(ω) is the spectrum of y(t), and Y_R(ω) and Y_I(ω) are the real and imaginary parts of Y(ω), respectively;

N (ω) = N_{R} (ω) + i N_{I} (ω) = FFT [n (t)]

The power spectrum of y(t) is then:

P_{Y} (ω) = Y_{R}^{2} (ω) + Y_{I}^{2} (ω)

The power spectrum estimate of the periodic background noise and random noise is:

P_{N} (ω) = N_{R}^{2} (ω) + N_{I}^{2} (ω)

The clean speech power spectrum estimate is:

P_{S} (ω) = P_{Y} (ω) - P_{N} (ω)

The clean speech spectrum estimate is:

\hat{S} (ω) = P_{S} (ω) \times (\frac{Y_{R} (ω)}{\sqrt{Y_{R}^{2} (ω) + Y_{I}^{2} (ω)}} + \frac{i Y_{I} (ω)}{\sqrt{Y_{R}^{2} (ω) + Y_{I}^{2} (ω)}})

The clean speech estimate is:

\hat{s} (t) = IFFT [\hat{S} (ω)] .

3. An electronic larynx speech enhancement system, characterized in that the system uses a DSP hardware system and power spectral subtraction to denoise and enhance electronic larynx speech; specifically comprising the following steps:

4) Divide the digitized speech into frames, with an inter-frame overlap of 0% to 75%;

7) Use the phase of the noisy speech spectrum as the estimate of the clean speech spectral phase, recover the spectrum from the power spectrum, and apply an IFFT to return to the time domain, obtaining the clean speech estimate;

The DSP hardware system comprises a speech data acquisition module (1), an A/D conversion module (2), a DSP chip (3), a power-on reset control circuit (4), a D/A conversion module (5), a filtering and shaping, power amplification, and speech output module (6), an extended program memory module (8), and an extended data memory module (9); wherein the speech data acquisition module (1) is connected to the DSP chip (3) through the A/D conversion module (2), the DSP chip (3) communicates with the D/A conversion module (5), the extended program memory module (8), and the extended data memory module (9), the power-on reset control circuit (4) is connected to the DSP chip (3), the A/D conversion chip (2), and the D/A conversion module (5), and the D/A conversion module (5) is also connected to the filtering and shaping, power amplification, and speech output module (6).

4. The electronic larynx speech enhancement system of claim 3, characterized in that the DSP chip (3), the A/D conversion chip (2), the D/A conversion chip (5), the power-on reset control circuit (4), and the required external interfaces are integrated on a single circuit board.

5. The electronic larynx speech enhancement system of claim 3, characterized in that the DSP chip (3) is the TMS320C5410 chip from TI, which has a maximum clock frequency of 100 MHz and 64k × 16-bit internal program RAM.

6. The electronic larynx speech enhancement system of claim 3 or 4, characterized in that the DSP chip (3) can also be connected to an emulator (7) through a JTAG interface.

7. The electronic larynx speech enhancement system of claim 6, characterized in that the JTAG interface can be connected to a host computer during system debugging and function expansion, for updating programs and data in real time.