CN107545888A - Automatically adjusting pharyngeal cavity electronic larynx voice communication system and method - Google Patents
- Publication number: CN107545888A (application CN201610466117.8A)
- Authority: CN (China)
- Prior art keywords: pharyngeal cavity, electronic larynx, larynx, lip, signal
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The present invention relates to a self-adjusting pharyngeal cavity electronic larynx speech synthesis and communication system and method. Built on a computer software platform with external hardware including a camera, a microphone and an electronic larynx oscillator, the system extracts visual speech feature information from moving images of the user's face and neck and uses it to automatically control the working state of the electronic larynx and the synthesis of the pharyngeal cavity voice source. The electronic larynx thus requires no hand-holding and is simpler and more convenient to use, and the problems of the mismatch between the synthesized voice source and the position at which the electronic larynx is applied, and of the mechanical, unnatural quality of electronic larynx speech, are solved. The reconstructed pharyngeal cavity electronic larynx speech is further subjected to dynamic denoising and enhancement processing, improving its quality and intelligibility, and remote real-time communication of electronic larynx speech is realized through network transmission technology, further extending the applications of the electronic larynx and improving the quality of life of laryngectomees.
Description
Technical field
The invention belongs to the technical field of pathological speech reconstruction and speech communication, and in particular relates to a pharyngeal cavity electronic larynx voice communication system and method capable of automatic adjustment.
Background technology
In China, a large number of patients lose the ability to phonate every year because of laryngectomy. The electronic larynx of the prior art is widely used because of its broad applicability, simple operation, capacity for prolonged phonation, and easily understood output. However, current electronic larynx speech is unnatural, the device is inconvenient to use, and the radiated background noise together with a large component of ambient noise seriously affects the intelligibility and pleasantness of the speech.
The electronic larynx currently used at home and abroad is mainly of the neck-external type. Its operating principle is that a waveform generator supplies a glottal voice source waveform to drive a transducer to vibrate. During use, however, the electronic larynx is applied not at the glottis but at the pharyngeal cavity position on either side of the neck. The modulating effect of the vocal tract section between the glottis and the pharyngeal cavity is therefore ignored, which distorts the reconstructed speech and impairs the use of electronic larynx speech.
How to improve electronic larynx speech so that the voice source frequency is adjusted automatically according to the needs of speech and language has been a focus of research at home and abroad in recent years. Existing approaches include regulating the oscillation frequency of the electronic larynx through finger pressure on a piezoresistive component, and adjusting the frequency and intensity of speech by controlling expiratory airflow and vocal cord tension. In 2004, E. A. Goldstein and colleagues at Harvard University proposed controlling the electronic larynx switch with features of the neck electromyographic signal, with good results. However, these methods all suffer from the disadvantages of being difficult to use, requiring complex training, and being costly.
With the development and popularization of computer and network technologies, the electronic larynx also needs to meet the demands of networking, yet no electronic larynx specifically adapted to network communication has so far been reported.
Content of the invention
In view of the above disadvantages of prior-art electronic larynges in application, namely difficulty of use, complex training methods and high cost, the present invention provides a self-adjusting pharyngeal cavity electronic larynx voice communication system and method. The system is based on a computer hardware platform and, through software, realizes pharyngeal cavity voice source synthesis that is automatically adjusted according to facial and neck motion features, so that the electronic larynx is easy to use without being hand-held. It also integrates enhancement processing of the reconstructed pharyngeal cavity electronic larynx speech, and realizes networked real-time communication of electronic larynx speech through Internet technology, further extending the functions of the electronic larynx.
An automatically adjusting pharyngeal cavity electronic larynx voice communication system comprises a microphone, a camera, an electronic larynx oscillator, an audio-video acquisition module and a computer hardware and software system. The camera and the microphone are fixed on a microphone bracket, a fixing band is provided below the earphone, and the electronic larynx oscillator is arranged on the fixing band. The system further comprises the following three main modules:
1) a module for acquiring and processing images of face and neck motion during phonation, which extracts visual speech feature parameters from the moving images;
2) a pharyngeal cavity voice source dynamic synthesis module, which converts the extracted visual speech feature parameters into voice source synthesis model parameters and synthesizes the waveform according to a pharyngeal cavity voice source mathematical model;
3) a module for real-time enhancement and network communication of the reconstructed pharyngeal cavity electronic larynx speech, which applies real-time enhancement processing to the collected speech and transmits the processed speech over the network to realize remote communication.
The camera transfers the collected moving images as an input signal through a data cable to the moving image processing module for visual speech feature parameter extraction. The visual speech feature parameters output by the image processing in turn enter the pharyngeal cavity voice source synthesis module as input signals to control waveform synthesis. The synthesized pharyngeal cavity voice source waveform is then output through a data cable to the electronic larynx oscillator, which is placed at the pharyngeal cavity of the neck. The reconstructed pharyngeal cavity electronic larynx speech, collected by the microphone device, is input through a data cable to the speech enhancement module, which also receives control signal input. The input of the communication module comprises two parts, the video signal collected by the camera and the enhanced speech signal, which are finally output through the network to another client; the audio-video signals sent by the other client are likewise received and played in the communication module.
Audio-video signals are transmitted from the electronic larynx wearing device to the computer for processing; the electronic larynx switching signal extracted by the computer and the synthesized pharyngeal cavity voice source signal are transmitted to the electronic larynx wearing device; and the electric energy required for the operation of the wearing device is supplied by the computer.
In a method of automatically adjusted pharyngeal cavity electronic larynx voice communication, the audio and video acquisition modules start working simultaneously. The camera of the video acquisition module obtains moving images of the user's face and neck during phonation as system input. The image processing module preprocesses the input images to remove interference signals, then performs target region localization, segmentation, feature parameter extraction and tracking of the motion of the feature regions, obtaining visual speech feature parameters related to phonation. The automatic control synthesis system then derives, through mapping relations, the model parameters and switching signal required for pharyngeal cavity voice source synthesis from the visual speech parameters, and controls the synthesis of the pharyngeal cavity voice source waveform and the vibration of the wearing device. At the same time, the microphone of the audio acquisition module records the reconstructed pharyngeal cavity electronic larynx speech signal. Guided by the switching control signal and the pharyngeal cavity voice source synthesis signal, the leaked periodic noise and ambient noise are estimated, the spectral subtraction parameters are adjusted, and spectral subtraction speech enhancement is applied to the voiced frames. Finally, the video images and the enhanced audio signal are integrated and then sent, received and played locally through the network module, realizing remote communication.
The system and method of the present invention extract visual speech feature information from moving images of the user's face and neck and use it to automatically control the working state of the electronic larynx and the synthesis of the pharyngeal cavity voice source. No hand-holding is required during use, making the device simpler and more convenient, and the problems of the mismatch between the synthesized voice source and the application position of the electronic larynx and of the unnaturalness of electronic larynx speech are solved. Dynamic denoising and enhancement of the reconstructed pharyngeal cavity electronic larynx speech improve its quality and intelligibility, and network transmission technology realizes remote real-time communication of electronic larynx speech, extending the applications of the electronic larynx and improving the quality of life of laryngectomees.
Brief description of the drawings
Fig. 1 is a structural representation of the pharyngeal cavity electronic larynx speech synthesis and communication system of the present invention.
Fig. 2 is a flow chart of the moving image processing routine of the present invention.
Fig. 3 is a continuous speech waveform diagram according to the present invention.
Fig. 4 is a comparison diagram of the lip feature curve (solid line), threshold (dashed line) and switching signal (dotted line) corresponding to the continuous speech of Fig. 3.
Fig. 5 is a schematic diagram of pharyngeal cavity voice source synthesis according to the present invention.
Fig. 6 is an external view of the electronic larynx wearing device of the present invention, wherein the labels respectively denote: earphone 1; electronic larynx oscillator 2; camera and microphone 3; connecting cable 4; fixing band 5.
Fig. 7 is a flow chart of the pharyngeal cavity electronic larynx speech enhancement processing of the present invention.
Embodiment
The present invention is described in further detail below in conjunction with the accompanying drawings.
The present invention is based on a computer hardware system. Audio-video acquisition devices such as a microphone and a camera collect in real time the moving images of the user's face and neck during phonation and the reconstructed pharyngeal cavity electronic larynx speech. System software designed by computer programming realizes functions such as visual speech feature parameter extraction and pharyngeal cavity voice source synthesis, completing the automatically controlled synthesis of the pharyngeal cavity electronic larynx voice source waveform, which is then output as vibration by an oscillator applied at the pharyngeal cavity of the neck. The reconstructed speech is collected and processed by speech enhancement, and the function of remote communication is finally realized by the network communication module.
The structure of the whole system is shown in Fig. 1. The image acquisition module is connected to the automatic control module through the image processing module, while the image acquisition module is also bidirectionally connected to the external network through the communication module; the voice acquisition module and the speech enhancement module are bidirectionally connected to the external network through the communication module; and the automatic control module is connected to the electronic larynx wearing device through the voice source synthesis module. After the system starts, the audio and video acquisition modules begin working simultaneously. The camera of the video acquisition module obtains moving images of the user's face and neck during phonation as system input. The image processing module preprocesses the input images to remove interference signals, then performs target region localization, segmentation, feature parameter extraction and tracking of feature region motion, obtaining visual speech feature parameters related to phonation. The automatic control synthesis system then derives, through mapping relations, the model parameters and switching signal required for pharyngeal cavity voice source synthesis from the visual speech parameters, and controls the synthesis of the pharyngeal cavity voice source waveform and the vibration of the wearing device. At the same time, the microphone of the audio acquisition module records the reconstructed pharyngeal cavity electronic larynx speech signal; guided by the switching control signal and the voice source synthesis signal, the leaked periodic noise and ambient noise are estimated, the spectral subtraction parameters are adjusted, and spectral subtraction enhancement is applied to the voiced frames. Finally, the video images and the enhanced audio signal are integrated and then sent, received and played locally through the network module, realizing remote communication.
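The networked sending and receiving described above runs over a TCP connection (the communication module is stated later to be TCP-based). As a hedged sketch only, the integrated audio and video payloads could be carried over the byte stream with a minimal length-prefixed framing scheme such as the following; the function names, channel tags and 4-byte header are illustrative assumptions, not details from the patent.

```python
import struct

HEADER = struct.Struct(">BI")  # 1-byte channel tag + 4-byte big-endian payload length
CH_AUDIO, CH_VIDEO = 0, 1

def pack_frame(channel: int, payload: bytes) -> bytes:
    """Prefix a payload with its channel tag and length so frame boundaries
    survive TCP's byte-stream semantics."""
    return HEADER.pack(channel, len(payload)) + payload

def unpack_frames(buffer: bytes):
    """Extract complete (channel, payload) frames from a receive buffer and
    return them together with any leftover partial bytes."""
    frames, offset = [], 0
    while offset + HEADER.size <= len(buffer):
        channel, length = HEADER.unpack_from(buffer, offset)
        if offset + HEADER.size + length > len(buffer):
            break  # incomplete frame; wait for more data
        start = offset + HEADER.size
        frames.append((channel, buffer[start:start + length]))
        offset = start + length
    return frames, buffer[offset:]
```

In such a scheme the sender would call `pack_frame` for each enhanced audio block and each video frame, and the receiver would feed arriving bytes into `unpack_frames` before playback.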
The first module of the present invention consists of the face and neck moving image acquisition and processing module. Starting from visual speech features, it collects moving images of the face and neck during phonation with the camera and takes the video signal as system input. Through preprocessing, target region detection and localization, and feature region segmentation and tracking, it extracts the lip and neck visual speech feature parameters reflecting phonation features, including the degree of lip opening and closing and the neck motion signal, and outputs them to guide the synthesis of the pharyngeal cavity voice source, realizing real-time automatic adjustment of electronic larynx speech.
The second module of the present invention consists of the automatic control module, the dynamically adjustable pharyngeal cavity voice source synthesis module and the electronic larynx wearing device. Taking the extracted lip and neck visual speech feature parameters as input, the module converts them, through the correspondence between visual features and phonation features, into the corresponding pharyngeal cavity voice source model parameters, including the switching signal controlling electronic larynx synthesis, the voice source pitch variation parameters, and the supraglottal vocal tract shape parameters. From these parameters the pharyngeal cavity voice source waveform is dynamically synthesized according to a source-filter model, output through the external electronic larynx oscillator, and applied at the pharyngeal cavity position of the neck. To address the inconsistency between the application position and the synthesized voice source, the module takes into account, when synthesizing the voice source, the modulating effect of the vocal tract section from the glottis to the pharyngeal cavity, providing a pharyngeal cavity voice source waveform consistent with the application position.
The third module of the present invention consists of the voice acquisition module and the real-time enhancement and communication module for the reconstructed pharyngeal cavity electronic larynx speech. Speech enhancement is based on adjustable-parameter spectral subtraction. The voice source synthesis signal serves as a reference to guide the estimation of the noise radiated by the electronic larynx; according to the background noise characteristics of pharyngeal cavity electronic larynx speech, suitable spectral subtraction coefficients are dynamically selected. Combined with the electronic larynx switching control signal, enhancement is applied selectively to voiced speech, silent frames are output as silence, and the background noise estimate is updated at the same time. Network communication is based on the Transmission Control Protocol (TCP); the client computer has three working units, for sending audio-video signals, receiving audio-video signals, and playing audio-video signals locally, finally realizing audio-video communication of electronic larynx speech.
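As a hedged sketch of the adjustable-parameter spectral subtraction described above (the frame length, over-subtraction factor, spectral floor and smoothing constant are illustrative assumptions, not values from the patent), one voiced frame could be enhanced, and the noise estimate updated on silent frames, as follows:

```python
import numpy as np

def spectral_subtract(frame, noise_mag, alpha=2.0, beta=0.01):
    """Subtract a scaled noise magnitude estimate from one voiced frame.

    frame     : time-domain samples of a voiced frame
    noise_mag : magnitude spectrum estimated from silent frames, guided by
                the voice source reference signal
    alpha     : over-subtraction coefficient (dynamically selectable)
    beta      : spectral floor that limits musical noise
    """
    spec = np.fft.rfft(frame)
    mag, phase = np.abs(spec), np.angle(spec)
    clean = np.maximum(mag - alpha * noise_mag, beta * mag)  # floor, never zero out
    return np.fft.irfft(clean * np.exp(1j * phase), n=len(frame))

def update_noise(noise_mag, silent_frame, smooth=0.9):
    """Recursively update the background noise estimate during silent frames."""
    return smooth * noise_mag + (1 - smooth) * np.abs(np.fft.rfft(silent_frame))
```

The switching control signal would decide per frame whether `spectral_subtract` (voiced frame) or `update_noise` followed by silent output (unvoiced frame) is applied.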
The software part of the system uses streaming media development techniques. The whole software design is divided into a three-layer structure separating user interface, control logic and data; the modular design keeps the functional modules independent of one another, with low coupling.
The implementation flow of the image processing part is shown in Fig. 2. Each input video frame first undergoes preprocessing to eliminate the influence of background noise, slow motions (such as breathing and swallowing) and illumination-related interference. The processed image is then subjected to skin-color-based face detection: skin-color filters in different color spaces are selected to obtain skin-color space images of the lips, face and neck. In each skin-color space the optimal threshold is obtained with an improved maximum between-class variance (Otsu) method, yielding pre-segmentation images of the lips, face and neck. Because of influences such as illumination and skin color, the pre-segmented images may contain small, scattered interference blobs; a threshold-area elimination method removes the smaller interference blobs while retaining the larger target regions. Different feature parameters are then extracted for the different feature parts, yielding the different control signals.
The processing of the face image mainly uses changes in the lip shape feature to detect the electronic larynx switching signal marking the start and stop of phonation. The specific steps are as follows:
1) initialize parameters and capture one video frame;
2) compute the lip-color feature values within a specified rectangular range using a lip-color filter, and normalize them to 0-255 gray levels to obtain a lip-color feature image. If a previous frame exists, its lip region range and mean feature value guide the computation for this frame;
3) compute the optimal segmentation threshold with the improved maximum between-class variance (Otsu) method and binarize the image with it, obtaining the lip pre-segmentation image. If a previous frame exists, its segmentation threshold guides the calculation of this frame's threshold;
4) apply threshold-area elimination to the lip pre-segmentation image, removing small image noise and background interference blobs;
5) extract the contour and center point of the lip region, and obtain the parameters of the ellipse matching the lips, chiefly the semi-major and semi-minor axes, by improved one-dimensional Hough transform detection; the lip region range is obtained at the same time to guide the lip-color feature computation of the next frame. If a previous frame exists, its semi-axes guide the ellipse matching of this frame;
6) use the ratio of the semi-axes as the mouth-shape criterion; comparison with a threshold yields the switch level signal, which is output as the electronic larynx switching control signal.
The processing of the neck image mainly uses the motion signal of the neck region above the larynx to extract the control signals for voice source fundamental frequency and amplitude variation. The specific steps are as follows:
1) initialize parameters and capture one video frame;
2) compute the skin-color feature values within a specified rectangular range using a skin-color filter, and normalize them to 0-255 gray levels to obtain a skin-color feature image. If a previous frame exists, its laryngeal neck region range guides the computation range of this frame;
3) compute the optimal segmentation threshold with the maximum between-class variance (Otsu) method and binarize the image, obtaining the face and neck skin-color region image. If a previous frame exists, its segmentation threshold guides the calculation of this frame's threshold;
4) apply threshold-area elimination to the segmented image, removing small image noise and background interference blobs;
5) with reference to the lower lip edge information, segment the laryngeal neck target region extending from below the lips to the bottom of the skin-color region in the image, and save its range to guide the skin-color feature computation of the next frame;
6) compute the optical flow field of the lower laryngeal neck region with the Lucas-Kanade differential method, obtaining velocity component information reflecting the motion features;
7) perform cluster analysis on the optical flow field, compute the distance between its average and each cluster center, determine frequency and amplitude changes from these distances, and obtain the frequency and amplitude variation coefficients as input to the pharyngeal cavity voice source synthesis parameters.
The system adopts skin-color-based face detection: exploiting the clustering property of skin color, it computes lip-color and skin-color feature values in the YUV color space to strengthen the discrimination between target regions and background.
After the target is strengthened, segmentation follows. The system selects the optimal segmentation threshold with an improved maximum between-class variance (Otsu) method. To adapt it to lip-color and skin-color segmentation and to improve execution efficiency, the following improvements are made in the system:
1) the Otsu solution does not depend on the gray value or on a single color component of the RGB image; instead, the lip-color and skin-color feature values of each pixel are normalized to gray levels 0-255, and the optimal threshold T is sought on this gray-scale map with the Otsu method;
2) based on temporal continuity and the continuity of threshold variation, the optimal segmentation threshold of the previous frame is taken as a starting point and the optimal threshold of the current frame is searched within its neighborhood, which not only satisfies the segmentation requirements but also improves execution speed.
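The improved thresholding above can be sketched as follows; this is a minimal illustration of Otsu's criterion with the frame-to-frame neighborhood search, where the search radius of 15 gray levels is an assumption for illustration only.

```python
import numpy as np

def otsu_threshold(gray, lo=0, hi=255):
    """Maximum between-class variance (Otsu) threshold on a 0-255 feature map,
    optionally restricted to the search window [lo, hi]."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    best_t, best_var = lo, -1.0
    for t in range(lo, hi + 1):
        w0 = hist[:t + 1].sum()          # background weight
        w1 = total - w0                  # foreground weight
        if w0 == 0 or w1 == 0:
            continue
        m0 = (np.arange(t + 1) * hist[:t + 1]).sum() / w0
        m1 = (np.arange(t + 1, 256) * hist[t + 1:]).sum() / w1
        var = w0 * w1 * (m0 - m1) ** 2   # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def tracked_threshold(gray, prev_t=None, radius=15):
    """Search only a neighborhood of the previous frame's threshold,
    keeping the threshold curve smooth between frames."""
    if prev_t is None:
        return otsu_threshold(gray)
    return otsu_threshold(gray, max(0, prev_t - radius), min(255, prev_t + radius))
```

The first frame falls back to a full-range search; subsequent frames reuse the previous threshold as the center of a narrow search window.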
Noise reduction uses the area-threshold elimination method: noise and interference blobs are removed and the target regions are retained. The area threshold is set to one fiftieth of the size of the rectangular tracking frame.
After image denoising, the accurate lip and neck target regions are obtained, meeting the requirements of the feature parameter extraction algorithms. Different extraction methods are used for the different feature parts: the lip region mainly uses the mouth-shape feature, so an ellipse detection method is used; the neck region mainly uses the motion feature, so the optical flow method is used to extract velocity information.
In general, five parameters determine an ellipse: the center coordinates, the semi-major and semi-minor axes, and the angle between the major axis and the X axis. The present invention uses only the outer contour shape information of the lips and, in consideration of the real-time requirement, assumes that the major axis forms a 0-degree angle with the X axis; the ellipse center coordinates are approximated by averaging the lip contour points. Only the two parameters, semi-major axis a and semi-minor axis b, remain, and the optimal parameters are obtained with a one-dimensional Hough transform, which greatly improves efficiency while meeting the requirements.
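Under the stated assumptions (axis-aligned ellipse, center from the contour mean), the one-dimensional vote can be sketched as below; taking the semi-major axis directly from the horizontal extent and using 50 histogram bins are simplifying assumptions of this sketch, not details from the patent.

```python
import numpy as np

def fit_lip_ellipse(points):
    """Estimate (a, b) of an axis-aligned ellipse from lip contour points by a
    one-dimensional Hough vote on the semi-minor axis b.

    points : (N, 2) array of (x, y) contour coordinates
    """
    pts = np.asarray(points, dtype=float)
    cx, cy = pts.mean(axis=0)                # center approximated by the contour mean
    dx, dy = pts[:, 0] - cx, pts[:, 1] - cy
    a = np.abs(dx).max()                     # semi-major axis from horizontal extent
    ratio = np.clip(dx / a, -0.999, 0.999)   # avoid division blow-up near x = +/- a
    b_votes = np.abs(dy) / np.sqrt(1.0 - ratio ** 2)  # each point votes for one b
    hist, edges = np.histogram(b_votes, bins=50)
    k = np.argmax(hist)
    b = 0.5 * (edges[k] + edges[k + 1])      # peak of the 1-D accumulator
    return a, b
```

Because only b is accumulated, the vote is one-dimensional, which is what makes the detection fast enough for per-frame use.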
According to the extracted ellipse shape parameters, the present invention selects the ratio b/a of the semi-minor to the semi-major axis as the judgment index. Fig. 3 shows a continuous speech waveform, and Fig. 4 compares the corresponding lip feature curve (solid line), threshold (dashed line) and switching signal (dotted line). It can be seen that the b/a value has good shape invariance: it overcomes misjudgments caused by changes of lip size in the image due to capture distance, and accurately reflects mouth-shape changes. The judgment signal obtained from it fits the speech waveform well, and the judgment accuracy is high. For continuous speech a delay mode is used, which removes the OFF signals caused by pauses between words, so that the ON signal is maintained during speech and an OFF signal appears only after a long pause, matching electronic larynx usage habits.
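The threshold comparison with the delay mode described above can be sketched as a simple hold-off state machine; the threshold value and hold length below are illustrative assumptions, not values from the patent.

```python
import numpy as np

def switch_signal(ba_ratio, threshold=0.25, hold_frames=10):
    """Turn the per-frame b/a mouth-shape ratio into an ON/OFF switch signal.

    The switch turns ON as soon as b/a exceeds the threshold, and turns OFF
    only after `hold_frames` consecutive sub-threshold frames, so that short
    between-word pauses do not interrupt the electronic larynx.
    """
    out, state, below = [], 0, 0
    for r in ba_ratio:
        if r > threshold:
            state, below = 1, 0          # mouth open: ON, reset pause counter
        else:
            below += 1
            if below >= hold_frames:
                state = 0                # long pause: OFF
        out.append(state)
    return np.array(out)
```

Short pauses (fewer frames than the hold length) are bridged over, while a sustained pause produces the OFF signal.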
The present invention extracts the small motions of the neck with the Lucas-Kanade differential method. Taking the target pixel as the center, a suitable neighborhood is chosen and the optical flow of the pixel is computed with the Lucas-Kanade equations over the whole neighborhood; computing the whole image in the same way yields the optical flow field of the whole image.
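A minimal single-pixel version of this computation, assuming small displacements and the simple gradient estimates below (window size and gradient scheme are illustrative choices), could look like:

```python
import numpy as np

def lucas_kanade(prev, curr, y, x, win=7):
    """Estimate the optical flow (vx, vy) at pixel (y, x) by solving the
    Lucas-Kanade least-squares system over a (win x win) neighborhood."""
    half = win // 2
    Iy, Ix = np.gradient(prev.astype(float))       # spatial gradients (axis 0 = y)
    It = curr.astype(float) - prev.astype(float)   # temporal gradient
    sl = (slice(y - half, y + half + 1), slice(x - half, x + half + 1))
    A = np.stack([Ix[sl].ravel(), Iy[sl].ravel()], axis=1)
    b = -It[sl].ravel()
    v, *_ = np.linalg.lstsq(A, b, rcond=None)      # least-squares solve of A v = b
    return v  # (vx, vy)
```

Sweeping this over all pixels of the neck region yields the dense optical flow field used by the cluster analysis below; a practical implementation would of course reuse the gradient images rather than recompute them per pixel.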
The motion of the neck image contains information about frequency change. Based on experimental statistics, cluster analysis of the optical flow according to frequency change yields two typical clusters, a frequency-rise cluster and a frequency-fall cluster. The optical flow field extracted from each frame is compared by distance with the cluster templates: when the distance is below a certain limit, a rise or fall is recognized; otherwise the frequency is considered unchanged. The result is output as the frequency change parameter.
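The per-frame distance judgment against the two cluster templates can be sketched as follows; the template vectors and distance limit here are illustrative placeholders, whereas in the described system they would come from experimental statistics.

```python
import numpy as np

# Illustrative cluster templates for the mean optical flow vector (vx, vy);
# image y increases downward, so upward neck motion has negative vy.
RISE_TEMPLATE = np.array([0.0, -1.0])
FALL_TEMPLATE = np.array([0.0, 1.0])

def frequency_change(flow_field, limit=0.5):
    """Classify one frame's flow field as frequency rise (+1), fall (-1),
    or unchanged (0) by distance to the cluster templates.

    flow_field : (N, 2) array of per-pixel flow vectors for the neck region
    """
    mean_flow = np.asarray(flow_field, dtype=float).mean(axis=0)
    d_rise = np.linalg.norm(mean_flow - RISE_TEMPLATE)
    d_fall = np.linalg.norm(mean_flow - FALL_TEMPLATE)
    if min(d_rise, d_fall) > limit:
        return 0            # neither template matched: frequency unchanged
    return 1 if d_rise < d_fall else -1
```

The returned value would then drive the fundamental frequency adjustment of the voice source synthesis parameters.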
A complete video signal contains spatial and temporal information, corresponding respectively to intra-frame and inter-frame information. Based on the assumption that the face and neck change slowly and continuously during speech, the image processing of the present invention uses a real-time tracking control method combining the spatial and temporal domains: the segmentation information of the previous frame guides the segmentation of the target region in the current frame. This makes good use of both intra-frame and inter-frame information, compensating for the inaccuracy of static image segmentation and increasing segmentation speed.
The tracking control method is mainly embodied in the following aspects of the system of the present invention:
1) In feature region detection, the lip and neck target region ranges obtained in the previous frame guide the setting of the detection range of the current frame; this reduces the size of the processed image and removes part of the background interference, improving subsequent processing.
2) When the maximum between-class variance (Otsu) method solves for the segmentation threshold, the optimal threshold of the previous frame narrows the threshold search range of the current frame; this reduces computation, avoids locally optimal segmentation thresholds and abrupt threshold jumps between frames, and ensures a smooth threshold curve.
3) In one-dimensional Hough transform ellipse detection, the semi-minor axis b of the previous frame narrows the search range of b in the current frame, ensuring tracking continuity and preventing abrupt transitions in the Hough transform itself; meanwhile, a correction criterion is set so that if the b/a value does not fall within the normal range of mouth-shape ratios, the result is discarded and the previous frame's result is kept.
While meeting the real-time requirement, the image processing part of the present invention successfully extracts the various speech synthesis parameter control signals from the video signal; these control signals automatically adjust the synthesis of the pharyngeal cavity voice source and assist the enhancement processing of the reconstructed speech.
The automatically controlled synthesis of the pharyngeal cavity voice source is guided by the pharyngeal cavity voice source model: the visual speech feature parameters extracted from the moving images automatically adjust the synthesis parameters of the model, thereby achieving automatically controlled synthesis of the pharyngeal cavity voice source waveform, which is finally output as vibration through the electronic larynx wearing device.
Pharyngeal cavity voice source waveform synthesis in the present invention uses a source-filter model. As shown in Fig. 5, first the parametric model of the glottal voice source is used: according to the acquisition system parameters, the extracted switching signal and model parameter signals, and the user parameters, the model parameter values are adjusted and set, and the glottal voice source waveform is synthesized according to the mathematical model. Second, using a single-tube model of uniform cross-sectional area, the vocal tract model parameters are adjusted according to the control signals, the frequency response function of the supraglottal vocal tract is synthesized, and the glottal voice source waveform is modulated by it to finally synthesize the pharyngeal cavity voice source.
The glottal voice source is synthesized with a piecewise parametric model, expressed mathematically as follows:

$$u_g(i)=\begin{cases}A\sin\left(\dfrac{\pi i}{n_1}\right), & 0<i\le n_1\\[4pt] -A\sin\left(\dfrac{\pi\,(i-n_1)}{2n_2}\right), & n_1<i\le n_1+n_2\\[4pt] -A\,\alpha\,\tau_{\mathrm{sup}}^{\,i-n_1-n_2}\cos\left(\dfrac{2\pi\lambda\,(i-n_1-n_2)}{N}\right), & n_1+n_2<i\le N=n_1+n_2+n_3\end{cases}$$
wherein τ_sup is the supraglottal damped oscillation coefficient and α is the closed-phase amplitude attenuation factor, both set according to experiment; n₁, n₂ and n₃ are the shape parameters of one period of the voice source waveform, representing respectively the lengths of the open-phase ascending segment, the open-phase descending segment and the closed phase, with their ratio set according to the phonation mode; N is the period length, i.e. N = n₁ + n₂ + n₃; A is the amplitude control; and λ is the ratio of the first supraglottal vocal tract formant frequency F₁ to the fundamental frequency f₀. These values are dynamically adjusted according to the extracted control signals.
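A direct rendering of the piecewise model above is sketched below. The segment lengths, attenuation constants and λ are illustrative values only, and the equation itself is reconstructed from a garbled original, so this should be read as one plausible realization rather than the patent's exact waveform.

```python
import numpy as np

def glottal_source(N_periods=3, n1=30, n2=20, n3=50, A=1.0,
                   alpha=0.9, tau=0.95, lam=4.0):
    """Synthesize the piecewise parametric glottal voice source waveform.

    n1, n2, n3 : open-phase ascending, open-phase descending, closed-phase lengths
    alpha      : closed-phase amplitude attenuation factor
    tau        : supraglottal damped oscillation coefficient
    lam        : ratio F1 / f0 of the supraglottal model
    """
    N = n1 + n2 + n3
    i = np.arange(1, N + 1)
    u = np.empty(N)
    seg1 = i <= n1
    seg2 = (i > n1) & (i <= n1 + n2)
    seg3 = i > n1 + n2
    u[seg1] = A * np.sin(np.pi * i[seg1] / n1)                 # open-phase lobe
    u[seg2] = -A * np.sin(np.pi * (i[seg2] - n1) / (2 * n2))   # descending branch
    k = i[seg3] - n1 - n2
    u[seg3] = -A * alpha * tau ** k * np.cos(2 * np.pi * lam * k / N)  # damped closed-phase oscillation
    return np.tile(u, N_periods)
```

Changing n1:n2:n3 changes the phonation mode, while A, λ and the period length N are the quantities the control signals adjust frame by frame.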
Because the supraglottal vocal tract from the glottis to the pharyngeal cavity is short, it can be approximated as a single tube of uniform cross-sectional area, whose frequency response function and formant frequencies are:

$$H(f)=\frac{1}{\cos(2\pi f l/c)}$$

$$F_n=\frac{(2n-1)c}{4l}=(2n-1)F_1 \quad (n=1,2,3,\dots)$$

wherein l is the vocal tract length, which can be dynamically adjusted within a small range by the control parameters, and c is the speed of sound. According to the above formula, a change in l shifts the first supraglottal formant, and the voice source synthesis parameter λ is adjusted at the same time.
In the present invention the dynamic adjustment of model parameters such as the fundamental frequency f₀, the amplitude A and the vocal tract length l takes the previous frame's value as the reference, with appropriate adjustments made according to the control signals. For the first frame, initial values are used: the initial fundamental frequency f₀ is set according to the average fundamental frequency for the user's sex, the amplitude A can also be set by the user according to the effect, and the vocal tract length l is set according to the experimental average. Finally, the pharyngeal cavity voice source waveform is obtained by modulating the glottal voice source waveform with the supraglottal vocal tract.
The synthesized pharyngeal-cavity voice-source waveform is output as vibration by the electrolarynx applying device, which is placed against the pharyngeal region below the neck; its appearance is shown in Fig. 6. The overall design resembles a headset: the camera and microphone are fixed on the microphone boom, a fixing band is provided below the earphone, and the electrolarynx vibrator is mounted on it. All components are integrated into one frame, so in use the device is simply fastened at the required position, with no need to hold it by hand. The position of the electrolarynx vibrator on the connecting band is adjustable to suit different users.
The whole electrolarynx applying device is connected to the computer system through a standard universal serial bus (USB) interface for signal transfer, which covers three aspects: first, audio and video signals are transmitted from the applying device to the computer for processing; second, the electrolarynx switch signal extracted by the computer and the synthesized pharyngeal-cavity voice-source signal are transferred to the applying device; third, the electric energy needed for the applying device's operation is supplied by the computer.
The specific flow of pharyngeal-cavity electrolarynx speech enhancement in the present invention is shown in Fig. 7. The method is based on spectral subtraction with adjustable parameters: the switch signal decides whether a frame contains electrolarynx speech; silent frames are output as silence while the environmental-noise estimate is updated, and voiced frames are enhanced by adjustable-parameter power-spectral subtraction, eliminating the leaked periodic noise and environmental noise carried in the speech and improving the signal-to-noise ratio, subjective intelligibility and pleasantness of the reconstructed speech.
The pharyngeal-cavity electrolarynx speech enhancement method assumes that the periodic background noise, the environmental noise and the reconstructed speech are all short-time stationary and mutually uncorrelated, and performs parametric power-spectral subtraction in the frequency domain:

|Ŝ(ω)|² = |Y(ω)|² − α·|N̂(ω)|²,  if |Y(ω)|² > thres·|N̂(ω)|²
|Ŝ(ω)|² = β·|N̂(ω)|²,  otherwise

where Y(ω), S(ω) and N(ω) are the spectra of the noisy speech, the clean speech and the noise respectively; thres is a threshold coefficient whose value is set by experimental statistics; α is the adjustable over-subtraction parameter and β the spectral smoothing (floor) coefficient, whose values are adjusted dynamically according to the ratio of noisy-speech energy to estimated noise energy. That is, setting

γ = |Y(ω)|² / |N̂(ω)|²,

the subtraction coefficients are adjusted as follows:
α = 1 + γ/k1,  β = γ/k2

where the two coefficients k1 and k2 are set by statistical experiment.
The clean-speech estimate is then:

ŝ(t) = IFFT[ |Ŝ(ω)| · e^(j·arg Y(ω)) ]
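A one-frame sketch of adjustable-parameter power-spectral subtraction along the lines described above; the floor rule, the values of k1 and k2, and the omission of the thres test are simplifying assumptions, not the patent's exact procedure:

```python
import numpy as np

def parametric_spectral_subtraction(frame, noise_psd, k1=20.0, k2=50.0):
    """One frame of parametric power-spectral subtraction.

    alpha = 1 + gamma/k1 (over-subtraction), beta = gamma/k2 (spectral
    floor), with gamma the frame-to-noise energy ratio.  k1, k2 and the
    floor rule are illustrative assumptions.
    """
    Y = np.fft.fft(frame)
    Ypow = np.abs(Y) ** 2
    gamma = Ypow.sum() / max(noise_psd.sum(), 1e-12)
    alpha = 1.0 + gamma / k1
    beta = gamma / k2
    # Over-subtract the noise estimate, clamped to a spectral floor
    Spow = Ypow - alpha * noise_psd
    floor = beta * noise_psd
    Spow = np.where(Spow > floor, Spow, floor)
    # Recombine the magnitude estimate with the noisy phase
    S = np.sqrt(Spow) * np.exp(1j * np.angle(Y))
    return np.fft.ifft(S).real
```

With a zero noise estimate the frame passes through unchanged, which is a convenient sanity check.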
The most critical part of spectral-subtraction enhancement is noise estimation. The system uses the switch control signal, the voice-source synthesis parameters and related information to estimate the electrolarynx leakage periodic noise and the environmental noise separately.
The electrolarynx leakage noise is periodic, with a period matching the electrolarynx vibration period; it can therefore be estimated from parameters of the synthesized pharyngeal-cavity voice-source waveform such as the fundamental frequency f0 and the amplitude A. As the voice-source synthesis is regulated dynamically, the leakage-noise estimate is adjusted along with it, so the noise estimate stays up to date.
The estimation of the environmental noise is divided into two parts, initial noise estimation and noise updating:
The initial noise is estimated before the system starts and before the user phonates: L frames of noise are collected continuously and their average power spectrum is computed as the initial noise power spectrum:

|N̂0(ω)|² = (1/L) · Σ_{l=1..L} |N_l(ω)|²
A further M frames of noise are then collected, and their power spectra are checked against the condition:

(1 − χ)·|N̂0(ω)|² < |N_m(ω)|² < (1 + χ)·|N̂0(ω)|²
If the condition is satisfied, the initial noise estimate is accepted and estimation terminates; otherwise the noise is re-collected and re-estimated. In the formula above, χ is a relaxation coefficient that should be neither too large nor too small; the system takes χ = 0.4.
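The two-stage initial estimate can be sketched as follows; checking the (1 ± χ) condition on total frame energy rather than per frequency bin is a simplifying assumption of this sketch:

```python
import numpy as np

def estimate_initial_noise(frames_L, frames_M, chi=0.4):
    """Initial environmental-noise estimation.

    Average the power spectra of L silent frames, then verify that each
    of M further frames stays within (1 +/- chi) of the estimate (here
    checked on total frame energy; the text states the per-bin
    condition).  Returns (noise_psd, accepted).
    """
    psd_L = np.abs(np.fft.fft(frames_L, axis=1)) ** 2
    noise_psd = psd_L.mean(axis=0)
    psd_M = np.abs(np.fft.fft(frames_M, axis=1)) ** 2
    ref = noise_psd.sum()
    accepted = all((1 - chi) * ref < p.sum() < (1 + chi) * ref
                   for p in psd_M)
    return noise_psd, accepted
```

When `accepted` is False the caller would re-collect the L frames, mirroring the re-estimation branch above.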
Noise updating is a very important step in environmental-noise estimation, since the environmental noise cannot be assumed stationary throughout the electrolarynx's working period; the system updates the noise adaptively by weighted averaging, expressed as:

|N̂_k(ω)|² = λ·|N̂_{k−1}(ω)|² + (1 − λ)·|Y_k(ω)|²,  applied when |Y_k(ω)|² < ε·|N̂_{k−1}(ω)|²

where |N̂_k(ω)|² is the current noise power-spectrum estimate, |N̂_{k−1}(ω)|² is the previous frame's estimate, and λ and ε are fixed coefficients. Considering the algorithm's stability and its tracking of non-stationary noise, λ is generally taken as 0.9~0.98 and ε as 1.5~2.5.
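A sketch of the weighted-average update; the frame-power test used here to decide when a frame is noise-dominated is an assumed detail, since the text only fixes the roles and ranges of λ and ε:

```python
import numpy as np

def update_noise_psd(noise_psd, frame_psd, lam=0.95, eps=2.0):
    """Adaptive noise update by weighted averaging.

    A frame whose total power is below eps times the running estimate is
    treated as noise-dominated and averaged in (this dominance test is
    an assumed detail); lam in 0.9-0.98 and eps in 1.5-2.5 per the text.
    """
    if frame_psd.sum() < eps * noise_psd.sum():
        return lam * noise_psd + (1.0 - lam) * frame_psd
    return noise_psd
```

Frames dominated by speech leave the estimate untouched, which is what keeps the tracker stable during voiced segments.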
The network communication part mainly implements Socket transmission modules for the audio and video data locally, and the corresponding Socket receiving modules at the remote end, where the data are then played back locally. The module transmits audio and video data separately, creating one Socket connection for each; on each Socket, sending and receiving can proceed simultaneously. Because the audio and video are sent and received in synchrony, the synchronization problem is solved. Since the audio/video data are voluminous, continuous and require reliable delivery, the Transmission Control Protocol (TCP) is chosen for their transport.
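The per-stream TCP design can be sketched with simple length-prefix framing; the framing scheme and the port layout are illustrative assumptions, not details given in the patent:

```python
import socket

def frame_chunk(payload: bytes) -> bytes:
    """Prefix a media chunk with its 4-byte big-endian length, so each
    TCP stream (one per medium) carries self-delimiting messages.
    This framing is an illustrative assumption."""
    return len(payload).to_bytes(4, "big") + payload

def unframe_chunk(buf: bytes):
    """Split one framed chunk off the front of a received byte stream;
    returns (payload, remainder)."""
    n = int.from_bytes(buf[:4], "big")
    return buf[4:4 + n], buf[4 + n:]

def open_media_sockets(host, audio_port, video_port):
    """One TCP connection per stream, as the module above describes;
    the host and ports are hypothetical."""
    return (socket.create_connection((host, audio_port)),
            socket.create_connection((host, video_port)))
```

Keeping one connection per medium lets each stream be pumped by its own thread without the two media blocking each other.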
The audio/video acquisition module of the present invention is generic and applicable to different hardware systems; no particular acquisition module is required. By default the system uses a USB camera as the video acquisition module and a microphone as the audio acquisition module.
The video signal uses the PAL standard, and the image-acquisition parameters can be adjusted through the camera's property pages. To ensure video fluency and good segmentation and tracking, the captured image size is set to 640 × 480, the color-image data format is 24-bit bitmap, the video frame rate defaults to 20 frames/second, and the video delay is 50 ms.
The audio signal is dual-channel with 16-bit quantization. The audio-buffer setting is important: too small a buffer hurts acquisition efficiency, while too large a buffer introduces noticeable delay, and it also interacts with the video frame rate and hence with audio/video synchronization. By measurement, the system defaults the buffer to 70 ms.
The system of the present invention places high demands on real-time performance: in general, the interval between audio/video input and output must not exceed 0.5 s. The system has few external devices, so execution speed is determined mainly by the computer's signal-processing speed. Because the algorithms used are of modest complexity, and techniques such as audio/video tracking simplify the processing, the overall system delay is kept under strict control and the real-time requirement is met.
Claims (6)
1. An automatically adjusted pharyngeal-cavity electrolarynx voice communication system, comprising a microphone, a camera, an electrolarynx vibrator (2), an audio/video acquisition module and a computer software/hardware system, the camera and microphone (3) being fixed on a microphone boom, a fixing band being provided below the earphone (1), and the electrolarynx vibrator being arranged in the fixing band (5), characterized in that:
the system comprises the following three main modules:
1) a facial and neck motion-image acquisition and processing module for the phonation process, which extracts visual speech characteristic parameters from the motion images;
2) a pharyngeal-cavity voice-source dynamic synthesis module, which converts the extracted visual speech characteristic parameters into voice-source synthesis model parameters and synthesizes the waveform according to the pharyngeal-cavity voice-source mathematical model;
3) a pharyngeal-cavity electrolarynx reconstructed-speech real-time enhancement and network communication module, which performs real-time enhancement on the collected pharyngeal-cavity electrolarynx reconstructed speech and transmits the processed speech over the telecommunication network, realizing the network communication function;
the camera transfers the collected motion images as input signals over a data line to the motion-image processing module for visual speech characteristic parameter extraction; the visual speech characteristic parameters output by the motion-image processing in turn enter the pharyngeal-cavity voice-source synthesis module as input signals to control waveform synthesis; the synthesized pharyngeal-cavity voice-source waveform is then output over a data line to the electrolarynx vibrator, which is placed at the pharyngeal region of the neck; the reconstructed pharyngeal-cavity electrolarynx speech, after collection by the microphone, is input over a data line to the speech-enhancement module, which also receives the control signals as input; the inputs of the communication module comprise the video signal collected by the camera and the enhanced speech signal, which are finally output through the network to the other client, while the audio/video signals sent by the other client are likewise received and played in the communication module;
audio/video signals are transmitted from the electrolarynx applying device to the computer for processing; the electrolarynx switch signal extracted by the computer and the synthesized pharyngeal-cavity voice-source signal are transferred to the electrolarynx applying device; and the electric energy needed for the applying device's operation is supplied by the computer.
2. A method of automatically adjusted pharyngeal-cavity electrolarynx voice communication, characterized in that: the audio and video acquisition modules start working simultaneously; the camera of the video acquisition module captures motion images of the user's face and neck during phonation as system input; the image-processing module pre-processes the input images to remove interference signals, then uses facial skin-color features for target-region positioning, segmentation, characteristic-parameter extraction, and characteristic-region motion tracking, obtaining the visual speech characteristic parameters related to phonation; the visual speech parameters are then converted, through the automatic synthesis-control relations, into the model parameters and switch signal required for pharyngeal-cavity voice-source synthesis, controlling the waveform synthesis and the vibration of the applying device; at the same time, the microphone of the audio acquisition module records the pharyngeal-cavity electrolarynx reconstructed-speech signal and, with reference to the switch control signal and the pharyngeal-cavity voice-source synthesis information, guides the estimation of the leakage periodic noise and the environmental noise and the adjustment of the spectral-subtraction parameters, performing spectral-subtraction enhancement on voiced frames; finally, the video images and the enhanced audio signal are integrated and sent through the network system module, received and played at the far end, realizing remote communication.
3. the method for the pharyngeal cavity electronic larynx voice communication according to claim 2 automatically adjusted, it is characterised in that:Described
Facial movement image procossing is mainly the change detection electronic larynx switching signal using lip shape characteristic reaction sounding start-stop, is had
Body step is as follows:
1) initiation parameter, a frame video image is gathered;
2) the lip color characteristic value of lip color filter computational rules rectangular extent is utilized, and is normalized to 0-255 gray levels, obtains lip
Color characteristic value image, if there is former frame, using former frame lip-region scope and colour of skin mean eigenvalue, instruct this frame
Calculate;
3) optimal segmenting threshold is calculated using improved maximum between-cluster variance Otsu methods, image binaryzation segmentation is carried out with this, obtained
To lip pre-segmentation image, if there is former frame, the calculating of this frame segmentation threshold is instructed using former frame segmentation threshold;
4) threshold area cancellation processing is carried out to lip pre-segmentation image, eliminates less picture noise and ambient interferences block;
5) profile is carried out to lip region and central point extracts, detected and matched using improved one-dimensional Hough Hough transform
The elliptical model parameters of lip, predominantly major and minor axis, while lip region scope is obtained, for instructing next frame lip color characteristic value
Calculate, if there is former frame, the Ellipse Matching of this frame is instructed using the major and minor axis of former frame;
6) differentiated using ratio of semi-minor axis length as mouth shape, by compared with threshold value, obtaining switch level signal, output is used as electronic larynx
Switch controlling signal.
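Step 6) reduces to a simple threshold test on the fitted ellipse axes. A sketch follows; the threshold value is a hypothetical stand-in for the one the patent sets by experiment:

```python
def electrolarynx_switch(major_axis, minor_axis, ratio_threshold=0.35):
    """Mouth-shape discrimination from the fitted lip ellipse.

    The minor/major axis ratio grows as the mouth opens; crossing the
    threshold turns the electrolarynx on.  The 0.35 threshold is an
    illustrative assumption, not a value from the patent.
    """
    return (minor_axis / major_axis) > ratio_threshold
```

Run per frame, this yields the binary switch level signal fed to the vibrator control.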
4. the method for the pharyngeal cavity electronic larynx voice communication according to claim 2 automatically adjusted, it is characterised in that:Described
Neck image processing is to extract voice source fundamental frequency, changes in amplitude control signal using the motor message of larynx upper neck region, tool
Body step is as follows:
1) initiation parameter, a frame video image is gathered;
2) the features of skin colors value of Complexion filter device computational rules rectangular extent is utilized, and is normalized to 0-255 gray levels, obtains lip
Color characteristic value image, if there is former frame, using former frame larynx upper neck region scope, instruct this frame computer capacity;
3) calculate optimal segmenting threshold using maximum between-cluster variance Otsu methods, and carry out image binaryzation segmentation, obtain face and
Neck area of skin color image, if there is former frame, the calculating of this frame segmentation threshold is instructed using former frame segmentation threshold;
4) threshold area cancellation processing is carried out to segmentation figure picture, eliminates less picture noise and ambient interferences block;
5) lip lower edge information is referred to, segmentation obtains the larynx since under lip into image the bottom of area of skin color
Neck target area, Save Range are used to instruct next frame features of skin colors value to calculate;
6) optical flow field in larynx low portion of neck region is calculated using the Lucas-Kanada differential methods, obtains reacting the speed point of motion feature
Measure information;
7) cluster analysis is carried out to optical flow field, it is calculated and each cluster centre distance for averagely obtaining, with this determination frequency, amplitude
Change, obtains frequency, changes in amplitude coefficient, and input as pharyngeal cavity voice source synthetic parameters.
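Steps 6)-7) can be sketched with a tiny two-centre clustering of the flow-speed magnitudes; the exact mapping from cluster distances to frequency and amplitude coefficients is an assumed simplification of the claim's description:

```python
import numpy as np

def flow_to_source_params(speeds, iters=10):
    """Cluster optical-flow speed magnitudes with two centres and map
    the result to crude voice-source control coefficients.

    The mapping (larger centre ~ amplitude coefficient, within-cluster
    spread ~ frequency-change coefficient) is an assumed sketch.
    """
    speeds = np.asarray(speeds, float)
    c = np.array([speeds.min(), speeds.max()])
    for _ in range(iters):
        # Assign each sample to its nearest centre, then recompute centres
        assign = np.abs(speeds[:, None] - c[None, :]).argmin(axis=1)
        for k in range(2):
            if np.any(assign == k):
                c[k] = speeds[assign == k].mean()
    spread = np.mean(np.abs(speeds - c[assign]))
    amp_coeff = c.max()
    freq_coeff = spread
    return freq_coeff, amp_coeff
```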
5. the method for the pharyngeal cavity electronic larynx voice communication according to claim 2 automatically adjusted, it is characterised in that:Using base
In the method for detecting human face of the colour of skin, using the cluster of the colour of skin, lip color characteristic value and features of skin colors value are calculated in YUV color spaces
To strengthen the discrimination of target area and background, target enters segmentation link after being strengthened, using maximum between-cluster variance Otsu
Method chooses optimal segmenting threshold, in order to be adapted for lip color and skin color segmentation, improves execution efficiency, has done following improvement:
1) solution of maximum between-cluster variance Otsu methods is not dependent on gray value or a certain color component of RGB color image, but
Lip color and features of skin colors value to each pixel normalize to gray level 0~255, and side between maximum kind is utilized on this gray-scale map
Poor Otsu methods seek optimal threshold T;
2) continuity of time-based continuity and changes of threshold, by the optimal segmenting threshold of previous frame image, and at it
The optimal segmenting threshold of this two field picture is searched in neighborhood, meets that segmentation requires, and improve and perform speed.
6. the method for the pharyngeal cavity electronic larynx voice communication according to claim 2 automatically adjusted, it is characterised in that:The ginseng
Several extractions use different methods for different genius locis:Detect to obtain matching mouth using one-dimensional Hough Hough transform
The elliptical model parameters of lip, the mouth shape characteristic parameter of lip-region is extracted, the control signal as pharyngeal cavity electronic larynx switch;Using
Optical flow method extracts the movable information characteristic parameter of neck area, is used as pharyngeal cavity electronic larynx voice source frequency and width by cluster analysis
The control signal of degree, pharyngeal cavity voice source automatically control synthesis, based on pharyngeal cavity voice source model, using from moving image
The visual speech characteristic parameter of extraction automatically adjusts the synthetic parameters of pharyngeal cavity voice source model, synthesizes pharyngeal cavity voice source waveform, leads to
Cross electronic larynx bringing device and exported by synthetic waveform and vibrated.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610466117.8A CN107545888A (en) | 2016-06-24 | 2016-06-24 | A kind of pharyngeal cavity electronic larynx voice communication system automatically adjusted and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107545888A true CN107545888A (en) | 2018-01-05 |
Family
ID=60960476
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN108596064A (en) * | 2018-04-13 | 2018-09-28 | 长安大学 | Method for detecting a driver's head-down phone-operating behavior based on multi-information fusion
CN114999461A (en) * | 2022-05-30 | 2022-09-02 | 中国科学技术大学 | Silent speech decoding method based on facial and neck surface electromyography
CN114999461B (en) * | 2022-05-30 | 2024-05-07 | 中国科学技术大学 | Silent speech decoding method based on facial and neck surface electromyography
CN116778888A (en) * | 2023-08-21 | 2023-09-19 | 山东鲁南数据科技股份有限公司 | Bionic pronunciation device
CN116778888B (en) * | 2023-08-21 | 2023-11-21 | 山东鲁南数据科技股份有限公司 | Bionic pronunciation device
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication | |
 | WD01 | Invention patent application deemed withdrawn after publication | |
Application publication date: 20180105 |