CN107545888A - Automatically adjusting pharyngeal cavity electronic larynx voice communication system and method - Google Patents
- Publication number: CN107545888A (application CN201610466117.8A)
- Authority: CN (China)
- Prior art keywords: pharyngeal cavity, electronic larynx, larynx, lip, signal
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The present invention relates to a self-adjusting pharyngeal cavity electronic larynx speech synthesis and communication system and method. Built on a computer software platform with external hardware including a camera, a microphone and an electronic larynx oscillator, the system extracts visual speech feature information from moving images of the user's face and neck and uses it to automatically control the working state of the electronic larynx and the synthesis of the pharyngeal cavity voice source. The electronic larynx thus requires no hand-holding and is simpler and more convenient to use, and the problems of the mismatch between the synthesized voice source and the position at which the electronic larynx is applied, and of the mechanical, unnatural quality of electronic larynx speech, are solved. The reconstructed pharyngeal cavity electronic larynx speech is further subjected to dynamic denoising and enhancement processing, improving its quality and intelligibility, and remote real-time communication of electronic larynx speech is realized through network transmission technology, further extending the applications of the electronic larynx and improving the quality of life of laryngectomees.
Description
Technical field
The invention belongs to the technical field of pathological speech reconstruction and speech communication, and in particular relates to a pharyngeal cavity electronic larynx voice communication system and method capable of automatic adjustment.
Background technology
In China, a large number of patients lose the ability to phonate every year because of laryngectomy. The electronic larynx of the prior art is widely used because of its broad applicability, simple operation, capacity for prolonged phonation, and easily understood output. However, current electronic larynx speech is unnatural, the device is inconvenient to use, and the radiated background noise together with a large component of ambient noise seriously affects the intelligibility and pleasantness of the speech.
The electronic larynx currently used at home and abroad is mainly of the neck-external type. Its operating principle is that a waveform generator supplies a glottal voice source waveform to drive a transducer to vibrate. During use, however, the electronic larynx is applied not at the glottis but at the pharyngeal cavity position on either side of the neck. The modulating effect of the vocal tract section between the glottis and the pharyngeal cavity is therefore ignored, which distorts the reconstructed speech and impairs the use of electronic larynx speech.
How to improve electronic larynx speech so that the voice source frequency is adjusted automatically according to the needs of speech and language has been a focus of research at home and abroad in recent years. Existing approaches include regulating the oscillation frequency of the electronic larynx through finger pressure on a piezoresistive component, and adjusting the frequency and intensity of speech by controlling expiratory airflow and vocal cord tension. In 2004, E. A. Goldstein and colleagues at Harvard University proposed controlling the electronic larynx switch with features of the neck electromyographic signal, with good results. However, these methods all suffer from the disadvantages of being difficult to use, requiring complex training, and being costly.
With the development and popularization of computer and network technologies, the electronic larynx also needs to meet the demands of networking, yet no electronic larynx specifically adapted to network communication has so far been reported.
Content of the invention
In view of the above disadvantages of prior-art electronic larynges in application, namely difficulty of use, complex training methods and high cost, the present invention provides a self-adjusting pharyngeal cavity electronic larynx voice communication system and method. The system is based on a computer hardware platform and, through software, realizes pharyngeal cavity voice source synthesis that is automatically adjusted according to facial and neck motion features, so that the electronic larynx is easy to use without being hand-held. It also integrates enhancement processing of the reconstructed pharyngeal cavity electronic larynx speech, and realizes networked real-time communication of electronic larynx speech through Internet technology, further extending the functions of the electronic larynx.
An automatically adjusting pharyngeal cavity electronic larynx voice communication system comprises a microphone, a camera, an electronic larynx oscillator, an audio-video acquisition module and a computer hardware and software system. The camera and the microphone are fixed on a microphone bracket, a fixing band is provided below the earphone, and the electronic larynx oscillator is arranged on the fixing band. The system further comprises the following three main modules:
1) a module for acquiring and processing images of face and neck motion during phonation, which extracts visual speech feature parameters from the moving images;
2) a pharyngeal cavity voice source dynamic synthesis module, which converts the extracted visual speech feature parameters into voice source synthesis model parameters and synthesizes the waveform according to a pharyngeal cavity voice source mathematical model;
3) a module for real-time enhancement and network communication of the reconstructed pharyngeal cavity electronic larynx speech, which applies real-time enhancement processing to the collected speech and transmits the processed speech over the network to realize remote communication.
The camera transfers the collected moving images as an input signal through a data cable to the moving image processing module for visual speech feature parameter extraction. The visual speech feature parameters output by the image processing in turn enter the pharyngeal cavity voice source synthesis module as input signals to control waveform synthesis. The synthesized pharyngeal cavity voice source waveform is then output through a data cable to the electronic larynx oscillator, which is placed at the pharyngeal cavity of the neck. The reconstructed pharyngeal cavity electronic larynx speech, collected by the microphone device, is input through a data cable to the speech enhancement module, which also receives control signal input. The input of the communication module comprises two parts, the video signal collected by the camera and the enhanced speech signal, which are finally output through the network to another client; the audio-video signals sent by the other client are likewise received and played in the communication module.
Audio-video signals are transmitted from the electronic larynx wearing device to the computer for processing; the electronic larynx switching signal extracted by the computer and the synthesized pharyngeal cavity voice source signal are transmitted to the electronic larynx wearing device; and the electric energy required for the operation of the wearing device is supplied by the computer.
In a method of automatically adjusted pharyngeal cavity electronic larynx voice communication, the audio and video acquisition modules start working simultaneously. The camera of the video acquisition module obtains moving images of the user's face and neck during phonation as system input. The image processing module preprocesses the input images to remove interference signals, then performs target region localization, segmentation, feature parameter extraction and tracking of the motion of the feature regions, obtaining visual speech feature parameters related to phonation. The automatic control synthesis system then derives, through mapping relations, the model parameters and switching signal required for pharyngeal cavity voice source synthesis from the visual speech parameters, and controls the synthesis of the pharyngeal cavity voice source waveform and the vibration of the wearing device. At the same time, the microphone of the audio acquisition module records the reconstructed pharyngeal cavity electronic larynx speech signal. Guided by the switching control signal and the pharyngeal cavity voice source synthesis signal, the leaked periodic noise and ambient noise are estimated, the spectral subtraction parameters are adjusted, and spectral subtraction speech enhancement is applied to the voiced frames. Finally, the video images and the enhanced audio signal are integrated and then sent, received and played locally through the network module, realizing remote communication.
The system and method of the present invention extract visual speech feature information from moving images of the user's face and neck and use it to automatically control the working state of the electronic larynx and the synthesis of the pharyngeal cavity voice source. No hand-holding is required during use, making the device simpler and more convenient, and the problems of the mismatch between the synthesized voice source and the application position of the electronic larynx and of the unnaturalness of electronic larynx speech are solved. Dynamic denoising and enhancement of the reconstructed pharyngeal cavity electronic larynx speech improve its quality and intelligibility, and network transmission technology realizes remote real-time communication of electronic larynx speech, extending the applications of the electronic larynx and improving the quality of life of laryngectomees.
Brief description of the drawings
Fig. 1 is a structural representation of the pharyngeal cavity electronic larynx speech synthesis and communication system of the present invention.
Fig. 2 is a flow chart of the moving image processing routine of the present invention.
Fig. 3 is a continuous speech waveform diagram according to the present invention.
Fig. 4 is a comparison diagram of the lip feature curve (solid line), threshold (dashed line) and switching signal (dotted line) corresponding to the continuous speech of Fig. 3.
Fig. 5 is a schematic diagram of pharyngeal cavity voice source synthesis according to the present invention.
Fig. 6 is an external view of the electronic larynx wearing device of the present invention, wherein the labels respectively denote: earphone 1; electronic larynx oscillator 2; camera and microphone 3; connecting cable 4; fixing band 5.
Fig. 7 is a flow chart of the pharyngeal cavity electronic larynx speech enhancement processing of the present invention.
Embodiment
The present invention is described in further detail below in conjunction with the accompanying drawings.
The present invention is based on a computer hardware system. Audio-video acquisition devices such as a microphone and a camera collect in real time the moving images of the user's face and neck during phonation and the reconstructed pharyngeal cavity electronic larynx speech. System software designed by computer programming realizes functions such as visual speech feature parameter extraction and pharyngeal cavity voice source synthesis, completing the automatically controlled synthesis of the pharyngeal cavity electronic larynx voice source waveform, which is then output as vibration by an oscillator applied at the pharyngeal cavity of the neck. The reconstructed speech is collected and processed by speech enhancement, and the function of remote communication is finally realized by the network communication module.
The structure of the whole system is shown in Fig. 1. The image acquisition module is connected to the automatic control module through the image processing module, while the image acquisition module is also bidirectionally connected to the external network through the communication module; the voice acquisition module and the speech enhancement module are bidirectionally connected to the external network through the communication module; and the automatic control module is connected to the electronic larynx wearing device through the voice source synthesis module. After the system starts, the audio and video acquisition modules begin working simultaneously. The camera of the video acquisition module obtains moving images of the user's face and neck during phonation as system input. The image processing module preprocesses the input images to remove interference signals, then performs target region localization, segmentation, feature parameter extraction and tracking of feature region motion, obtaining visual speech feature parameters related to phonation. The automatic control synthesis system then derives, through mapping relations, the model parameters and switching signal required for pharyngeal cavity voice source synthesis from the visual speech parameters, and controls the synthesis of the pharyngeal cavity voice source waveform and the vibration of the wearing device. At the same time, the microphone of the audio acquisition module records the reconstructed pharyngeal cavity electronic larynx speech signal; guided by the switching control signal and the voice source synthesis signal, the leaked periodic noise and ambient noise are estimated, the spectral subtraction parameters are adjusted, and spectral subtraction enhancement is applied to the voiced frames. Finally, the video images and the enhanced audio signal are integrated and then sent, received and played locally through the network module, realizing remote communication.
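The networked sending and receiving described above runs over a TCP connection (the communication module is stated later to be TCP-based). As a hedged sketch only, the integrated audio and video payloads could be carried over the byte stream with a minimal length-prefixed framing scheme such as the following; the function names, channel tags and 4-byte header are illustrative assumptions, not details from the patent.

```python
import struct

HEADER = struct.Struct(">BI")  # 1-byte channel tag + 4-byte big-endian payload length
CH_AUDIO, CH_VIDEO = 0, 1

def pack_frame(channel: int, payload: bytes) -> bytes:
    """Prefix a payload with its channel tag and length so frame boundaries
    survive TCP's byte-stream semantics."""
    return HEADER.pack(channel, len(payload)) + payload

def unpack_frames(buffer: bytes):
    """Extract complete (channel, payload) frames from a receive buffer and
    return them together with any leftover partial bytes."""
    frames, offset = [], 0
    while offset + HEADER.size <= len(buffer):
        channel, length = HEADER.unpack_from(buffer, offset)
        if offset + HEADER.size + length > len(buffer):
            break  # incomplete frame; wait for more data
        start = offset + HEADER.size
        frames.append((channel, buffer[start:start + length]))
        offset = start + length
    return frames, buffer[offset:]
```

In such a scheme the sender would call `pack_frame` for each enhanced audio block and each video frame, and the receiver would feed arriving bytes into `unpack_frames` before playback.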
The first module of the present invention consists of the face and neck moving image acquisition and processing module. Starting from visual speech features, it collects moving images of the face and neck during phonation with the camera and takes the video signal as system input. Through preprocessing, target region detection and localization, and feature region segmentation and tracking, it extracts the lip and neck visual speech feature parameters reflecting phonation features, including the degree of lip opening and closing and the neck motion signal, and outputs them to guide the synthesis of the pharyngeal cavity voice source, realizing real-time automatic adjustment of electronic larynx speech.
The second module of the present invention consists of the automatic control module, the dynamically adjustable pharyngeal cavity voice source synthesis module and the electronic larynx wearing device. Taking the extracted lip and neck visual speech feature parameters as input, the module converts them, through the correspondence between visual features and phonation features, into the corresponding pharyngeal cavity voice source model parameters, including the switching signal controlling electronic larynx synthesis, the voice source pitch variation parameters, and the supraglottal vocal tract shape parameters. From these parameters the pharyngeal cavity voice source waveform is dynamically synthesized according to a source-filter model, output through the external electronic larynx oscillator, and applied at the pharyngeal cavity position of the neck. To address the inconsistency between the application position and the synthesized voice source, the module takes into account, when synthesizing the voice source, the modulating effect of the vocal tract section from the glottis to the pharyngeal cavity, providing a pharyngeal cavity voice source waveform consistent with the application position.
The third module of the present invention consists of the voice acquisition module and the real-time enhancement and communication module for the reconstructed pharyngeal cavity electronic larynx speech. Speech enhancement is based on adjustable-parameter spectral subtraction. The voice source synthesis signal serves as a reference to guide the estimation of the noise radiated by the electronic larynx; according to the background noise characteristics of pharyngeal cavity electronic larynx speech, suitable spectral subtraction coefficients are dynamically selected. Combined with the electronic larynx switching control signal, enhancement is applied selectively to voiced speech, silent frames are output as silence, and the background noise estimate is updated at the same time. Network communication is based on the Transmission Control Protocol (TCP); the client computer has three working units, for sending audio-video signals, receiving audio-video signals, and playing audio-video signals locally, finally realizing audio-video communication of electronic larynx speech.
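As a hedged sketch of the adjustable-parameter spectral subtraction described above (the frame length, over-subtraction factor, spectral floor and smoothing constant are illustrative assumptions, not values from the patent), one voiced frame could be enhanced, and the noise estimate updated on silent frames, as follows:

```python
import numpy as np

def spectral_subtract(frame, noise_mag, alpha=2.0, beta=0.01):
    """Subtract a scaled noise magnitude estimate from one voiced frame.

    frame     : time-domain samples of a voiced frame
    noise_mag : magnitude spectrum estimated from silent frames, guided by
                the voice source reference signal
    alpha     : over-subtraction coefficient (dynamically selectable)
    beta      : spectral floor that limits musical noise
    """
    spec = np.fft.rfft(frame)
    mag, phase = np.abs(spec), np.angle(spec)
    clean = np.maximum(mag - alpha * noise_mag, beta * mag)  # floor, never zero out
    return np.fft.irfft(clean * np.exp(1j * phase), n=len(frame))

def update_noise(noise_mag, silent_frame, smooth=0.9):
    """Recursively update the background noise estimate during silent frames."""
    return smooth * noise_mag + (1 - smooth) * np.abs(np.fft.rfft(silent_frame))
```

The switching control signal would decide per frame whether `spectral_subtract` (voiced frame) or `update_noise` followed by silent output (unvoiced frame) is applied.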
The software part of the system uses streaming media development techniques. The whole software design is divided into a three-layer structure separating user interface, control logic and data; the modular design keeps the functional modules independent of one another, with low coupling.
The implementation flow of the image processing part is shown in Fig. 2. Each input video frame first undergoes preprocessing to eliminate the influence of background noise, slow motions (such as breathing and swallowing) and illumination-related interference. The processed image is then subjected to skin-color-based face detection: skin-color filters in different color spaces are selected to obtain skin-color space images of the lips, face and neck. In each skin-color space the optimal threshold is obtained with an improved maximum between-class variance (Otsu) method, yielding pre-segmentation images of the lips, face and neck. Because of influences such as illumination and skin color, the pre-segmented images may contain small, scattered interference blobs; a threshold-area elimination method removes the smaller interference blobs while retaining the larger target regions. Different feature parameters are then extracted for the different feature parts, yielding the different control signals.
The processing of the face image mainly uses changes in the lip shape feature to detect the electronic larynx switching signal marking the start and stop of phonation. The specific steps are as follows:
1) initialize parameters and capture one video frame;
2) compute the lip-color feature values within a specified rectangular range using a lip-color filter, and normalize them to 0-255 gray levels to obtain a lip-color feature image. If a previous frame exists, its lip region range and mean feature value guide the computation for this frame;
3) compute the optimal segmentation threshold with the improved maximum between-class variance (Otsu) method and binarize the image with it, obtaining the lip pre-segmentation image. If a previous frame exists, its segmentation threshold guides the calculation of this frame's threshold;
4) apply threshold-area elimination to the lip pre-segmentation image, removing small image noise and background interference blobs;
5) extract the contour and center point of the lip region, and obtain the parameters of the ellipse matching the lips, chiefly the semi-major and semi-minor axes, by improved one-dimensional Hough transform detection; the lip region range is obtained at the same time to guide the lip-color feature computation of the next frame. If a previous frame exists, its semi-axes guide the ellipse matching of this frame;
6) use the ratio of the semi-axes as the mouth-shape criterion; comparison with a threshold yields the switch level signal, which is output as the electronic larynx switching control signal.
The processing of the neck image mainly uses the motion signal of the neck region above the larynx to extract the control signals for voice source fundamental frequency and amplitude variation. The specific steps are as follows:
1) initialize parameters and capture one video frame;
2) compute the skin-color feature values within a specified rectangular range using a skin-color filter, and normalize them to 0-255 gray levels to obtain a skin-color feature image. If a previous frame exists, its laryngeal neck region range guides the computation range of this frame;
3) compute the optimal segmentation threshold with the maximum between-class variance (Otsu) method and binarize the image, obtaining the face and neck skin-color region image. If a previous frame exists, its segmentation threshold guides the calculation of this frame's threshold;
4) apply threshold-area elimination to the segmented image, removing small image noise and background interference blobs;
5) with reference to the lower lip edge information, segment the laryngeal neck target region extending from below the lips to the bottom of the skin-color region in the image, and save its range to guide the skin-color feature computation of the next frame;
6) compute the optical flow field of the lower laryngeal neck region with the Lucas-Kanade differential method, obtaining velocity component information reflecting the motion features;
7) perform cluster analysis on the optical flow field, compute the distance between its average and each cluster center, determine frequency and amplitude changes from these distances, and obtain the frequency and amplitude variation coefficients as input to the pharyngeal cavity voice source synthesis parameters.
The system adopts skin-color-based face detection: exploiting the clustering property of skin color, it computes lip-color and skin-color feature values in the YUV color space to strengthen the discrimination between target regions and background.
After the target is strengthened, segmentation follows. The system selects the optimal segmentation threshold with an improved maximum between-class variance (Otsu) method. To adapt it to lip-color and skin-color segmentation and to improve execution efficiency, the following improvements are made in the system:
1) the Otsu solution does not depend on the gray value or on a single color component of the RGB image; instead, the lip-color and skin-color feature values of each pixel are normalized to gray levels 0-255, and the optimal threshold T is sought on this gray-scale map with the Otsu method;
2) based on temporal continuity and the continuity of threshold variation, the optimal segmentation threshold of the previous frame is taken as a starting point and the optimal threshold of the current frame is searched within its neighborhood, which not only satisfies the segmentation requirements but also improves execution speed.
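The improved thresholding above can be sketched as follows; this is a minimal illustration of Otsu's criterion with the frame-to-frame neighborhood search, where the search radius of 15 gray levels is an assumption for illustration only.

```python
import numpy as np

def otsu_threshold(gray, lo=0, hi=255):
    """Maximum between-class variance (Otsu) threshold on a 0-255 feature map,
    optionally restricted to the search window [lo, hi]."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    best_t, best_var = lo, -1.0
    for t in range(lo, hi + 1):
        w0 = hist[:t + 1].sum()          # background weight
        w1 = total - w0                  # foreground weight
        if w0 == 0 or w1 == 0:
            continue
        m0 = (np.arange(t + 1) * hist[:t + 1]).sum() / w0
        m1 = (np.arange(t + 1, 256) * hist[t + 1:]).sum() / w1
        var = w0 * w1 * (m0 - m1) ** 2   # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def tracked_threshold(gray, prev_t=None, radius=15):
    """Search only a neighborhood of the previous frame's threshold,
    keeping the threshold curve smooth between frames."""
    if prev_t is None:
        return otsu_threshold(gray)
    return otsu_threshold(gray, max(0, prev_t - radius), min(255, prev_t + radius))
```

The first frame falls back to a full-range search; subsequent frames reuse the previous threshold as the center of a narrow search window.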
Noise reduction uses the area-threshold elimination method: noise and interference blobs are removed and the target regions are retained. The area threshold is set to one fiftieth of the size of the rectangular tracking frame.
After image denoising, the accurate lip and neck target regions are obtained, meeting the requirements of the feature parameter extraction algorithms. Different extraction methods are used for the different feature parts: the lip region mainly uses the mouth-shape feature, so an ellipse detection method is used; the neck region mainly uses the motion feature, so the optical flow method is used to extract velocity information.
In general, five parameters determine an ellipse: the center coordinates, the semi-major and semi-minor axes, and the angle between the major axis and the X axis. The present invention uses only the outer contour shape information of the lips and, in consideration of the real-time requirement, assumes that the major axis forms a 0-degree angle with the X axis; the ellipse center coordinates are approximated by averaging the lip contour points. Only the two parameters, semi-major axis a and semi-minor axis b, remain, and the optimal parameters are obtained with a one-dimensional Hough transform, which greatly improves efficiency while meeting the requirements.
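Under the stated assumptions (axis-aligned ellipse, center from the contour mean), the one-dimensional vote can be sketched as below; taking the semi-major axis directly from the horizontal extent and using 50 histogram bins are simplifying assumptions of this sketch, not details from the patent.

```python
import numpy as np

def fit_lip_ellipse(points):
    """Estimate (a, b) of an axis-aligned ellipse from lip contour points by a
    one-dimensional Hough vote on the semi-minor axis b.

    points : (N, 2) array of (x, y) contour coordinates
    """
    pts = np.asarray(points, dtype=float)
    cx, cy = pts.mean(axis=0)                # center approximated by the contour mean
    dx, dy = pts[:, 0] - cx, pts[:, 1] - cy
    a = np.abs(dx).max()                     # semi-major axis from horizontal extent
    ratio = np.clip(dx / a, -0.999, 0.999)   # avoid division blow-up near x = +/- a
    b_votes = np.abs(dy) / np.sqrt(1.0 - ratio ** 2)  # each point votes for one b
    hist, edges = np.histogram(b_votes, bins=50)
    k = np.argmax(hist)
    b = 0.5 * (edges[k] + edges[k + 1])      # peak of the 1-D accumulator
    return a, b
```

Because only b is accumulated, the vote is one-dimensional, which is what makes the detection fast enough for per-frame use.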
According to the extracted ellipse shape parameters, the present invention selects the ratio b/a of the semi-minor to the semi-major axis as the judgment index. Fig. 3 shows a continuous speech waveform, and Fig. 4 compares the corresponding lip feature curve (solid line), threshold (dashed line) and switching signal (dotted line). It can be seen that the b/a value has good shape invariance: it overcomes misjudgments caused by changes of lip size in the image due to capture distance, and accurately reflects mouth-shape changes. The judgment signal obtained from it fits the speech waveform well, and the judgment accuracy is high. For continuous speech a delay mode is used, which removes the OFF signals caused by pauses between words, so that the ON signal is maintained during speech and an OFF signal appears only after a long pause, matching electronic larynx usage habits.
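The threshold comparison with the delay mode described above can be sketched as a simple hold-off state machine; the threshold value and hold length below are illustrative assumptions, not values from the patent.

```python
import numpy as np

def switch_signal(ba_ratio, threshold=0.25, hold_frames=10):
    """Turn the per-frame b/a mouth-shape ratio into an ON/OFF switch signal.

    The switch turns ON as soon as b/a exceeds the threshold, and turns OFF
    only after `hold_frames` consecutive sub-threshold frames, so that short
    between-word pauses do not interrupt the electronic larynx.
    """
    out, state, below = [], 0, 0
    for r in ba_ratio:
        if r > threshold:
            state, below = 1, 0          # mouth open: ON, reset pause counter
        else:
            below += 1
            if below >= hold_frames:
                state = 0                # long pause: OFF
        out.append(state)
    return np.array(out)
```

Short pauses (fewer frames than the hold length) are bridged over, while a sustained pause produces the OFF signal.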
The present invention extracts the small motions of the neck with the Lucas-Kanade differential method. Taking the target pixel as the center, a suitable neighborhood is chosen and the optical flow of the pixel is computed with the Lucas-Kanade equations over the whole neighborhood; computing the whole image in the same way yields the optical flow field of the whole image.
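A minimal single-pixel version of this computation, assuming small displacements and the simple gradient estimates below (window size and gradient scheme are illustrative choices), could look like:

```python
import numpy as np

def lucas_kanade(prev, curr, y, x, win=7):
    """Estimate the optical flow (vx, vy) at pixel (y, x) by solving the
    Lucas-Kanade least-squares system over a (win x win) neighborhood."""
    half = win // 2
    Iy, Ix = np.gradient(prev.astype(float))       # spatial gradients (axis 0 = y)
    It = curr.astype(float) - prev.astype(float)   # temporal gradient
    sl = (slice(y - half, y + half + 1), slice(x - half, x + half + 1))
    A = np.stack([Ix[sl].ravel(), Iy[sl].ravel()], axis=1)
    b = -It[sl].ravel()
    v, *_ = np.linalg.lstsq(A, b, rcond=None)      # least-squares solve of A v = b
    return v  # (vx, vy)
```

Sweeping this over all pixels of the neck region yields the dense optical flow field used by the cluster analysis below; a practical implementation would of course reuse the gradient images rather than recompute them per pixel.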
The motion of the neck image contains information about frequency change. Based on experimental statistics, cluster analysis of the optical flow according to frequency change yields two typical clusters, a frequency-rise cluster and a frequency-fall cluster. The optical flow field extracted from each frame is compared by distance with the cluster templates: when the distance is below a certain limit, a rise or fall is recognized; otherwise the frequency is considered unchanged. The result is output as the frequency change parameter.
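The per-frame distance judgment against the two cluster templates can be sketched as follows; the template vectors and distance limit here are illustrative placeholders, whereas in the described system they would come from experimental statistics.

```python
import numpy as np

# Illustrative cluster templates for the mean optical flow vector (vx, vy);
# image y increases downward, so upward neck motion has negative vy.
RISE_TEMPLATE = np.array([0.0, -1.0])
FALL_TEMPLATE = np.array([0.0, 1.0])

def frequency_change(flow_field, limit=0.5):
    """Classify one frame's flow field as frequency rise (+1), fall (-1),
    or unchanged (0) by distance to the cluster templates.

    flow_field : (N, 2) array of per-pixel flow vectors for the neck region
    """
    mean_flow = np.asarray(flow_field, dtype=float).mean(axis=0)
    d_rise = np.linalg.norm(mean_flow - RISE_TEMPLATE)
    d_fall = np.linalg.norm(mean_flow - FALL_TEMPLATE)
    if min(d_rise, d_fall) > limit:
        return 0            # neither template matched: frequency unchanged
    return 1 if d_rise < d_fall else -1
```

The returned value would then drive the fundamental frequency adjustment of the voice source synthesis parameters.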
A complete video signal contains spatial and temporal information, corresponding respectively to intra-frame and inter-frame information. Based on the assumption that the face and neck change slowly and continuously during speech, the image processing of the present invention uses a real-time tracking control method combining the spatial and temporal domains: the segmentation information of the previous frame guides the segmentation of the target region in the current frame. This makes good use of both intra-frame and inter-frame information, compensating for the inaccuracy of static image segmentation and increasing segmentation speed.
The tracking control method is mainly embodied in the following aspects of the system of the present invention:
1) In feature region detection, the lip and neck target region ranges obtained in the previous frame guide the setting of the detection range of the current frame; this reduces the size of the processed image and removes part of the background interference, improving subsequent processing.
2) When the maximum between-class variance (Otsu) method solves for the segmentation threshold, the optimal threshold of the previous frame narrows the threshold search range of the current frame; this reduces computation, avoids locally optimal segmentation thresholds and abrupt threshold jumps between frames, and ensures a smooth threshold curve.
3) In one-dimensional Hough transform ellipse detection, the semi-minor axis b of the previous frame narrows the search range of b in the current frame, ensuring tracking continuity and preventing abrupt transitions in the Hough transform itself; meanwhile, a correction criterion is set so that if the b/a value does not fall within the normal range of mouth-shape ratios, the result is discarded and the previous frame's result is kept.
While meeting the real-time requirement, the image processing part of the present invention successfully extracts the various speech synthesis parameter control signals from the video signal; these control signals automatically adjust the synthesis of the pharyngeal cavity voice source and assist the enhancement processing of the reconstructed speech.
The automatically controlled synthesis of the pharyngeal cavity voice source is guided by the pharyngeal cavity voice source model: the visual speech feature parameters extracted from the moving images automatically adjust the synthesis parameters of the model, thereby achieving automatically controlled synthesis of the pharyngeal cavity voice source waveform, which is finally output as vibration through the electronic larynx wearing device.
Pharyngeal cavity voice source waveform synthesis in the present invention uses a source-filter model. As shown in Fig. 5, first the parametric model of the glottal voice source is used: according to the acquisition system parameters, the extracted switching signal and model parameter signals, and the user parameters, the model parameter values are adjusted and set, and the glottal voice source waveform is synthesized according to the mathematical model. Second, using a single-tube model of uniform cross-sectional area, the vocal tract model parameters are adjusted according to the control signals, the frequency response function of the supraglottal vocal tract is synthesized, and the glottal voice source waveform is modulated by it to finally synthesize the pharyngeal cavity voice source.
The glottal voice source is synthesized with a piecewise parametric model, expressed mathematically as follows:

$$u_g(i)=\begin{cases}A\sin\left(\dfrac{\pi i}{n_1}\right), & 0<i\le n_1\\[4pt] -A\sin\left(\dfrac{\pi\,(i-n_1)}{2n_2}\right), & n_1<i\le n_1+n_2\\[4pt] -A\,\alpha\,\tau_{\mathrm{sup}}^{\,i-n_1-n_2}\cos\left(\dfrac{2\pi\lambda\,(i-n_1-n_2)}{N}\right), & n_1+n_2<i\le N=n_1+n_2+n_3\end{cases}$$
wherein τ_sup is the supraglottal damped oscillation coefficient and α is the closed-phase amplitude attenuation factor, both set according to experiment; n₁, n₂ and n₃ are the shape parameters of one period of the voice source waveform, representing respectively the lengths of the open-phase ascending segment, the open-phase descending segment and the closed phase, with their ratio set according to the phonation mode; N is the period length, i.e. N = n₁ + n₂ + n₃; A is the amplitude control; and λ is the ratio of the first supraglottal vocal tract formant frequency F₁ to the fundamental frequency f₀. These values are dynamically adjusted according to the extracted control signals.
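A direct rendering of the piecewise model above is sketched below. The segment lengths, attenuation constants and λ are illustrative values only, and the equation itself is reconstructed from a garbled original, so this should be read as one plausible realization rather than the patent's exact waveform.

```python
import numpy as np

def glottal_source(N_periods=3, n1=30, n2=20, n3=50, A=1.0,
                   alpha=0.9, tau=0.95, lam=4.0):
    """Synthesize the piecewise parametric glottal voice source waveform.

    n1, n2, n3 : open-phase ascending, open-phase descending, closed-phase lengths
    alpha      : closed-phase amplitude attenuation factor
    tau        : supraglottal damped oscillation coefficient
    lam        : ratio F1 / f0 of the supraglottal model
    """
    N = n1 + n2 + n3
    i = np.arange(1, N + 1)
    u = np.empty(N)
    seg1 = i <= n1
    seg2 = (i > n1) & (i <= n1 + n2)
    seg3 = i > n1 + n2
    u[seg1] = A * np.sin(np.pi * i[seg1] / n1)                 # open-phase lobe
    u[seg2] = -A * np.sin(np.pi * (i[seg2] - n1) / (2 * n2))   # descending branch
    k = i[seg3] - n1 - n2
    u[seg3] = -A * alpha * tau ** k * np.cos(2 * np.pi * lam * k / N)  # damped closed-phase oscillation
    return np.tile(u, N_periods)
```

Changing n1:n2:n3 changes the phonation mode, while A, λ and the period length N are the quantities the control signals adjust frame by frame.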
Because the supraglottal vocal tract from the glottis to the pharyngeal cavity is short, it can be approximated as a single tube of uniform cross-sectional area, whose frequency response function and formant frequencies are:

$$H(f)=\frac{1}{\cos(2\pi f l/c)}$$

$$F_n=\frac{(2n-1)c}{4l}=(2n-1)F_1 \quad (n=1,2,3,\dots)$$

wherein l is the vocal tract length, which can be dynamically adjusted within a small range by the control parameters, and c is the speed of sound. According to the above formula, a change in l shifts the first supraglottal formant, and the voice source synthesis parameter λ is adjusted at the same time.
In the present invention the dynamic adjustment of model parameters such as the fundamental frequency f₀, the amplitude A and the vocal tract length l takes the previous frame's value as the reference, with appropriate adjustments made according to the control signals. For the first frame, initial values are used: the initial fundamental frequency f₀ is set according to the average fundamental frequency for the user's sex, the amplitude A can also be set by the user according to the effect, and the vocal tract length l is set according to the experimental average. Finally, the pharyngeal cavity voice source waveform is obtained by modulating the glottal voice source waveform with the supraglottal vocal tract.
The synthesized pharyngeal-cavity voice-source waveform is output as vibration by the electrolarynx applying device, which is placed against the pharyngeal region below the neck; its appearance is shown in Fig. 6. The overall design resembles a headset: the camera and microphone are fixed on the microphone boom, a fixing band is provided below the earphone, and the electrolarynx vibrator is mounted on it. All components are integrated into one frame, so in use the device is simply fastened at the required position, with no need to hold it by hand. The position of the electrolarynx vibrator on the connecting band is adjustable to suit different users.
The whole electrolarynx applying device is connected to the computer system through a standard universal serial bus (USB) interface for signal transfer, which covers three aspects: first, audio and video signals are transmitted from the applying device to the computer for processing; second, the electrolarynx switch signal extracted by the computer and the synthesized pharyngeal-cavity voice-source signal are transferred to the applying device; third, the electric energy needed for the applying device's operation is supplied by the computer.
The specific flow of pharyngeal-cavity electrolarynx speech enhancement in the present invention is shown in Fig. 7. The method is based on spectral subtraction with adjustable parameters: the switch signal decides whether a frame contains electrolarynx speech; silent frames are output as silence while the environmental-noise estimate is updated, and voiced frames are enhanced by adjustable-parameter power-spectral subtraction, eliminating the leaked periodic noise and environmental noise carried in the speech and improving the signal-to-noise ratio, subjective intelligibility and pleasantness of the reconstructed speech.
The pharyngeal-cavity electrolarynx speech enhancement method assumes that the periodic background noise, the environmental noise and the reconstructed speech are all short-time stationary and mutually uncorrelated, and performs parametric power-spectral subtraction in the frequency domain:

|Ŝ(ω)|² = |Y(ω)|² − α·|N̂(ω)|²,  if |Y(ω)|² > thres·|N̂(ω)|²
|Ŝ(ω)|² = β·|N̂(ω)|²,  otherwise

where Y(ω), S(ω) and N(ω) are the spectra of the noisy speech, the clean speech and the noise respectively; thres is a threshold coefficient whose value is set by experimental statistics; α is the adjustable over-subtraction parameter and β the spectral smoothing (floor) coefficient, whose values are adjusted dynamically according to the ratio of noisy-speech energy to estimated noise energy. That is, setting

γ = |Y(ω)|² / |N̂(ω)|²,

the subtraction coefficients are adjusted as follows:
α = 1 + γ/k1,  β = γ/k2

where the two coefficients k1 and k2 are set by statistical experiment.
The clean-speech estimate is then:

ŝ(t) = IFFT[ |Ŝ(ω)| · e^(j·arg Y(ω)) ]
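A one-frame sketch of adjustable-parameter power-spectral subtraction along the lines described above; the floor rule, the values of k1 and k2, and the omission of the thres test are simplifying assumptions, not the patent's exact procedure:

```python
import numpy as np

def parametric_spectral_subtraction(frame, noise_psd, k1=20.0, k2=50.0):
    """One frame of parametric power-spectral subtraction.

    alpha = 1 + gamma/k1 (over-subtraction), beta = gamma/k2 (spectral
    floor), with gamma the frame-to-noise energy ratio.  k1, k2 and the
    floor rule are illustrative assumptions.
    """
    Y = np.fft.fft(frame)
    Ypow = np.abs(Y) ** 2
    gamma = Ypow.sum() / max(noise_psd.sum(), 1e-12)
    alpha = 1.0 + gamma / k1
    beta = gamma / k2
    # Over-subtract the noise estimate, clamped to a spectral floor
    Spow = Ypow - alpha * noise_psd
    floor = beta * noise_psd
    Spow = np.where(Spow > floor, Spow, floor)
    # Recombine the magnitude estimate with the noisy phase
    S = np.sqrt(Spow) * np.exp(1j * np.angle(Y))
    return np.fft.ifft(S).real
```

With a zero noise estimate the frame passes through unchanged, which is a convenient sanity check.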
The most critical part of spectral-subtraction enhancement is noise estimation. The system uses the switch control signal, the voice-source synthesis parameters and related information to estimate the electrolarynx leakage periodic noise and the environmental noise separately.
The electrolarynx leakage noise is periodic, with a period matching the electrolarynx vibration period; it can therefore be estimated from parameters of the synthesized pharyngeal-cavity voice-source waveform such as the fundamental frequency f0 and the amplitude A. As the voice-source synthesis is regulated dynamically, the leakage-noise estimate is adjusted along with it, so the noise estimate stays up to date.
The estimation of the environmental noise is divided into two parts, initial noise estimation and noise updating:
The initial noise is estimated before the system starts and before the user phonates: L frames of noise are collected continuously and their average power spectrum is computed as the initial noise power spectrum:

|N̂0(ω)|² = (1/L) · Σ_{l=1..L} |N_l(ω)|²
A further M frames of noise are then collected, and their power spectra are checked against the condition:

(1 − χ)·|N̂0(ω)|² < |N_m(ω)|² < (1 + χ)·|N̂0(ω)|²
If the condition is satisfied, the initial noise estimate is accepted and estimation terminates; otherwise the noise is re-collected and re-estimated. In the formula above, χ is a relaxation coefficient that should be neither too large nor too small; the system takes χ = 0.4.
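The two-stage initial estimate can be sketched as follows; checking the (1 ± χ) condition on total frame energy rather than per frequency bin is a simplifying assumption of this sketch:

```python
import numpy as np

def estimate_initial_noise(frames_L, frames_M, chi=0.4):
    """Initial environmental-noise estimation.

    Average the power spectra of L silent frames, then verify that each
    of M further frames stays within (1 +/- chi) of the estimate (here
    checked on total frame energy; the text states the per-bin
    condition).  Returns (noise_psd, accepted).
    """
    psd_L = np.abs(np.fft.fft(frames_L, axis=1)) ** 2
    noise_psd = psd_L.mean(axis=0)
    psd_M = np.abs(np.fft.fft(frames_M, axis=1)) ** 2
    ref = noise_psd.sum()
    accepted = all((1 - chi) * ref < p.sum() < (1 + chi) * ref
                   for p in psd_M)
    return noise_psd, accepted
```

When `accepted` is False the caller would re-collect the L frames, mirroring the re-estimation branch above.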
Noise updating is a very important step in environmental-noise estimation, since the environmental noise cannot be assumed stationary throughout the electrolarynx's working period; the system updates the noise adaptively by weighted averaging, expressed as:

|N̂_k(ω)|² = λ·|N̂_{k−1}(ω)|² + (1 − λ)·|Y_k(ω)|²,  applied when |Y_k(ω)|² < ε·|N̂_{k−1}(ω)|²

where |N̂_k(ω)|² is the current noise power-spectrum estimate, |N̂_{k−1}(ω)|² is the previous frame's estimate, and λ and ε are fixed coefficients. Considering the algorithm's stability and its tracking of non-stationary noise, λ is generally taken as 0.9~0.98 and ε as 1.5~2.5.
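A sketch of the weighted-average update; the frame-power test used here to decide when a frame is noise-dominated is an assumed detail, since the text only fixes the roles and ranges of λ and ε:

```python
import numpy as np

def update_noise_psd(noise_psd, frame_psd, lam=0.95, eps=2.0):
    """Adaptive noise update by weighted averaging.

    A frame whose total power is below eps times the running estimate is
    treated as noise-dominated and averaged in (this dominance test is
    an assumed detail); lam in 0.9-0.98 and eps in 1.5-2.5 per the text.
    """
    if frame_psd.sum() < eps * noise_psd.sum():
        return lam * noise_psd + (1.0 - lam) * frame_psd
    return noise_psd
```

Frames dominated by speech leave the estimate untouched, which is what keeps the tracker stable during voiced segments.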
The network communication part mainly implements Socket transmission modules for the audio and video data locally, and the corresponding Socket receiving modules at the remote end, where the data are then played back locally. The module transmits audio and video data separately, creating one Socket connection for each; on each Socket, sending and receiving can proceed simultaneously. Because the audio and video are sent and received in synchrony, the synchronization problem is solved. Since the audio/video data are voluminous, continuous and require reliable delivery, the Transmission Control Protocol (TCP) is chosen for their transport.
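The per-stream TCP design can be sketched with simple length-prefix framing; the framing scheme and the port layout are illustrative assumptions, not details given in the patent:

```python
import socket

def frame_chunk(payload: bytes) -> bytes:
    """Prefix a media chunk with its 4-byte big-endian length, so each
    TCP stream (one per medium) carries self-delimiting messages.
    This framing is an illustrative assumption."""
    return len(payload).to_bytes(4, "big") + payload

def unframe_chunk(buf: bytes):
    """Split one framed chunk off the front of a received byte stream;
    returns (payload, remainder)."""
    n = int.from_bytes(buf[:4], "big")
    return buf[4:4 + n], buf[4 + n:]

def open_media_sockets(host, audio_port, video_port):
    """One TCP connection per stream, as the module above describes;
    the host and ports are hypothetical."""
    return (socket.create_connection((host, audio_port)),
            socket.create_connection((host, video_port)))
```

Keeping one connection per medium lets each stream be pumped by its own thread without the two media blocking each other.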
The audio/video acquisition module of the present invention is generic and applicable to different hardware systems; no particular acquisition module is required. By default the system uses a USB camera as the video acquisition module and a microphone as the audio acquisition module.
The video signal uses the PAL standard, and the image-acquisition parameters can be adjusted through the camera's property pages. To ensure video fluency and good segmentation and tracking, the captured image size is set to 640 × 480, the color-image data format is 24-bit bitmap, the video frame rate defaults to 20 frames/second, and the video delay is 50 ms.
The audio signal is dual-channel with 16-bit quantization. The audio-buffer setting is important: too small a buffer hurts acquisition efficiency, while too large a buffer introduces noticeable delay, and it also interacts with the video frame rate and hence with audio/video synchronization. By measurement, the system defaults the buffer to 70 ms.
The system of the present invention places high demands on real-time performance: in general, the interval between audio/video input and output must not exceed 0.5 s. The system has few external devices, so execution speed is determined mainly by the computer's signal-processing speed. Because the algorithms used are of modest complexity, and techniques such as audio/video tracking simplify the processing, the overall system delay is kept under strict control and the real-time requirement is met.
Claims (6)
1. An automatically adjusted pharyngeal-cavity electrolarynx voice communication system, comprising a microphone, a camera, an electrolarynx vibrator (2), an audio/video acquisition module and a computer software/hardware system, the camera and microphone (3) being fixed on a microphone boom, a fixing band being provided below the earphone (1), and the electrolarynx vibrator being arranged in the fixing band (5), characterized in that:
the system comprises the following three main modules:
1) a facial and neck motion-image acquisition and processing module for the phonation process, which extracts visual speech characteristic parameters from the motion images;
2) a pharyngeal-cavity voice-source dynamic synthesis module, which converts the extracted visual speech characteristic parameters into voice-source synthesis model parameters and synthesizes the waveform according to the pharyngeal-cavity voice-source mathematical model;
3) a pharyngeal-cavity electrolarynx reconstructed-speech real-time enhancement and network communication module, which performs real-time enhancement on the collected pharyngeal-cavity electrolarynx reconstructed speech and transmits the processed speech over the telecommunication network, realizing the network communication function;
the camera transfers the collected motion images as input signals over a data line to the motion-image processing module for visual speech characteristic parameter extraction; the visual speech characteristic parameters output by the motion-image processing in turn enter the pharyngeal-cavity voice-source synthesis module as input signals to control waveform synthesis; the synthesized pharyngeal-cavity voice-source waveform is then output over a data line to the electrolarynx vibrator, which is placed at the pharyngeal region of the neck; the reconstructed pharyngeal-cavity electrolarynx speech, after collection by the microphone, is input over a data line to the speech-enhancement module, which also receives the control signals as input; the inputs of the communication module comprise the video signal collected by the camera and the enhanced speech signal, which are finally output through the network to the other client, while the audio/video signals sent by the other client are likewise received and played in the communication module;
audio/video signals are transmitted from the electrolarynx applying device to the computer for processing; the electrolarynx switch signal extracted by the computer and the synthesized pharyngeal-cavity voice-source signal are transferred to the electrolarynx applying device; and the electric energy needed for the applying device's operation is supplied by the computer.
2. A method of automatically adjusted pharyngeal-cavity electrolarynx voice communication, characterized in that: the audio and video acquisition modules start working simultaneously; the camera of the video acquisition module captures motion images of the user's face and neck during phonation as system input; the image-processing module pre-processes the input images to remove interference signals, then uses facial skin-color features for target-region positioning, segmentation, characteristic-parameter extraction, and characteristic-region motion tracking, obtaining the visual speech characteristic parameters related to phonation; the visual speech parameters are then converted, through the automatic synthesis-control relations, into the model parameters and switch signal required for pharyngeal-cavity voice-source synthesis, controlling the waveform synthesis and the vibration of the applying device; at the same time, the microphone of the audio acquisition module records the pharyngeal-cavity electrolarynx reconstructed-speech signal and, with reference to the switch control signal and the pharyngeal-cavity voice-source synthesis information, guides the estimation of the leakage periodic noise and the environmental noise and the adjustment of the spectral-subtraction parameters, performing spectral-subtraction enhancement on voiced frames; finally, the video images and the enhanced audio signal are integrated and sent through the network system module, received and played at the far end, realizing remote communication.
3. the method for the pharyngeal cavity electronic larynx voice communication according to claim 2 automatically adjusted, it is characterised in that:Described
Facial movement image procossing is mainly the change detection electronic larynx switching signal using lip shape characteristic reaction sounding start-stop, is had
Body step is as follows:
1) initiation parameter, a frame video image is gathered;
2) the lip color characteristic value of lip color filter computational rules rectangular extent is utilized, and is normalized to 0-255 gray levels, obtains lip
Color characteristic value image, if there is former frame, using former frame lip-region scope and colour of skin mean eigenvalue, instruct this frame
Calculate;
3) optimal segmenting threshold is calculated using improved maximum between-cluster variance Otsu methods, image binaryzation segmentation is carried out with this, obtained
To lip pre-segmentation image, if there is former frame, the calculating of this frame segmentation threshold is instructed using former frame segmentation threshold;
4) threshold area cancellation processing is carried out to lip pre-segmentation image, eliminates less picture noise and ambient interferences block;
5) profile is carried out to lip region and central point extracts, detected and matched using improved one-dimensional Hough Hough transform
The elliptical model parameters of lip, predominantly major and minor axis, while lip region scope is obtained, for instructing next frame lip color characteristic value
Calculate, if there is former frame, the Ellipse Matching of this frame is instructed using the major and minor axis of former frame;
6) differentiated using ratio of semi-minor axis length as mouth shape, by compared with threshold value, obtaining switch level signal, output is used as electronic larynx
Switch controlling signal.
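Step 6) reduces to a simple threshold test on the fitted ellipse axes. A sketch follows; the threshold value is a hypothetical stand-in for the one the patent sets by experiment:

```python
def electrolarynx_switch(major_axis, minor_axis, ratio_threshold=0.35):
    """Mouth-shape discrimination from the fitted lip ellipse.

    The minor/major axis ratio grows as the mouth opens; crossing the
    threshold turns the electrolarynx on.  The 0.35 threshold is an
    illustrative assumption, not a value from the patent.
    """
    return (minor_axis / major_axis) > ratio_threshold
```

Run per frame, this yields the binary switch level signal fed to the vibrator control.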
4. the method for the pharyngeal cavity electronic larynx voice communication according to claim 2 automatically adjusted, it is characterised in that:Described
Neck image processing is to extract voice source fundamental frequency, changes in amplitude control signal using the motor message of larynx upper neck region, tool
Body step is as follows:
1) initiation parameter, a frame video image is gathered;
2) the features of skin colors value of Complexion filter device computational rules rectangular extent is utilized, and is normalized to 0-255 gray levels, obtains lip
Color characteristic value image, if there is former frame, using former frame larynx upper neck region scope, instruct this frame computer capacity;
3) calculate optimal segmenting threshold using maximum between-cluster variance Otsu methods, and carry out image binaryzation segmentation, obtain face and
Neck area of skin color image, if there is former frame, the calculating of this frame segmentation threshold is instructed using former frame segmentation threshold;
4) threshold area cancellation processing is carried out to segmentation figure picture, eliminates less picture noise and ambient interferences block;
5) lip lower edge information is referred to, segmentation obtains the larynx since under lip into image the bottom of area of skin color
Neck target area, Save Range are used to instruct next frame features of skin colors value to calculate;
6) optical flow field in larynx low portion of neck region is calculated using the Lucas-Kanada differential methods, obtains reacting the speed point of motion feature
Measure information;
7) cluster analysis is carried out to optical flow field, it is calculated and each cluster centre distance for averagely obtaining, with this determination frequency, amplitude
Change, obtains frequency, changes in amplitude coefficient, and input as pharyngeal cavity voice source synthetic parameters.
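Steps 6)-7) can be sketched with a tiny two-centre clustering of the flow-speed magnitudes; the exact mapping from cluster distances to frequency and amplitude coefficients is an assumed simplification of the claim's description:

```python
import numpy as np

def flow_to_source_params(speeds, iters=10):
    """Cluster optical-flow speed magnitudes with two centres and map
    the result to crude voice-source control coefficients.

    The mapping (larger centre ~ amplitude coefficient, within-cluster
    spread ~ frequency-change coefficient) is an assumed sketch.
    """
    speeds = np.asarray(speeds, float)
    c = np.array([speeds.min(), speeds.max()])
    for _ in range(iters):
        # Assign each sample to its nearest centre, then recompute centres
        assign = np.abs(speeds[:, None] - c[None, :]).argmin(axis=1)
        for k in range(2):
            if np.any(assign == k):
                c[k] = speeds[assign == k].mean()
    spread = np.mean(np.abs(speeds - c[assign]))
    amp_coeff = c.max()
    freq_coeff = spread
    return freq_coeff, amp_coeff
```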
5. the method for the pharyngeal cavity electronic larynx voice communication according to claim 2 automatically adjusted, it is characterised in that:Using base
In the method for detecting human face of the colour of skin, using the cluster of the colour of skin, lip color characteristic value and features of skin colors value are calculated in YUV color spaces
To strengthen the discrimination of target area and background, target enters segmentation link after being strengthened, using maximum between-cluster variance Otsu
Method chooses optimal segmenting threshold, in order to be adapted for lip color and skin color segmentation, improves execution efficiency, has done following improvement:
1) solution of maximum between-cluster variance Otsu methods is not dependent on gray value or a certain color component of RGB color image, but
Lip color and features of skin colors value to each pixel normalize to gray level 0~255, and side between maximum kind is utilized on this gray-scale map
Poor Otsu methods seek optimal threshold T;
2) continuity of time-based continuity and changes of threshold, by the optimal segmenting threshold of previous frame image, and at it
The optimal segmenting threshold of this two field picture is searched in neighborhood, meets that segmentation requires, and improve and perform speed.
6. the method for the pharyngeal cavity electronic larynx voice communication according to claim 2 automatically adjusted, it is characterised in that:The ginseng
Several extractions use different methods for different genius locis:Detect to obtain matching mouth using one-dimensional Hough Hough transform
The elliptical model parameters of lip, the mouth shape characteristic parameter of lip-region is extracted, the control signal as pharyngeal cavity electronic larynx switch;Using
Optical flow method extracts the movable information characteristic parameter of neck area, is used as pharyngeal cavity electronic larynx voice source frequency and width by cluster analysis
The control signal of degree, pharyngeal cavity voice source automatically control synthesis, based on pharyngeal cavity voice source model, using from moving image
The visual speech characteristic parameter of extraction automatically adjusts the synthetic parameters of pharyngeal cavity voice source model, synthesizes pharyngeal cavity voice source waveform, leads to
Cross electronic larynx bringing device and exported by synthetic waveform and vibrated.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610466117.8A CN107545888A (en) | 2016-06-24 | 2016-06-24 | A kind of pharyngeal cavity electronic larynx voice communication system automatically adjusted and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107545888A true CN107545888A (en) | 2018-01-05 |
Family
ID=60960476
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN108596064A (en) * | 2018-04-13 | 2018-09-28 | 长安大学 | Method for detecting a driver's head-down phone-operating behavior based on multi-information fusion
CN114999461A (en) * | 2022-05-30 | 2022-09-02 | 中国科学技术大学 | Silent speech decoding method based on facial and neck surface electromyography
CN114999461B (en) * | 2022-05-30 | 2024-05-07 | 中国科学技术大学 | Silent speech decoding method based on facial and neck surface electromyography
CN116778888A (en) * | 2023-08-21 | 2023-09-19 | 山东鲁南数据科技股份有限公司 | Bionic pronunciation device
CN116778888B (en) * | 2023-08-21 | 2023-11-21 | 山东鲁南数据科技股份有限公司 | Bionic pronunciation device
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication | |
 | WD01 | Invention patent application deemed withdrawn after publication | |
Application publication date: 20180105 |