WO2012003602A1 - Method for reconstructing electronic larynx speech and system thereof - Google Patents

Method for reconstructing electronic larynx speech and system thereof

Info

Publication number
WO2012003602A1
WO2012003602A1 · PCT/CN2010/001022
Authority
WO
WIPO (PCT)
Prior art keywords
sound
area
image
normalized
lip
Prior art date
Application number
PCT/CN2010/001022
Other languages
French (fr)
Chinese (zh)
Inventor
万明习
吴亮
王素品
牛志峰
万聪颖
Original Assignee
西安交通大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 西安交通大学 filed Critical 西安交通大学
Priority to PCT/CN2010/001022 priority Critical patent/WO2012003602A1/en
Publication of WO2012003602A1 publication Critical patent/WO2012003602A1/en
Priority to US13/603,226 priority patent/US8650027B2/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L21/0364Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for improving intelligibility
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/04Time compression or expansion
    • G10L21/057Time compression or expansion for improving intelligibility
    • G10L2021/0575Aids for the handicapped in speaking


Abstract

A method for reconstructing electronic larynx speech, and a system thereof, are provided. The method comprises the following steps: first, extracting model parameters from recorded speech to form a parameter library; then capturing facial images of the speaker and passing them to an image analysis and processing module to obtain the phonation start and stop times and the vowel category being uttered; then synthesizing the voice-source waveform with a voice-source synthesis module; and finally outputting the voice-source waveform through an electronic larynx vibration output module. The voice-source synthesis module first sets the glottal voice-source model parameters and synthesizes the glottal voice-source waveform, then simulates sound propagation in the vocal tract with a waveguide model, selecting the vocal-tract shape parameters according to the vowel category, and thereby synthesizes the electronic larynx voice-source waveform. Speech reconstructed by the method and the system is much closer to the speaker's own voice.

Description

Method for reconstructing electronic larynx speech and system thereof

Technical field
The invention belongs to the field of pathological speech reconstruction, and in particular relates to a method for reconstructing electronic larynx speech and a system thereof.

Background art
Voice and language are the principal means by which humans express feelings and communicate with one another. However, statistics show that every year thousands of people worldwide lose the ability to phonate, temporarily or permanently, as a result of laryngeal surgery. Various voice rehabilitation techniques have therefore emerged, among which esophageal speech, tracheoesophageal speech and artificial electronic larynx speech are the most common; the artificial electronic larynx is widely used because it is simple to operate, broadly applicable and can sustain phonation for long periods.
Chinese patent application No. 200910020897.3 discloses an automatically adjusted method of voice communication with a pharyngeal-cavity electronic larynx that removes extraneous noise and thereby improves the quality of the reconstructed speech. The working principle of an electronic larynx is to supply the missing voice-source vibration, transmit that vibration into the vocal tract through a transducer where it is modulated into speech, and finally radiate the speech from the lips. Supplying the missing voice source is therefore the most fundamental task of an electronic larynx. However, the vibration sources provided by electronic larynges currently on the market are mostly square-wave or pulse signals; improved linear transducers can output a glottal voice source, but neither matches the voice source that is actually missing in use. Whether the device is applied at the neck or inside the mouth, the position at which the vibration enters the vocal tract is not the glottis, and for the different surgical situations of different patients not only the vocal folds but also part of the vocal tract may be missing. All of this must be compensated for in the electronic larynx vibration source, so improvement at this fundamental level is necessary to raise the quality of electronic larynx speech.
In view of the above problems, it is necessary to provide a method for reconstructing electronic larynx speech, and a system thereof, that can solve these technical problems.

Summary of the invention
The technical problem to be solved by the present invention is to provide a method for reconstructing electronic larynx speech and a system thereof. The speech reconstructed by this method not only compensates for the acoustic characteristics of the missing vocal tract but also retains the user's individual characteristics, so that it is closer to the user's own voice and of better quality.
To achieve the above object, the present invention provides a method for reconstructing electronic larynx speech. Model parameters are first extracted from recorded speech to form a parameter library. A facial image of the speaker is then captured and passed to an image analysis and processing module, which, after analysis and processing, yields the phonation start and stop times and the vowel category being uttered. These control a voice-source synthesis module, which synthesizes the voice-source waveform. Finally the waveform is output by an electronic larynx vibration output module, which comprises a driver (front-end) circuit and an electronic larynx vibrator. The voice-source synthesis module operates as follows:
1) Synthesize the glottal voice-source waveform: glottal voice-source model parameters are selected from the parameter library according to the individual characteristics of the user's voice, and the phonation start and stop times control the start and end of voice-source synthesis. The glottal voice source is synthesized with the LF model, expressed mathematically as follows:
u(t) = E_0 e^{αt} sin(ω_g t),  0 ≤ t ≤ t_e
u(t) = −(E_e / (ε t_a)) [e^{−ε(t − t_e)} − e^{−ε(t_c − t_e)}],  t_e < t ≤ t_c
In the above expressions, E_e is the amplitude parameter, and t_p, t_e, t_a and t_c are time parameters representing, respectively, the instant of maximum airflow, the instant of the maximum negative peak, the time constant of the exponential return phase, and the fundamental period. The remaining quantities are obtained jointly from these five parameters according to the following relations:
ε t_a = 1 − e^{−ε(t_c − t_e)}
ω_g = π / t_p
E_e = −E_0 e^{α t_e} sin(ω_g t_e)
U_e0 = E_0 [ e^{α t_e} (α sin(ω_g t_e) − ω_g cos(ω_g t_e)) + ω_g ] / (α² + ω_g²)

These relations, together with the requirement that the glottal flow return to zero at the end of the period, determine α, ε and ω_g for a given set of the five parameters.
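As a concrete reading of this synthesis step, the sketch below (Python with NumPy) generates one glottal cycle from the five parameters named above. It is a minimal illustration of the standard LF model rather than the patent's own implementation; the numerical way ε and α are solved, and the sample parameter values at the end, are assumptions made for the example.

```python
import numpy as np

def lf_glottal_cycle(Ee, tp, te, ta, tc, fs=44100):
    """One cycle of the LF glottal flow-derivative u(t) from the five LF parameters."""
    wg = np.pi / tp                               # open phase: sin(wg*t) peaks at t = tp

    # Return-phase constant eps from  eps*ta = 1 - exp(-eps*(tc - te))  (bisection).
    g = lambda e: e * ta - (1.0 - np.exp(-e * (tc - te)))
    lo, hi = 1e-9, 1.0
    while g(hi) < 0.0:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) < 0.0 else (lo, mid)
    eps = 0.5 * (lo + hi)

    t = np.arange(0.0, tc, 1.0 / fs)
    open_ph = t <= te

    def waveform(alpha):
        # E0 is scaled so the open phase reaches the negative peak -Ee exactly at t = te.
        E0 = -Ee / (np.exp(alpha * te) * np.sin(wg * te))
        u = np.empty_like(t)
        u[open_ph] = E0 * np.exp(alpha * t[open_ph]) * np.sin(wg * t[open_ph])
        tr = t[~open_ph] - te
        u[~open_ph] = -Ee / (eps * ta) * (np.exp(-eps * tr) - np.exp(-eps * (tc - te)))
        return u

    # Growth constant alpha from the closure condition (zero net flow over one cycle);
    # a coarse grid search keeps this sketch simple and robust.
    alphas = np.linspace(0.0, 10.0 / te, 2001)
    balance = [abs(waveform(a).mean()) for a in alphas]
    return waveform(alphas[int(np.argmin(balance))])

# Purely illustrative parameter values (roughly a 110 Hz voice), not taken from the patent.
tc = 1.0 / 110.0
u = lf_glottal_cycle(Ee=1.0, tp=0.45 * tc, te=0.60 * tc, ta=0.028 * tc, tc=tc)
```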
2) Select the shape parameters of the vocal tract according to the vowel category being uttered, simulate sound propagation in the vocal tract with a waveguide model, and compute the voice-source waveform according to the following equations:
< ι = (卜^; )",+ - ψΜ = - '; « + ",一+1 ) — 4 - 4+ι< ι = (卜^; )", + - ψ Μ = - '; « + ", a +1 ) — 4 - 4 + ι
"7 = (1 + )",— +1 + = u;+l + η (u + ) Ai + Al "7 = (1 + )", — +1 + = u; +l + η (u + ) A i + A l
l— r Λ 一、  L- r Λ
glottis  Glottis
[liPS:
Figure imgf000004_0004
R N ~~L 声道由多个均匀截面积的声管级联表示,上式中, 4和 4+,为第 ζ·个和第 ,+ι个声管的面积函数, 《,+和 「分别为第 个声管中的正向声压和反向声 压, 是第/个和第 /+1个声管相邻界面的反射系数。 作为本发明的优选实施例, 所述图像分析与处理模块包括如下步骤: 步骤一: 初始化参数, 即预设分析矩形框范围、 面积阔值和神经网络 权系数, 然后采集一帧视频图像, 其中面积阈值为分析矩形框面积的百分 之一;
[ li PS:
Figure imgf000004_0004
The R N ~~ L channel is represented by a plurality of sound tube cascades of uniform cross-sectional area. In the above formula, 4 and 4+ are the area functions of the first and the first, +ι sound tubes, ", + and "The positive sound pressure and the reverse sound pressure in the first sound tube are the reflection coefficients of the adjacent interface of the /th and +1th sound tubes, respectively. As a preferred embodiment of the present invention, the image analysis and processing module includes the following steps: Step 1: Initialize parameters, that is, preset analysis of a rectangular frame range, an area threshold, and a neural network weight coefficient, and then acquire a frame of video image, wherein The area threshold is one percent of the area of the analysis rectangle;
Step 2: Detect the lip region with a skin-colour-based method, i.e. compute a lip-colour feature value over the analysis rectangle in the YUV colour space according to the following formula and normalize it to 0-255 grey levels:

Z = 0.493R − 0.589G + 0.0265B

Step 3: Compute the optimal segmentation threshold of the lip-colour feature (grey-level) image with an improved maximum between-class variance method, then binarize the image with this threshold to obtain a preliminary lip segmentation image.

Step 4: Using the area-threshold method, remove regions of the preliminary segmentation whose area is smaller than the threshold as noise, giving the final lip segmentation image.

Step 5: Extract the outer contour and centre point of the lip region: with the major axis of the ellipse set at zero degrees to the X axis, match the outer lip contour with an elliptical model and obtain the lengths of the major and minor axes by a one-dimensional Hough transform.

Step 6: Using the normalized semi-major axis, the normalized semi-minor axis, the major/minor axis ratio and the normalized lip area as one group of parameters, compute the phonation start and stop times and the vowel category, where the normalized semi-major axis, normalized semi-minor axis and normalized lip area are all values normalized by the static semi-major axis, semi-minor axis and lip area measured when no sound is being produced.
As another preferred embodiment of the present invention, in Step 6 of the image analysis and processing module an artificial neural network algorithm is used to compute the phonation start and stop times and the vowel category.

As another preferred embodiment of the present invention, the artificial neural network is a three-layer network comprising an input layer, a hidden layer and an output layer. The input layer has four inputs, namely the normalized semi-major axis, the normalized semi-minor axis, the major/minor axis ratio and the normalized lip area; the output layer has six outputs, namely no phonation plus five vowel categories.
As another preferred embodiment of the present invention, during voice-source synthesis the sound-pressure waveform at the lower pharyngeal part of the vocal tract is used as the voice-source waveform applied at the neck.

As another preferred embodiment of the present invention, during voice-source synthesis the sound-pressure waveform at the oral-cavity position is used as the voice-source waveform applied inside the mouth.
To achieve the above object, the present invention also provides an electronic larynx speech system comprising a CMOS image sensor, an FPGA chip connected to the output of the CMOS image sensor, a voice chip connected to the output of the FPGA chip, and an electronic larynx vibrator connected to the output of the voice chip.
The method for reconstructing electronic larynx speech and the system thereof have at least the following advantages. First, in the glottal-source LF model of the voice-source synthesis module, the glottal waveform is characterized jointly by the amplitude parameter E_e and the four time parameters t_p, t_e, t_a and t_c, and these five parameters can be extracted from speech; for a given user they can therefore be extracted from speech retained before the loss of voice and used as synthesis parameters, so the reconstructed speech carries the user's individual characteristics. Second, in the vocal-tract waveguide model of the voice-source synthesis module, the vocal-tract shape parameters are selected according to the vowel category determined from the video signal, and a suitable vibrator application position is chosen according to the surgical resection of the user's throat; the sound-pressure waveform at the corresponding vocal-tract position is then synthesized as the electronic larynx voice-source waveform for that application site. This not only matches the user's actual situation but also largely preserves the user's individual characteristics, making the reconstructed speech closer to the user's own original voice and improving the quality of the reconstructed speech.

Brief description of the drawings
FIG. 1 is a schematic flow chart of the electronic larynx speech reconstruction method of the present invention;

FIG. 2 is a flow chart of the lip-motion image processing and control-parameter extraction routine of the present invention; FIG. 3 is a flow chart of voice-source synthesis in the present invention;

FIG. 4 shows electronic larynx voice-source waveforms synthesized for different phonation and use cases of the present invention; FIG. 5 is a schematic diagram of the electronic larynx vibration output module of the present invention;

FIG. 6 is a structural block diagram of an electronic larynx speech system of the present invention.

Best mode for carrying out the invention
The method for reconstructing electronic larynx speech and the system thereof are described in detail below with reference to the accompanying drawings. The invention uses a computer system as its platform; the synthesis of the voice-source waveform is adjusted according to the user's specific loss of voice and individual phonation characteristics, while the video signal is used to control voice-source synthesis in real time. Finally the voice-source waveform is output through an electronic larynx vibration output module connected via the parallel port.
The system for the electronic larynx speech reconstruction method of the present invention comprises an image acquisition device, an image processing and analysis module connected to the output of the image acquisition device, a voice-source synthesis module connected to the output of the image processing and analysis module, and an electronic larynx vibration output module connected to the output of the voice-source synthesis module.
Referring to FIG. 1, when the system starts, the image acquisition device (a camera) captures facial images of the user during phonation and transmits them to the image processing and analysis module. That module processes and analyses the data: lip detection, segmentation, edge extraction and fitting yield the shape parameters of the elliptical model of the lip edge, after which an artificial neural network algorithm determines the phonation start and stop times and the vowel category, which serve as control signals for voice-source synthesis. The voice-source synthesis module uses the principle of speech synthesis and, according to each user's situation (the surgical condition, the individual phonation characteristics, and the extracted phonation timing and vowel category), synthesizes a voice-source waveform that carries the user's individual characteristics and meets the actual phonation needs. Finally, the synthesized voice-source waveform is output through the electronic larynx vibration output module.
As described above, the electronic larynx speech reconstruction method of the present invention comprises three main parts: first, image acquisition and processing; second, synthesis of the electronic larynx voice source; and third, the vibration output of the electronic larynx. These are described in detail below.
The first part of the invention is image acquisition and processing. Image-processing methods are used to track the mouth state and to derive the control signals that drive the dynamic synthesis of the electronic larynx voice source.

The specific implementation steps of the first part are described in detail below with reference to FIG. 2:
1) Initialize the parameters, i.e. preset the analysis rectangle, the area threshold and the neural-network weight coefficients, and then acquire one frame of video; the area threshold is one percent of the area of the analysis rectangle.

2) Detect the lip region with a skin-colour-based method, i.e. compute the lip-colour feature value over the analysis rectangle in the YUV colour space according to formula (1) below, which enhances the contrast of the lip region, and normalize it to 0-255 grey levels to obtain a lip-colour feature grey-level image:

Z = 0.493R − 0.589G + 0.0265B    (1)

In formula (1), R, G and B are the red, green and blue components, respectively.

3) Compute the optimal segmentation threshold of the lip-colour feature image with an improved maximum between-class variance (Otsu) method, then binarize the image with this threshold to obtain a preliminary lip segmentation image.

4) Using the area-threshold method, remove regions of the preliminary segmentation whose area is smaller than the threshold as noise, giving the final lip segmentation image.
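A compact sketch of steps 2 to 4 is given below, using OpenCV and NumPy. Plain Otsu thresholding and 8-connected component filtering stand in for the "improved maximum between-class variance" step and its noise removal; the function and argument names are illustrative, not taken from the patent.

```python
import cv2
import numpy as np

def segment_lips(frame_bgr, roi, min_area):
    """Lip segmentation inside the analysis rectangle (steps 2 to 4 above).

    frame_bgr : video frame as an OpenCV BGR image
    roi       : (x, y, w, h) analysis rectangle around the mouth
    min_area  : area threshold (about one percent of the rectangle area in the text)
    """
    x, y, w, h = roi
    patch = frame_bgr[y:y + h, x:x + w].astype(np.float32)
    b, g, r = patch[..., 0], patch[..., 1], patch[..., 2]

    # Step 2: lip-colour feature Z = 0.493 R - 0.589 G + 0.0265 B, rescaled to 0-255.
    z = 0.493 * r - 0.589 * g + 0.0265 * b
    z = cv2.normalize(z, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

    # Step 3: Otsu threshold as a stand-in for the improved between-class variance method.
    _, binary = cv2.threshold(z, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Step 4: drop connected regions smaller than the area threshold as noise.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    mask = np.zeros_like(binary)
    for i in range(1, n):                          # label 0 is the background
        if stats[i, cv2.CC_STAT_AREA] >= min_area:
            mask[labels == i] = 255
    return mask
```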
5) Extract the outer contour and centre point of the lip region: assuming the major axis of the ellipse is at zero degrees to the X axis, fit the outer lip contour with an elliptical model and obtain the lengths of the major and minor axes by a one-dimensional Hough transform.

6) Take the normalized semi-major axis, the normalized semi-minor axis, the major/minor axis ratio and the normalized lip area as one group of parameters; the artificial neural network then yields the phonation start and stop times and the vowel category, which guide and control voice-source synthesis.

It should be noted that in the present invention the normalized semi-major axis, the normalized semi-minor axis and the normalized lip area are all values normalized by the static semi-major axis, semi-minor axis and lip area measured when no sound is being produced.
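The four control parameters can then be formed as in the small helper below; the resting-state values are measured once from a silent frame, as the text specifies. The function is hypothetical and only illustrates the normalization.

```python
def lip_parameters(a, b, lip_area, a_rest, b_rest, area_rest):
    """Four neural-network inputs from the fitted ellipse and the lip mask.

    a, b     : semi-major and semi-minor axes of the fitted ellipse (pixels)
    lip_area : pixel area of the segmented lip region in the current frame
    a_rest, b_rest, area_rest : the same quantities from the silent (resting) frame
    """
    return (a / a_rest,              # normalized semi-major axis
            b / b_rest,              # normalized semi-minor axis
            a / b,                   # major/minor axis ratio
            lip_area / area_rest)    # normalized lip area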
In this embodiment the axis ratio and the normalized parameters are used as the inputs of the neural network because they not only reflect changes of mouth shape accurately but also allow the phonation start and stop times and the vowel category to be determined, and they are largely invariant to distance; this overcomes judgment errors caused by changes in the apparent lip size in the image as the distance between the user and the camera varies. The resulting decision signal therefore agrees well with the speech waveform and the judgment accuracy is high.

In addition, to meet real-time requirements, the image processing of the present invention uses a joint spatio-temporal tracking control method in both lip segmentation and ellipse-model parameter matching: based on the assumption that the face changes slowly and continuously during speech, the segmented region and the matched ellipse parameters of the previous frame guide the segmentation rectangle and the parameter search range of the current frame. Intra-frame and inter-frame information is thus exploited, which improves both the processing speed and the calculation accuracy.
The artificial neural network in the present invention is a three-layer feed-forward network comprising an input layer (the normalized semi-major axis, the normalized semi-minor axis, the major/minor axis ratio and the normalized lip area), a hidden layer (thirty nodes) and an output layer (no phonation plus five vowel categories). The node weight coefficients of the network are obtained beforehand by training on samples with the error back-propagation (BP) algorithm; the samples are the lip-shape parameters in the silent static state and while each vowel is being produced.
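A forward pass of the 4-30-6 network described here could look like the sketch below; the sigmoid activation and the weight shapes are assumptions, and the weights themselves would come from the offline BP training mentioned in the text.

```python
import numpy as np

def classify_mouth_state(params, w_hid, b_hid, w_out, b_out):
    """Map the four lip parameters to one of six classes (0 = silence, 1..5 = vowels).

    w_hid : (30, 4) hidden-layer weights     b_hid : (30,) hidden-layer biases
    w_out : (6, 30) output-layer weights     b_out : (6,)  output-layer biases
    All weights are assumed to be pre-trained with error back-propagation.
    """
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    x = np.asarray(params, dtype=float)       # normalized axes, axis ratio, normalized area
    h = sigmoid(w_hid @ x + b_hid)            # 30 hidden nodes
    y = sigmoid(w_out @ h + b_out)            # 6 outputs
    return int(np.argmax(y))
```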
Referring now to FIG. 3, the second part of the present invention is the synthesis of the voice source. Using the principle of speech synthesis, the electronic larynx voice source is synthesized in two source-filter steps, as follows.

Step 1: Synthesize the glottal voice-source waveform. The glottal-source model parameters are selected and set in the parameter library according to the individual characteristics of the user's voice, the phonation start and stop times obtained by the image acquisition and processing module control the start and end of voice-source synthesis, and the glottal voice source is synthesized according to the LF model, expressed mathematically as:

u(t) = E_0 e^{αt} sin(ω_g t),  0 ≤ t ≤ t_e
u(t) = −(E_e / (ε t_a)) [e^{−ε(t − t_e)} − e^{−ε(t_c − t_e)}],  t_e < t ≤ t_c

In these expressions E_e is the amplitude parameter and t_p, t_e, t_a and t_c are time parameters representing, respectively, the instant of maximum airflow, the instant of the maximum negative peak, the time constant of the exponential return phase, and the fundamental period; the remaining quantities are obtained from these five parameters through the relations given earlier.
Step 2: Select the vocal-tract shape parameters according to the vowel category that has been determined, simulate sound propagation in the vocal tract with the waveguide model, and compute, according to the following equations, the sound-pressure waveform at the position where the vibration actually enters the vocal tract in use; this is the synthesized electronic larynx voice source.

The waveguide model describing sound propagation in the vocal tract is expressed mathematically as:

u⁺_{i+1} = (1 − r_i) u⁺_i − r_i u⁻_{i+1}
u⁻_i = r_i u⁺_i + (1 + r_i) u⁻_{i+1}
r_i = (A_i − A_{i+1}) / (A_i + A_{i+1})

with the glottal voice source injected at the glottis end of the cascade and, at the lips, an output u_out = (1 − r_N) u⁺_N with r_N ≈ −1.

Here the vocal tract is formed by cascading a number of acoustic tubes of uniform cross-sectional area, represented by the area function A_i; u⁺_i and u⁻_i are the forward and backward sound-pressure waves in the i-th tube, and r_i is the reflection coefficient at the interface between the i-th and (i+1)-th tubes, determined by the cross-sectional areas A_i and A_{i+1} of the adjacent tubes. By iterating these equations the waveguide model can compute the sound pressure at any position along the vocal tract.
It should be noted that, first, in the LF model of the voice-source synthesis module the glottal voice-source waveform is determined jointly by the amplitude parameter E_e and the four time parameters t_p, t_e, t_a and t_c. Because anatomy and phonation habits differ, the glottal voice-source waveform differs from person to person, and these individual differences are reflected in the five LF parameters, all of which can be extracted from speech. For example, the fundamental frequency of female voices is generally higher than that of male voices, so t_c is smaller for women than for men. In the present invention, in order to fully preserve the user's voice characteristics and reconstruct speech like that produced before the loss of voice, these five parameters are extracted from speech recorded before the patient lost the voice and stored in the parameter library; when the electronic larynx is used, the parameters are simply retrieved from the library to reconstruct speech with the user's own phonation characteristics. A patient whose speech was not recorded before the loss of voice can instead choose parameters corresponding to voice characteristics he or she prefers and reconstruct a preferred voice.
Second, in the waveguide model of the voice-source synthesis module the only free parameter is the vocal-tract area function A_i. Different speakers, and the same speaker producing different sounds, have different vocal-tract shapes, so the present invention uses a vowel-category control method in which different vocal-tract area functions are selected for synthesis according to the vowel being uttered. For each user a template library mapping vowels to vocal-tract area functions is first established: the vocal-tract response function is obtained from speech recorded by the user by an inverse method, and the best-matching vocal-tract area function is then derived from that response function, so that the user's individual phonation characteristics are preserved. During synthesis it is only necessary to look up the vocal-tract function corresponding to the determined vowel category.
It follows from the above that the two-step synthesis can compute the sound-pressure signal at any position in the vocal tract; which position is taken as the electronic larynx voice source, however, must be decided according to the user's specific surgical situation and manner of use.
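The iteration can be pictured with the short simulation below, which passes a source waveform through the tube cascade and records the wave in every section so that the pressure at any position (lower pharynx, oral cavity, and so on) can be read out. The glottal-end reflection, the lip reflection coefficient and the one-sample travel time per section are simplifying assumptions of this sketch, not values from the patent.

```python
import numpy as np

def vocal_tract_sections(source, areas, r_glottis=0.7, r_lips=-0.9):
    """Propagate a voice-source waveform through a cascade of uniform tube sections.

    source : samples injected at the glottal end
    areas  : cross-sectional areas A_1..A_N of the tube sections
    Returns (per_section, radiated): per_section[n, i] is the wave observed in
    section i at sample n; radiated is the waveform leaving the lips.
    """
    A = np.asarray(areas, dtype=float)
    N = len(A)
    r = (A[:-1] - A[1:]) / (A[:-1] + A[1:])        # junction reflection coefficients

    fwd = np.zeros(N)                              # forward wave at the right end of each section
    bwd = np.zeros(N)                              # backward wave at the left end of each section
    per_section = np.zeros((len(source), N))
    radiated = np.zeros(len(source))

    for n, x in enumerate(source):
        enter_left = np.zeros(N)                   # waves entering each section from its left end
        enter_right = np.zeros(N)                  # waves entering each section from its right end

        enter_left[0] = x + r_glottis * bwd[0]     # drive the glottal end (reflection assumed)
        for i in range(N - 1):                     # scattering at each internal junction
            enter_left[i + 1] = (1.0 - r[i]) * fwd[i] - r[i] * bwd[i + 1]
            enter_right[i] = r[i] * fwd[i] + (1.0 + r[i]) * bwd[i + 1]
        enter_right[N - 1] = r_lips * fwd[N - 1]   # partial reflection at the lips
        radiated[n] = (1.0 - r_lips) * fwd[N - 1]

        fwd, bwd = enter_left, enter_right         # one-sample travel through each section
        per_section[n] = fwd + bwd                 # total wave observed in each section

    return per_section, radiated
```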
Referring to FIG. 4, which shows voice-source waveforms synthesized for different situations: for a user who has undergone laryngectomy for laryngeal cancer but whose vocal tract is largely intact, vibration can be applied at the neck so as to make full use of the remaining vocal tract, and the sound-pressure waveform at the lower pharyngeal part of the vocal tract is therefore selected as the electronic larynx voice-source waveform; FIG. 4(a) and FIG. 4(c) show the voice-source waveforms synthesized in this case for two different vowels. For a patient with pharyngeal cancer a pharyngectomy is required, so that not only the vocal folds but also a large part of the vocal tract is lost; in this case the sound-pressure waveform at the oral-cavity position must be selected as the voice-source waveform, and FIG. 4(b) and FIG. 4(d) show the waveforms synthesized in this case for the same two vowels.
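Tying the pieces together, a single control cycle might look like the sketch below. The vowel-to-area-function template library, the section indices chosen for the neck and oral read-out positions, and the resting area function are all illustrative assumptions; the patent derives the templates from the user's own recordings and chooses the position from the surgical situation.

```python
import numpy as np

def synthesize_voice_source(lip_params, net, templates, rest_areas, lf_params,
                            placement="neck", neck_section=2, oral_section=16):
    """One control cycle: classify the mouth state, pick the tract shape, synthesize."""
    vowel = classify_mouth_state(lip_params, *net)        # 0 = silence, 1..5 = vowels
    if vowel == 0:
        return np.zeros(0)                                # no phonation, nothing to output
    areas = templates.get(vowel, rest_areas)              # per-user vowel -> area function
    glottal = lf_glottal_cycle(**lf_params)               # one LF glottal cycle
    per_section, _ = vocal_tract_sections(glottal, areas)
    section = neck_section if placement == "neck" else oral_section
    return per_section[:, section]                        # waveform used to drive the vibrator
```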
As FIG. 4 shows, the present invention synthesizes different electronic larynx voice-source waveforms for different surgical situations, use cases and phonation categories; this not only meets the needs of actual use but also preserves the user's individual characteristics, and it improves the quality of electronic larynx reconstructed speech to a large extent.
Referring to Fig. 5, the third module of the present invention is the vibration output module of the electrolarynx, comprising an electrolarynx vibrator and a front-end circuit for the vibrator. The computer sends the synthesized electrolarynx voice source waveform signal to the front-end circuit through the LPT parallel port; after digital-to-analog conversion and power amplification, an analog voltage signal is output from the audio connector, and finally the electrolarynx vibrator vibrates to output the voice source.
The electrolarynx vibrator is a linear transducer, i.e. it converts the voltage signal linearly into mechanical vibration, so its vibration output follows the synthesized voice source; to meet the needs of intra-oral application, a sound-guide tube is used to lead the vibration into the oral cavity.
Still referring to Fig. 5, the front-end circuit of the electrolarynx vibrator consists of input/output interfaces, a D/A converter, a power amplifier and a power supply controller. The input/output interfaces are a 25-pin digital parallel input port and a 3.5 mm analog audio output jack; the digital parallel port is connected to the parallel-port output of the computer with a transfer rate of 44100 bytes/s, and the analog audio output is connected to the electrolarynx vibrator. The D/A converter is a DAC0832 with 8-bit data precision, whose inputs connect directly to the data bits of the LPT parallel port. The power amplifier is a TI TPA701 audio power amplifier, powered from +3.5 V to +5.5 V, with an output power of up to 700 mW. The power supply controller is a 5 V battery that provides +5 V DC to the chips.
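For illustration, the digital side of this front-end can be sketched in Python as follows; write_byte is a hypothetical stand-in for whatever driver actually toggles the parallel-port data lines, and the software-timed pacing is only a stand-in for the real 44100 byte/s transfer.

import time
import numpy as np

def to_dac_codes(signal):
    # Map a float waveform in [-1, 1] to unsigned 8-bit codes suitable for an
    # 8-bit parallel DAC stage such as the DAC0832 described above.
    clipped = np.clip(np.asarray(signal, dtype=float), -1.0, 1.0)
    return np.round((clipped + 1.0) * 127.5).astype(np.uint8)

def stream_codes(codes, write_byte, rate=44100):
    # Push one code per sample period through the caller-supplied write_byte
    # callback; real timing would be handled by hardware or a driver.
    period = 1.0 / rate
    t0 = time.perf_counter()
    for k, code in enumerate(codes):
        write_byte(int(code))
        while time.perf_counter() - t0 < (k + 1) * period:
            pass   # busy-wait until the next sample instant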
In the embodiment above, the electrolarynx speech system is implemented on the basis of a video capture device, a computer and the electrolarynx vibration output module. For ease of implementation, however, another embodiment may be adopted, as shown in Fig. 6. In this embodiment the electrolarynx speech system comprises a CMOS image sensor for capturing images; an FPGA chip connected to the output of the CMOS image sensor for analyzing and processing the captured images and synthesizing the voice source; a voice chip connected to the output of the FPGA chip for performing D/A conversion and power amplification of the synthesized electrolarynx voice source waveform; and an electrolarynx vibrator connected to the output of the voice chip.
The CMOS image sensor is a MICRON MT9M011 with a maximum resolution of 640 x 480 and a frame rate of 60 frames/s at that resolution; it is used to capture images of the user's face during phonation.
The FPGA chip supports SOPC technology and implements the function of taking video data as input, performing video data processing and analysis and electrolarynx voice source synthesis, and finally outputting the electrolarynx voice source waveform data. Besides the interfaces to the CMOS image sensor and the voice chip, the FPGA chip is also connected to an LCD, a FLASH memory and an SDRAM, where the LCD is a liquid-crystal display used to show relevant data, FLASH is flash memory, and SDRAM is synchronous dynamic random-access memory.
The voice chip is an AIC23, which integrates the D/A converter and the power amplification function; after D/A conversion and power amplification, the signal is output from the audio interface to the electrolarynx vibrator.
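Functionally, the embedded embodiment runs the same per-frame loop as the computer-based one. The following pseudocode-style Python sketch shows the control flow only; grab_frame, analyze_lips, classify, synthesize_chunk and output_chunk are hypothetical stand-ins for the image-sensor interface, the image analysis, the vowel classifier, the voice source synthesis and the voice-chip output described above.

def electrolarynx_loop(grab_frame, analyze_lips, classify, synthesize_chunk, output_chunk):
    # Per-frame control flow: camera -> lip features -> vowel/onset decision ->
    # voice-source chunk -> vibrator output. One chunk of source waveform is
    # produced per video frame (about 1/60 s at 60 frames/s).
    voicing = False
    while True:
        frame = grab_frame()                 # image from the CMOS sensor
        features = analyze_lips(frame)       # normalized axes, axis ratio, lip area
        label = classify(features)           # "silence" or a vowel category
        if label == "silence":
            voicing = False                  # voicing offset
            continue
        if not voicing:
            voicing = True                   # voicing onset detected
        chunk = synthesize_chunk(label)      # LF + waveguide synthesis for this vowel
        output_chunk(chunk)                  # D/A conversion, amplification, vibrator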
The above is only one embodiment of the present invention and is not the whole or the only embodiment; any equivalent variation of the technical solution of the present invention made by a person of ordinary skill in the art after reading the present specification falls within the scope of the claims of the present invention.

Claims

1. A method for reconstructing electrolarynx speech, in which model parameters are first extracted from collected speech to form a parameter library; a facial image of the speaker is then captured and transmitted to an image analysis and processing module; after the image analysis and processing module has analyzed and processed the image, the voicing onset/offset times and the vowel category being produced are obtained; the voicing onset/offset times and the vowel category then control a voice source synthesis module, which synthesizes the voice source waveform; finally, the voice source waveform is output through an electrolarynx vibration output module comprising a front-end circuit and an electrolarynx vibrator; characterized in that the voice source synthesis module performs synthesis in the following steps:
1) synthesizing the glottal voice source waveform: glottal voice source model parameters are selected from the parameter library according to the individual phonation characteristics of the user, the voicing onset/offset times controlling the start and end of the voice source synthesis; the glottal voice source is synthesized with the LF model, expressed mathematically as:
u_g'(t) = E_0 · e^(α·t) · sin(ω_g·t),  for 0 ≤ t ≤ t_e

u_g'(t) = −(E_e / (ε·t_a)) · [e^(−ε·(t − t_e)) − e^(−ε·(t_c − t_e))],  for t_e ≤ t ≤ t_c

where E_e is the amplitude parameter and t_p, t_e, t_a, t_c are time parameters denoting, respectively, the instant of the maximum airflow peak, the instant of the maximum negative peak, the time constant of the exponential return phase, and the fundamental period; the remaining constants of the model (E_0, α, ε and ω_g) are derived from these parameters (original formula image not reproduced here);
2) selecting the vocal-tract shape parameters according to the vowel category being produced, and simulating sound propagation in the vocal tract with a waveguide model, the voice source waveform being computed from the following relations:

at the junction between the i-th and (i+1)-th tube sections:
u_{i+1}^+ = (1 − r_i) · u_i^+ − r_i · u_{i+1}^−
u_i^− = (1 + r_i) · u_{i+1}^− + r_i · u_i^+
r_i = (A_i − A_{i+1}) / (A_i + A_{i+1})

at the glottis: u_1^+ = u_g − r_g · u_1^−
at the lips: u_out = (1 − r_L) · u_N^+

where the vocal tract is represented by a cascade of sound tubes of uniform cross-sectional area; A_i and A_{i+1} are the area functions of the i-th and (i+1)-th sound tubes, u_i^+ and u_i^− are respectively the forward and backward sound pressures in the i-th sound tube, r_i is the reflection coefficient at the interface between the i-th and (i+1)-th sound tubes, u_g is the glottal source obtained in step 1), and r_g and r_L are the reflection coefficients at the glottal and lip ends.
2. The method for reconstructing electrolarynx speech according to claim 1, characterized in that the image analysis and processing module comprises the following steps:
Step 1: initializing parameters: presetting the range of the analysis rectangle, the area threshold and the neural-network weight coefficients, and then capturing one frame of the video image, the area threshold being a percentage of the area of the analysis rectangle;
Step 2: detecting the lip region with a skin-color-based detection method, i.e. computing the lip-color feature value over the analysis rectangle in the YUV color space according to the following formula and normalizing it to 0-255 gray levels:
Z = 0.493R − 0.589G + 0.026B
Step 3: computing the optimal segmentation threshold of the lip-color feature gray image with an improved maximum between-class variance (Otsu) method, and then binarizing the image with this threshold to obtain a preliminary lip segmentation image;
Step 4: removing, as noise, the regions of the preliminary segmentation image whose area is smaller than the area threshold, to obtain the final lip segmentation image;
Step 5: extracting the outer contour and center point of the lip region: the major axis of an ellipse is set at a zero-degree angle to the x-axis, the outer lip contour is matched with the ellipse model, and the lengths of the major and minor axes of the ellipse are obtained by one-dimensional Hough transform detection;
Step 6: taking the normalized semi-major axis, the normalized semi-minor axis, the major-to-minor-axis ratio and the normalized lip area as a set of parameters, and computing the voicing onset/offset times and the vowel category, wherein the normalized semi-major axis, the normalized semi-minor axis and the normalized lip area are values normalized with respect to the static semi-major axis, semi-minor axis and lip area measured when no sound is being produced.
3. The method for reconstructing electrolarynx speech according to claim 2, characterized in that in step 6 of the image analysis and processing module, an artificial neural network algorithm is used to compute the voicing onset/offset times and the vowel category.
4. The method for reconstructing electrolarynx speech according to claim 3, characterized in that the artificial neural network is a three-layer network comprising an input layer, a hidden layer and an output layer, wherein the input layer has four inputs, namely the normalized semi-major axis, the normalized semi-minor axis, the major-to-minor-axis ratio and the normalized lip area, and the output layer has six outputs, namely no voicing and five vowel categories.
5. The method for reconstructing electrolarynx speech according to claim 1 or 4, characterized in that in the voice source synthesis, the sound pressure waveform in the lower pharyngeal part of the vocal tract is used as the voice source waveform applied at the neck.
6. The method for reconstructing electrolarynx speech according to claim 1 or 4, characterized in that in the voice source synthesis, the sound pressure waveform at the oral-cavity position is used as the voice source waveform applied inside the mouth.
7. An electrolarynx speech system applying the method according to claim 1, characterized in that it comprises a CMOS image sensor, an FPGA chip connected to the output of the CMOS image sensor, a voice chip connected to the output of the FPGA chip, and an electrolarynx vibrator connected to the output of the voice chip.
PCT/CN2010/001022 2010-07-09 2010-07-09 Method for reconstructing electronic larynx speech and system thereof WO2012003602A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2010/001022 WO2012003602A1 (en) 2010-07-09 2010-07-09 Method for reconstructing electronic larynx speech and system thereof
US13/603,226 US8650027B2 (en) 2010-07-09 2012-09-04 Electrolaryngeal speech reconstruction method and system thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2010/001022 WO2012003602A1 (en) 2010-07-09 2010-07-09 Method for reconstructing electronic larynx speech and system thereof

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US13/603,226 Continuation US8650027B2 (en) 2010-07-09 2012-09-04 Electrolaryngeal speech reconstruction method and system thereof

Publications (1)

Publication Number Publication Date
WO2012003602A1 true WO2012003602A1 (en) 2012-01-12

Family

ID=45440743

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2010/001022 WO2012003602A1 (en) 2010-07-09 2010-07-09 Method for reconstructing electronic larynx speech and system thereof

Country Status (2)

Country Link
US (1) US8650027B2 (en)
WO (1) WO2012003602A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101475894B1 (en) * 2013-06-21 2014-12-23 서울대학교산학협력단 Method and apparatus for improving disordered voice
CN104835492A (en) * 2015-04-03 2015-08-12 西安交通大学 Electronic larynx fricative reconstruction method
CN108831472B (en) * 2018-06-27 2022-03-11 中山大学肿瘤防治中心 Artificial intelligent sounding system and sounding method based on lip language recognition
CN111192568B (en) * 2018-11-15 2022-12-13 华为技术有限公司 Speech synthesis method and speech synthesis device
US11757469B2 (en) * 2021-04-01 2023-09-12 Qualcomm Incorporated Compression technique for deep neural network weights
WO2023007509A1 (en) * 2021-07-27 2023-02-02 Indian Institute Of Technology Bombay Method and system for time-scaled audiovisual feedback of speech production efforts

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1776809A (en) * 2005-10-17 2006-05-24 西安交通大学 Method and system for reinforcing electronic guttural sound
CN101030384A (en) * 2007-03-27 2007-09-05 西安交通大学 Electronic throat speech reinforcing system and its controlling method
CN101474104A (en) * 2009-01-14 2009-07-08 西安交通大学 Self-adjusting pharyngeal cavity electronic larynx voice communication system and method
WO2010004397A1 (en) * 2008-07-11 2010-01-14 University Of Witwatersrand, Johannesburg An artificial larynx

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
SU681447A1 (en) * 1975-04-15 1979-08-25 Институт математики СО АН СССР Speech imitator
US4292472A (en) * 1979-08-29 1981-09-29 Lennox Thomas M Electronic artificial larynx
US4550427A (en) * 1981-03-30 1985-10-29 Thomas Jefferson University Artificial larynx
US4571739A (en) * 1981-11-06 1986-02-18 Resnick Joseph A Interoral Electrolarynx
US4672673A (en) * 1982-11-01 1987-06-09 Thomas Jefferson University Artificial larynx
US4547894A (en) * 1983-11-01 1985-10-15 Xomed, Inc. Replaceable battery pack for intra-oral larynx
US4821326A (en) * 1987-11-16 1989-04-11 Macrowave Technology Corporation Non-audible speech generation method and apparatus
FR2632725B1 (en) * 1988-06-14 1990-09-28 Centre Nat Rech Scient METHOD AND DEVICE FOR ANALYSIS, SYNTHESIS, SPEECH CODING
US5326349A (en) * 1992-07-09 1994-07-05 Baraff David R Artificial larynx
US6006175A (en) * 1996-02-06 1999-12-21 The Regents Of The University Of California Methods and apparatus for non-acoustic speech characterization and recognition
US6377919B1 (en) * 1996-02-06 2002-04-23 The Regents Of The University Of California System and method for characterizing voiced excitations of speech and acoustic signals, removing acoustic noise from speech, and synthesizing speech
FR2786908B1 (en) * 1998-12-04 2001-06-08 Thomson Csf PROCESS AND DEVICE FOR THE PROCESSING OF SOUNDS FOR THE HEARING DISEASE
US7676372B1 (en) * 1999-02-16 2010-03-09 Yugen Kaisha Gm&M Prosthetic hearing device that transforms a detected speech into a speech of a speech form assistive in understanding the semantic meaning in the detected speech
US6487531B1 (en) * 1999-07-06 2002-11-26 Carol A. Tosaya Signal injection coupling into the human vocal tract for robust audible and inaudible voice recognition
US7212639B1 (en) * 1999-12-30 2007-05-01 The Charles Stark Draper Laboratory Electro-larynx
US6856952B2 (en) * 2001-02-28 2005-02-15 Intel Corporation Detecting a characteristic of a resonating cavity responsible for speech
WO2002077972A1 (en) * 2001-03-27 2002-10-03 Rast Associates, Llc Head-worn, trimodal device to increase transcription accuracy in a voice recognition system and to process unvocalized speech
US20050281412A1 (en) * 2004-06-16 2005-12-22 Hillman Robert E Voice prosthesis with neural interface
US20110051944A1 (en) * 2009-09-01 2011-03-03 Lois Margaret Kirkpatrick Laryngophone combined with transmitter and power source

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1776809A (en) * 2005-10-17 2006-05-24 西安交通大学 Method and system for reinforcing electronic guttural sound
CN101030384A (en) * 2007-03-27 2007-09-05 西安交通大学 Electronic throat speech reinforcing system and its controlling method
WO2010004397A1 (en) * 2008-07-11 2010-01-14 University Of Witwatersrand, Johannesburg An artificial larynx
CN101474104A (en) * 2009-01-14 2009-07-08 西安交通大学 Self-adjusting pharyngeal cavity electronic larynx voice communication system and method

Also Published As

Publication number Publication date
US8650027B2 (en) 2014-02-11
US20130035940A1 (en) 2013-02-07

Similar Documents

Publication Publication Date Title
WO2012003602A1 (en) Method for reconstructing electronic larynx speech and system thereof
CN101916566B (en) Electronic larynx speech reconstructing method and system thereof
Story et al. Vocal tract area functions for an adult female speaker based on volumetric imaging
Zañartu et al. Subglottal impedance-based inverse filtering of voiced sounds using neck surface acceleration
Abushakra et al. Acoustic signal classification of breathing movements to virtually aid breath regulation
WO2019034184A1 (en) Method and system for articulation evaluation by fusing acoustic features and articulatory movement features
US20230154450A1 (en) Voice grafting using machine learning
CN106264839A (en) Intelligent snore stopping pillow
Lulich et al. The relation between tongue shape and pitch in clarinet playing using ultrasound measurements
WO2012112985A2 (en) System and methods for evaluating vocal function using an impedance-based inverse filtering of neck surface acceleration
Acker Vocal tract adjustments for the projected voice
Dang et al. A study on transvelar coupling for non-nasalized sounds
Al Kork et al. A multi-sensor helmet to capture rare singing, an intangible cultural heritage study
CN215349053U (en) Congenital heart disease intelligent screening robot
Zhang et al. Morphological characteristics of male and female hypopharynx: A magnetic resonance imaging-based study
Colton et al. Measuring vocal fold function
Apostol et al. 3D geometry of the vocal tract and interspeaker variability
Krecichwost et al. Multichannel speech acquisition and analysis for computer-aided sigmatism diagnosis in children
Narayanan Fricative consonants: An articulatory, acoustic, and systems study
Mehta et al. Use of aerodynamic measures in clinical voice assessment
Ma et al. The acoustical role of vocal tract in the horseshoe bat, Rhinolophus pusillus
CN1263423C (en) Method and device for determining respiratory system condition by using respiratory system produced sound
Sinescu et al. Quantitative parameters which describe speech sound distortions due to inadequate dental mounting
Story et al. The relation of velopharyngeal coupling area to the identification of stop versus nasal consonants in North American English based on speech generated by acoustically driven vocal tract modulations
Jesus et al. Ultrasonography applied to the description of voice quality settings in adult speakers of Brazilian Portuguese

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10854262

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 10854262

Country of ref document: EP

Kind code of ref document: A1