WO2012003602A1 - Method for reconstructing electronic larynx speech and system thereof - Google Patents

Method for reconstructing electronic larynx speech and system thereof

Info

Publication number
WO2012003602A1
WO2012003602A1 · PCT/CN2010/001022
Authority
WO
WIPO (PCT)
Prior art keywords
sound
area
image
normalized
lip
Prior art date
Application number
PCT/CN2010/001022
Other languages
French (fr)
Chinese (zh)
Inventor
万明习
吴亮
王素品
牛志峰
万聪颖
Original Assignee
西安交通大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 西安交通大学 filed Critical 西安交通大学
Priority to PCT/CN2010/001022 priority Critical patent/WO2012003602A1/en
Publication of WO2012003602A1 publication Critical patent/WO2012003602A1/en
Priority to US13/603,226 priority patent/US8650027B2/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L21/0364Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for improving intelligibility
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/04Time compression or expansion
    • G10L21/057Time compression or expansion for improving intelligibility
    • G10L2021/0575Aids for the handicapped in speaking


Abstract

A method for reconstructing electronic larynx speech, and a system thereof, are provided. The method comprises the following steps: first, extracting model parameters from recorded speech to form a parameter library; then capturing facial images of the speaker and passing them to an image analysis and processing module to obtain the phonation start and stop times and the vowel category being uttered; then synthesizing the voice-source waveform with a voice-source synthesis module; and finally outputting the voice-source waveform through an electronic larynx vibration output module. The voice-source synthesis module first sets the glottal voice-source model parameters and synthesizes the glottal voice-source waveform, then simulates sound propagation in the vocal tract with a waveguide model, selecting the vocal-tract shape parameters according to the vowel category, and thereby synthesizes the electronic larynx voice-source waveform. Speech reconstructed by the method and the system is much closer to the speaker's own voice.

Description

Method for reconstructing electronic larynx speech and system thereof

Technical field
The invention belongs to the field of pathological speech reconstruction, and in particular relates to a method for reconstructing electronic larynx speech and a system thereof.

Background art
Voice and language are the principal means by which humans express feelings and communicate with one another. However, statistics show that every year thousands of people worldwide lose the ability to phonate, temporarily or permanently, as a result of laryngeal surgery. Various voice rehabilitation techniques have therefore emerged, among which esophageal speech, tracheoesophageal speech and artificial electronic larynx speech are the most common; the artificial electronic larynx is widely used because it is simple to operate, broadly applicable and can sustain phonation for long periods.
Chinese patent application No. 200910020897.3 discloses an automatically adjusted method of voice communication with a pharyngeal-cavity electronic larynx that removes extraneous noise and thereby improves the quality of the reconstructed speech. The working principle of an electronic larynx is to supply the missing voice-source vibration, transmit that vibration into the vocal tract through a transducer where it is modulated into speech, and finally radiate the speech from the lips. Supplying the missing voice source is therefore the most fundamental task of an electronic larynx. However, the vibration sources provided by electronic larynges currently on the market are mostly square-wave or pulse signals; improved linear transducers can output a glottal voice source, but neither matches the voice source that is actually missing in use. Whether the device is applied at the neck or inside the mouth, the position at which the vibration enters the vocal tract is not the glottis, and for the different surgical situations of different patients not only the vocal folds but also part of the vocal tract may be missing. All of this must be compensated for in the electronic larynx vibration source, so improvement at this fundamental level is necessary to raise the quality of electronic larynx speech.
In view of the above problems, it is necessary to provide a method for reconstructing electronic larynx speech, and a system thereof, that can solve these technical problems.

Summary of the invention
The technical problem to be solved by the present invention is to provide a method for reconstructing electronic larynx speech and a system thereof. The speech reconstructed by this method not only compensates for the acoustic characteristics of the missing vocal tract but also retains the user's individual characteristics, so that it is closer to the user's own voice and of better quality.
To achieve the above object, the present invention provides a method for reconstructing electronic larynx speech. Model parameters are first extracted from recorded speech to form a parameter library. A facial image of the speaker is then captured and passed to an image analysis and processing module, which, after analysis and processing, yields the phonation start and stop times and the vowel category being uttered. These control a voice-source synthesis module, which synthesizes the voice-source waveform. Finally the waveform is output by an electronic larynx vibration output module, which comprises a driver (front-end) circuit and an electronic larynx vibrator. The voice-source synthesis module operates as follows:
1) Synthesize the glottal voice-source waveform: glottal voice-source model parameters are selected from the parameter library according to the individual characteristics of the user's voice, and the phonation start and stop times control the start and end of voice-source synthesis. The glottal voice source is synthesized with the LF model, expressed mathematically as follows:
u(t) = E_0 e^{αt} sin(ω_g t),  0 ≤ t ≤ t_e
u(t) = −(E_e / (ε t_a)) [e^{−ε(t − t_e)} − e^{−ε(t_c − t_e)}],  t_e < t ≤ t_c
In the above expressions, E_e is the amplitude parameter, and t_p, t_e, t_a and t_c are time parameters representing, respectively, the instant of maximum airflow, the instant of the maximum negative peak, the time constant of the exponential return phase, and the fundamental period. The remaining quantities are obtained jointly from these five parameters according to the following relations:
ε t_a = 1 − e^{−ε(t_c − t_e)}
ω_g = π / t_p
E_e = −E_0 e^{α t_e} sin(ω_g t_e)
U_e0 = E_0 [ e^{α t_e} (α sin(ω_g t_e) − ω_g cos(ω_g t_e)) + ω_g ] / (α² + ω_g²)

These relations, together with the requirement that the glottal flow return to zero at the end of the period, determine α, ε and ω_g for a given set of the five parameters.
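As a concrete reading of this synthesis step, the sketch below (Python with NumPy) generates one glottal cycle from the five parameters named above. It is a minimal illustration of the standard LF model rather than the patent's own implementation; the numerical way ε and α are solved, and the sample parameter values at the end, are assumptions made for the example.

```python
import numpy as np

def lf_glottal_cycle(Ee, tp, te, ta, tc, fs=44100):
    """One cycle of the LF glottal flow-derivative u(t) from the five LF parameters."""
    wg = np.pi / tp                               # open phase: sin(wg*t) peaks at t = tp

    # Return-phase constant eps from  eps*ta = 1 - exp(-eps*(tc - te))  (bisection).
    g = lambda e: e * ta - (1.0 - np.exp(-e * (tc - te)))
    lo, hi = 1e-9, 1.0
    while g(hi) < 0.0:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) < 0.0 else (lo, mid)
    eps = 0.5 * (lo + hi)

    t = np.arange(0.0, tc, 1.0 / fs)
    open_ph = t <= te

    def waveform(alpha):
        # E0 is scaled so the open phase reaches the negative peak -Ee exactly at t = te.
        E0 = -Ee / (np.exp(alpha * te) * np.sin(wg * te))
        u = np.empty_like(t)
        u[open_ph] = E0 * np.exp(alpha * t[open_ph]) * np.sin(wg * t[open_ph])
        tr = t[~open_ph] - te
        u[~open_ph] = -Ee / (eps * ta) * (np.exp(-eps * tr) - np.exp(-eps * (tc - te)))
        return u

    # Growth constant alpha from the closure condition (zero net flow over one cycle);
    # a coarse grid search keeps this sketch simple and robust.
    alphas = np.linspace(0.0, 10.0 / te, 2001)
    balance = [abs(waveform(a).mean()) for a in alphas]
    return waveform(alphas[int(np.argmin(balance))])

# Purely illustrative parameter values (roughly a 110 Hz voice), not taken from the patent.
tc = 1.0 / 110.0
u = lf_glottal_cycle(Ee=1.0, tp=0.45 * tc, te=0.60 * tc, ta=0.028 * tc, tc=tc)
```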
2) Select the shape parameters of the vocal tract according to the vowel category being uttered, simulate sound propagation in the vocal tract with a waveguide model, and compute the voice-source waveform according to the following equations:
< ι = (卜^; )",+ - ψΜ = - '; « + ",一+1 ) — 4 - 4+ι< ι = (卜^; )", + - ψ Μ = - '; « + ", a +1 ) — 4 - 4 + ι
"7 = (1 + )",— +1 + = u;+l + η (u + ) Ai + Al "7 = (1 + )", — +1 + = u; +l + η (u + ) A i + A l
l— r Λ 一、  L- r Λ
glottis  Glottis
[liPS:
Figure imgf000004_0004
R N ~~L 声道由多个均匀截面积的声管级联表示,上式中, 4和 4+,为第 ζ·个和第 ,+ι个声管的面积函数, 《,+和 「分别为第 个声管中的正向声压和反向声 压, 是第/个和第 /+1个声管相邻界面的反射系数。 作为本发明的优选实施例, 所述图像分析与处理模块包括如下步骤: 步骤一: 初始化参数, 即预设分析矩形框范围、 面积阔值和神经网络 权系数, 然后采集一帧视频图像, 其中面积阈值为分析矩形框面积的百分 之一;
[ li PS:
Figure imgf000004_0004
The R N ~~ L channel is represented by a plurality of sound tube cascades of uniform cross-sectional area. In the above formula, 4 and 4+ are the area functions of the first and the first, +ι sound tubes, ", + and "The positive sound pressure and the reverse sound pressure in the first sound tube are the reflection coefficients of the adjacent interface of the /th and +1th sound tubes, respectively. As a preferred embodiment of the present invention, the image analysis and processing module includes the following steps: Step 1: Initialize parameters, that is, preset analysis of a rectangular frame range, an area threshold, and a neural network weight coefficient, and then acquire a frame of video image, wherein The area threshold is one percent of the area of the analysis rectangle;
Step 2: Detect the lip region with a skin-colour-based method, i.e. compute a lip-colour feature value over the analysis rectangle in the YUV colour space according to the following formula and normalize it to 0-255 grey levels:

Z = 0.493R − 0.589G + 0.0265B

Step 3: Compute the optimal segmentation threshold of the lip-colour feature (grey-level) image with an improved maximum between-class variance method, then binarize the image with this threshold to obtain a preliminary lip segmentation image.

Step 4: Using the area-threshold method, remove regions of the preliminary segmentation whose area is smaller than the threshold as noise, giving the final lip segmentation image.

Step 5: Extract the outer contour and centre point of the lip region: with the major axis of the ellipse set at zero degrees to the X axis, match the outer lip contour with an elliptical model and obtain the lengths of the major and minor axes by a one-dimensional Hough transform.

Step 6: Using the normalized semi-major axis, the normalized semi-minor axis, the major/minor axis ratio and the normalized lip area as one group of parameters, compute the phonation start and stop times and the vowel category, where the normalized semi-major axis, normalized semi-minor axis and normalized lip area are all values normalized by the static semi-major axis, semi-minor axis and lip area measured when no sound is being produced.
As another preferred embodiment of the present invention, in Step 6 of the image analysis and processing module an artificial neural network algorithm is used to compute the phonation start and stop times and the vowel category.

As another preferred embodiment of the present invention, the artificial neural network is a three-layer network comprising an input layer, a hidden layer and an output layer. The input layer has four inputs, namely the normalized semi-major axis, the normalized semi-minor axis, the major/minor axis ratio and the normalized lip area; the output layer has six outputs, namely no phonation plus five vowel categories.
As another preferred embodiment of the present invention, during voice-source synthesis the sound-pressure waveform at the lower pharyngeal part of the vocal tract is used as the voice-source waveform applied at the neck.

As another preferred embodiment of the present invention, during voice-source synthesis the sound-pressure waveform at the oral-cavity position is used as the voice-source waveform applied inside the mouth.
To achieve the above object, the present invention also provides an electronic larynx speech system comprising a CMOS image sensor, an FPGA chip connected to the output of the CMOS image sensor, a voice chip connected to the output of the FPGA chip, and an electronic larynx vibrator connected to the output of the voice chip.
The method for reconstructing electronic larynx speech and the system thereof have at least the following advantages. First, in the glottal-source LF model of the voice-source synthesis module, the glottal waveform is characterized jointly by the amplitude parameter E_e and the four time parameters t_p, t_e, t_a and t_c, and these five parameters can be extracted from speech; for a given user they can therefore be extracted from speech retained before the loss of voice and used as synthesis parameters, so the reconstructed speech carries the user's individual characteristics. Second, in the vocal-tract waveguide model of the voice-source synthesis module, the vocal-tract shape parameters are selected according to the vowel category determined from the video signal, and a suitable vibrator application position is chosen according to the surgical resection of the user's throat; the sound-pressure waveform at the corresponding vocal-tract position is then synthesized as the electronic larynx voice-source waveform for that application site. This not only matches the user's actual situation but also largely preserves the user's individual characteristics, making the reconstructed speech closer to the user's own original voice and improving the quality of the reconstructed speech.

Brief description of the drawings
FIG. 1 is a schematic flow chart of the electronic larynx speech reconstruction method of the present invention;

FIG. 2 is a flow chart of the lip-motion image processing and control-parameter extraction routine of the present invention; FIG. 3 is a flow chart of voice-source synthesis in the present invention;

FIG. 4 shows electronic larynx voice-source waveforms synthesized for different phonation and use cases of the present invention; FIG. 5 is a schematic diagram of the electronic larynx vibration output module of the present invention;

FIG. 6 is a structural block diagram of an electronic larynx speech system of the present invention.

Best mode for carrying out the invention
The method for reconstructing electronic larynx speech and the system thereof are described in detail below with reference to the accompanying drawings. The invention uses a computer system as its platform; the synthesis of the voice-source waveform is adjusted according to the user's specific loss of voice and individual phonation characteristics, while the video signal is used to control voice-source synthesis in real time. Finally the voice-source waveform is output through an electronic larynx vibration output module connected via the parallel port.
The system for the electronic larynx speech reconstruction method of the present invention comprises an image acquisition device, an image processing and analysis module connected to the output of the image acquisition device, a voice-source synthesis module connected to the output of the image processing and analysis module, and an electronic larynx vibration output module connected to the output of the voice-source synthesis module.
Referring to FIG. 1, when the system starts, the image acquisition device (a camera) captures facial images of the user during phonation and transmits them to the image processing and analysis module. That module processes and analyses the data: lip detection, segmentation, edge extraction and fitting yield the shape parameters of the elliptical model of the lip edge, after which an artificial neural network algorithm determines the phonation start and stop times and the vowel category, which serve as control signals for voice-source synthesis. The voice-source synthesis module uses the principle of speech synthesis and, according to each user's situation (the surgical condition, the individual phonation characteristics, and the extracted phonation timing and vowel category), synthesizes a voice-source waveform that carries the user's individual characteristics and meets the actual phonation needs. Finally, the synthesized voice-source waveform is output through the electronic larynx vibration output module.
As described above, the electronic larynx speech reconstruction method of the present invention comprises three main parts: first, image acquisition and processing; second, synthesis of the electronic larynx voice source; and third, the vibration output of the electronic larynx. These are described in detail below.
The first part of the invention is image acquisition and processing. Image-processing methods are used to track the mouth state and to derive the control signals that drive the dynamic synthesis of the electronic larynx voice source.

The specific implementation steps of the first part are described in detail below with reference to FIG. 2:
1) Initialize the parameters, i.e. preset the analysis rectangle, the area threshold and the neural-network weight coefficients, and then acquire one frame of video; the area threshold is one percent of the area of the analysis rectangle.

2) Detect the lip region with a skin-colour-based method, i.e. compute the lip-colour feature value over the analysis rectangle in the YUV colour space according to formula (1) below, which enhances the contrast of the lip region, and normalize it to 0-255 grey levels to obtain a lip-colour feature grey-level image:

Z = 0.493R − 0.589G + 0.0265B    (1)

In formula (1), R, G and B are the red, green and blue components, respectively.

3) Compute the optimal segmentation threshold of the lip-colour feature image with an improved maximum between-class variance (Otsu) method, then binarize the image with this threshold to obtain a preliminary lip segmentation image.

4) Using the area-threshold method, remove regions of the preliminary segmentation whose area is smaller than the threshold as noise, giving the final lip segmentation image.
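A compact sketch of steps 2 to 4 is given below, using OpenCV and NumPy. Plain Otsu thresholding and 8-connected component filtering stand in for the "improved maximum between-class variance" step and its noise removal; the function and argument names are illustrative, not taken from the patent.

```python
import cv2
import numpy as np

def segment_lips(frame_bgr, roi, min_area):
    """Lip segmentation inside the analysis rectangle (steps 2 to 4 above).

    frame_bgr : video frame as an OpenCV BGR image
    roi       : (x, y, w, h) analysis rectangle around the mouth
    min_area  : area threshold (about one percent of the rectangle area in the text)
    """
    x, y, w, h = roi
    patch = frame_bgr[y:y + h, x:x + w].astype(np.float32)
    b, g, r = patch[..., 0], patch[..., 1], patch[..., 2]

    # Step 2: lip-colour feature Z = 0.493 R - 0.589 G + 0.0265 B, rescaled to 0-255.
    z = 0.493 * r - 0.589 * g + 0.0265 * b
    z = cv2.normalize(z, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

    # Step 3: Otsu threshold as a stand-in for the improved between-class variance method.
    _, binary = cv2.threshold(z, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Step 4: drop connected regions smaller than the area threshold as noise.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    mask = np.zeros_like(binary)
    for i in range(1, n):                          # label 0 is the background
        if stats[i, cv2.CC_STAT_AREA] >= min_area:
            mask[labels == i] = 255
    return mask
```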
5) Extract the outer contour and centre point of the lip region: assuming the major axis of the ellipse is at zero degrees to the X axis, fit the outer lip contour with an elliptical model and obtain the lengths of the major and minor axes by a one-dimensional Hough transform.

6) Take the normalized semi-major axis, the normalized semi-minor axis, the major/minor axis ratio and the normalized lip area as one group of parameters; the artificial neural network then yields the phonation start and stop times and the vowel category, which guide and control voice-source synthesis.

It should be noted that in the present invention the normalized semi-major axis, the normalized semi-minor axis and the normalized lip area are all values normalized by the static semi-major axis, semi-minor axis and lip area measured when no sound is being produced.
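The four control parameters can then be formed as in the small helper below; the resting-state values are measured once from a silent frame, as the text specifies. The function is hypothetical and only illustrates the normalization.

```python
def lip_parameters(a, b, lip_area, a_rest, b_rest, area_rest):
    """Four neural-network inputs from the fitted ellipse and the lip mask.

    a, b     : semi-major and semi-minor axes of the fitted ellipse (pixels)
    lip_area : pixel area of the segmented lip region in the current frame
    a_rest, b_rest, area_rest : the same quantities from the silent (resting) frame
    """
    return (a / a_rest,              # normalized semi-major axis
            b / b_rest,              # normalized semi-minor axis
            a / b,                   # major/minor axis ratio
            lip_area / area_rest)    # normalized lip area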
In this embodiment the axis ratio and the normalized parameters are used as the inputs of the neural network because they not only reflect changes of mouth shape accurately but also allow the phonation start and stop times and the vowel category to be determined, and they are largely invariant to distance; this overcomes judgment errors caused by changes in the apparent lip size in the image as the distance between the user and the camera varies. The resulting decision signal therefore agrees well with the speech waveform and the judgment accuracy is high.

In addition, to meet real-time requirements, the image processing of the present invention uses a joint spatio-temporal tracking control method in both lip segmentation and ellipse-model parameter matching: based on the assumption that the face changes slowly and continuously during speech, the segmented region and the matched ellipse parameters of the previous frame guide the segmentation rectangle and the parameter search range of the current frame. Intra-frame and inter-frame information is thus exploited, which improves both the processing speed and the calculation accuracy.
The artificial neural network in the present invention is a three-layer feed-forward network comprising an input layer (the normalized semi-major axis, the normalized semi-minor axis, the major/minor axis ratio and the normalized lip area), a hidden layer (thirty nodes) and an output layer (no phonation plus five vowel categories). The node weight coefficients of the network are obtained beforehand by training on samples with the error back-propagation (BP) algorithm; the samples are the lip-shape parameters in the silent static state and while each vowel is being produced.
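A forward pass of the 4-30-6 network described here could look like the sketch below; the sigmoid activation and the weight shapes are assumptions, and the weights themselves would come from the offline BP training mentioned in the text.

```python
import numpy as np

def classify_mouth_state(params, w_hid, b_hid, w_out, b_out):
    """Map the four lip parameters to one of six classes (0 = silence, 1..5 = vowels).

    w_hid : (30, 4) hidden-layer weights     b_hid : (30,) hidden-layer biases
    w_out : (6, 30) output-layer weights     b_out : (6,)  output-layer biases
    All weights are assumed to be pre-trained with error back-propagation.
    """
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    x = np.asarray(params, dtype=float)       # normalized axes, axis ratio, normalized area
    h = sigmoid(w_hid @ x + b_hid)            # 30 hidden nodes
    y = sigmoid(w_out @ h + b_out)            # 6 outputs
    return int(np.argmax(y))
```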
Referring now to FIG. 3, the second part of the present invention is the synthesis of the voice source. Using the principle of speech synthesis, the electronic larynx voice source is synthesized in two source-filter steps, as follows.

Step 1: Synthesize the glottal voice-source waveform. The glottal-source model parameters are selected and set in the parameter library according to the individual characteristics of the user's voice, the phonation start and stop times obtained by the image acquisition and processing module control the start and end of voice-source synthesis, and the glottal voice source is synthesized according to the LF model, expressed mathematically as:

u(t) = E_0 e^{αt} sin(ω_g t),  0 ≤ t ≤ t_e
u(t) = −(E_e / (ε t_a)) [e^{−ε(t − t_e)} − e^{−ε(t_c − t_e)}],  t_e < t ≤ t_c

In these expressions E_e is the amplitude parameter and t_p, t_e, t_a and t_c are time parameters representing, respectively, the instant of maximum airflow, the instant of the maximum negative peak, the time constant of the exponential return phase, and the fundamental period; the remaining quantities are obtained from these five parameters through the relations given earlier.
Step 2: Select the vocal-tract shape parameters according to the vowel category that has been determined, simulate sound propagation in the vocal tract with the waveguide model, and compute, according to the following equations, the sound-pressure waveform at the position where the vibration actually enters the vocal tract in use; this is the synthesized electronic larynx voice source.

The waveguide model describing sound propagation in the vocal tract is expressed mathematically as:

u⁺_{i+1} = (1 − r_i) u⁺_i − r_i u⁻_{i+1}
u⁻_i = r_i u⁺_i + (1 + r_i) u⁻_{i+1}
r_i = (A_i − A_{i+1}) / (A_i + A_{i+1})

with the glottal voice source injected at the glottis end of the cascade and, at the lips, an output u_out = (1 − r_N) u⁺_N with r_N ≈ −1.

Here the vocal tract is formed by cascading a number of acoustic tubes of uniform cross-sectional area, represented by the area function A_i; u⁺_i and u⁻_i are the forward and backward sound-pressure waves in the i-th tube, and r_i is the reflection coefficient at the interface between the i-th and (i+1)-th tubes, determined by the cross-sectional areas A_i and A_{i+1} of the adjacent tubes. By iterating these equations the waveguide model can compute the sound pressure at any position along the vocal tract.
It should be noted that, first, in the LF model of the voice-source synthesis module the glottal voice-source waveform is determined jointly by the amplitude parameter E_e and the four time parameters t_p, t_e, t_a and t_c. Because anatomy and phonation habits differ, the glottal voice-source waveform differs from person to person, and these individual differences are reflected in the five LF parameters, all of which can be extracted from speech. For example, the fundamental frequency of female voices is generally higher than that of male voices, so t_c is smaller for women than for men. In the present invention, in order to fully preserve the user's voice characteristics and reconstruct speech like that produced before the loss of voice, these five parameters are extracted from speech recorded before the patient lost the voice and stored in the parameter library; when the electronic larynx is used, the parameters are simply retrieved from the library to reconstruct speech with the user's own phonation characteristics. A patient whose speech was not recorded before the loss of voice can instead choose parameters corresponding to voice characteristics he or she prefers and reconstruct a preferred voice.
Second, in the waveguide model of the voice-source synthesis module the only free parameter is the vocal-tract area function A_i. Different speakers, and the same speaker producing different sounds, have different vocal-tract shapes, so the present invention uses a vowel-category control method in which different vocal-tract area functions are selected for synthesis according to the vowel being uttered. For each user a template library mapping vowels to vocal-tract area functions is first established: the vocal-tract response function is obtained from speech recorded by the user by an inverse method, and the best-matching vocal-tract area function is then derived from that response function, so that the user's individual phonation characteristics are preserved. During synthesis it is only necessary to look up the vocal-tract function corresponding to the determined vowel category.
It follows from the above that the two-step synthesis can compute the sound-pressure signal at any position in the vocal tract; which position is taken as the electronic larynx voice source, however, must be decided according to the user's specific surgical situation and manner of use.
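The iteration can be pictured with the short simulation below, which passes a source waveform through the tube cascade and records the wave in every section so that the pressure at any position (lower pharynx, oral cavity, and so on) can be read out. The glottal-end reflection, the lip reflection coefficient and the one-sample travel time per section are simplifying assumptions of this sketch, not values from the patent.

```python
import numpy as np

def vocal_tract_sections(source, areas, r_glottis=0.7, r_lips=-0.9):
    """Propagate a voice-source waveform through a cascade of uniform tube sections.

    source : samples injected at the glottal end
    areas  : cross-sectional areas A_1..A_N of the tube sections
    Returns (per_section, radiated): per_section[n, i] is the wave observed in
    section i at sample n; radiated is the waveform leaving the lips.
    """
    A = np.asarray(areas, dtype=float)
    N = len(A)
    r = (A[:-1] - A[1:]) / (A[:-1] + A[1:])        # junction reflection coefficients

    fwd = np.zeros(N)                              # forward wave at the right end of each section
    bwd = np.zeros(N)                              # backward wave at the left end of each section
    per_section = np.zeros((len(source), N))
    radiated = np.zeros(len(source))

    for n, x in enumerate(source):
        enter_left = np.zeros(N)                   # waves entering each section from its left end
        enter_right = np.zeros(N)                  # waves entering each section from its right end

        enter_left[0] = x + r_glottis * bwd[0]     # drive the glottal end (reflection assumed)
        for i in range(N - 1):                     # scattering at each internal junction
            enter_left[i + 1] = (1.0 - r[i]) * fwd[i] - r[i] * bwd[i + 1]
            enter_right[i] = r[i] * fwd[i] + (1.0 + r[i]) * bwd[i + 1]
        enter_right[N - 1] = r_lips * fwd[N - 1]   # partial reflection at the lips
        radiated[n] = (1.0 - r_lips) * fwd[N - 1]

        fwd, bwd = enter_left, enter_right         # one-sample travel through each section
        per_section[n] = fwd + bwd                 # total wave observed in each section

    return per_section, radiated
```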
Referring to FIG. 4, which shows voice-source waveforms synthesized for different situations: for a user who has undergone laryngectomy for laryngeal cancer but whose vocal tract is largely intact, vibration can be applied at the neck so as to make full use of the remaining vocal tract, and the sound-pressure waveform at the lower pharyngeal part of the vocal tract is therefore selected as the electronic larynx voice-source waveform; FIG. 4(a) and FIG. 4(c) show the voice-source waveforms synthesized in this case for two different vowels. For a patient with pharyngeal cancer a pharyngectomy is required, so that not only the vocal folds but also a large part of the vocal tract is lost; in this case the sound-pressure waveform at the oral-cavity position must be selected as the voice-source waveform, and FIG. 4(b) and FIG. 4(d) show the waveforms synthesized in this case for the same two vowels.
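Tying the pieces together, a single control cycle might look like the sketch below. The vowel-to-area-function template library, the section indices chosen for the neck and oral read-out positions, and the resting area function are all illustrative assumptions; the patent derives the templates from the user's own recordings and chooses the position from the surgical situation.

```python
import numpy as np

def synthesize_voice_source(lip_params, net, templates, rest_areas, lf_params,
                            placement="neck", neck_section=2, oral_section=16):
    """One control cycle: classify the mouth state, pick the tract shape, synthesize."""
    vowel = classify_mouth_state(lip_params, *net)        # 0 = silence, 1..5 = vowels
    if vowel == 0:
        return np.zeros(0)                                # no phonation, nothing to output
    areas = templates.get(vowel, rest_areas)              # per-user vowel -> area function
    glottal = lf_glottal_cycle(**lf_params)               # one LF glottal cycle
    per_section, _ = vocal_tract_sections(glottal, areas)
    section = neck_section if placement == "neck" else oral_section
    return per_section[:, section]                        # waveform used to drive the vibrator
```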
As FIG. 4 shows, the present invention synthesizes different electronic larynx voice-source waveforms for different surgical situations, use cases and phonation categories; this not only meets the needs of actual use but also preserves the user's individual characteristics, and it improves the quality of electronic larynx reconstructed speech to a large extent.
Referring to Fig. 5, the third module of the present invention is the vibration output module of the electrolarynx, comprising an electrolarynx vibrator and a front-end circuit for the vibrator. The computer sends the synthesized electrolarynx voice source waveform signal to the front-end circuit through the LPT parallel port; after digital-to-analog conversion and power amplification, an analog voltage signal is output from the audio connector, and finally the electrolarynx vibrator vibrates to output the voice source.
The electrolarynx vibrator is a linear transducer, i.e. it converts the voltage signal linearly into mechanical vibration, so its vibration output follows the synthesized voice source; to meet the needs of intra-oral application, a sound-guide tube is used to lead the vibration into the oral cavity.
Still referring to Fig. 5, the front-end circuit of the electrolarynx vibrator consists of input/output interfaces, a D/A converter, a power amplifier and a power supply controller. The input/output interfaces are a 25-pin digital parallel input port and a 3.5 mm analog audio output jack; the digital parallel port is connected to the parallel-port output of the computer with a transfer rate of 44100 bytes/s, and the analog audio output is connected to the electrolarynx vibrator. The D/A converter is a DAC0832 with 8-bit data precision, whose inputs connect directly to the data bits of the LPT parallel port. The power amplifier is a TI TPA701 audio power amplifier, powered from +3.5 V to +5.5 V, with an output power of up to 700 mW. The power supply controller is a 5 V battery that provides +5 V DC to the chips.
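For illustration, the digital side of this front-end can be sketched in Python as follows; write_byte is a hypothetical stand-in for whatever driver actually toggles the parallel-port data lines, and the software-timed pacing is only a stand-in for the real 44100 byte/s transfer.

import time
import numpy as np

def to_dac_codes(signal):
    # Map a float waveform in [-1, 1] to unsigned 8-bit codes suitable for an
    # 8-bit parallel DAC stage such as the DAC0832 described above.
    clipped = np.clip(np.asarray(signal, dtype=float), -1.0, 1.0)
    return np.round((clipped + 1.0) * 127.5).astype(np.uint8)

def stream_codes(codes, write_byte, rate=44100):
    # Push one code per sample period through the caller-supplied write_byte
    # callback; real timing would be handled by hardware or a driver.
    period = 1.0 / rate
    t0 = time.perf_counter()
    for k, code in enumerate(codes):
        write_byte(int(code))
        while time.perf_counter() - t0 < (k + 1) * period:
            pass   # busy-wait until the next sample instant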
In the embodiment above, the electrolarynx speech system is implemented on the basis of a video capture device, a computer and the electrolarynx vibration output module. For ease of implementation, however, another embodiment may be adopted, as shown in Fig. 6. In this embodiment the electrolarynx speech system comprises a CMOS image sensor for capturing images; an FPGA chip connected to the output of the CMOS image sensor for analyzing and processing the captured images and synthesizing the voice source; a voice chip connected to the output of the FPGA chip for performing D/A conversion and power amplification of the synthesized electrolarynx voice source waveform; and an electrolarynx vibrator connected to the output of the voice chip.
The CMOS image sensor is a MICRON MT9M011 with a maximum resolution of 640 x 480 and a frame rate of 60 frames/s at that resolution; it is used to capture images of the user's face during phonation.
The FPGA chip supports SOPC technology and implements the function of taking video data as input, performing video data processing and analysis and electrolarynx voice source synthesis, and finally outputting the electrolarynx voice source waveform data. Besides the interfaces to the CMOS image sensor and the voice chip, the FPGA chip is also connected to an LCD, a FLASH memory and an SDRAM, where the LCD is a liquid-crystal display used to show relevant data, FLASH is flash memory, and SDRAM is synchronous dynamic random-access memory.
The voice chip is an AIC23, which integrates the D/A converter and the power amplification function; after D/A conversion and power amplification, the signal is output from the audio interface to the electrolarynx vibrator.
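Functionally, the embedded embodiment runs the same per-frame loop as the computer-based one. The following pseudocode-style Python sketch shows the control flow only; grab_frame, analyze_lips, classify, synthesize_chunk and output_chunk are hypothetical stand-ins for the image-sensor interface, the image analysis, the vowel classifier, the voice source synthesis and the voice-chip output described above.

def electrolarynx_loop(grab_frame, analyze_lips, classify, synthesize_chunk, output_chunk):
    # Per-frame control flow: camera -> lip features -> vowel/onset decision ->
    # voice-source chunk -> vibrator output. One chunk of source waveform is
    # produced per video frame (about 1/60 s at 60 frames/s).
    voicing = False
    while True:
        frame = grab_frame()                 # image from the CMOS sensor
        features = analyze_lips(frame)       # normalized axes, axis ratio, lip area
        label = classify(features)           # "silence" or a vowel category
        if label == "silence":
            voicing = False                  # voicing offset
            continue
        if not voicing:
            voicing = True                   # voicing onset detected
        chunk = synthesize_chunk(label)      # LF + waveguide synthesis for this vowel
        output_chunk(chunk)                  # D/A conversion, amplification, vibrator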
The above is only one embodiment of the present invention and is not the whole or the only embodiment; any equivalent variation of the technical solution of the present invention made by a person of ordinary skill in the art after reading the present specification falls within the scope of the claims of the present invention.

Claims

1. A method for reconstructing electrolarynx speech, in which model parameters are first extracted from collected speech to form a parameter library; a facial image of the speaker is then captured and transmitted to an image analysis and processing module; after the image analysis and processing module has analyzed and processed the image, the voicing onset/offset times and the vowel category being produced are obtained; the voicing onset/offset times and the vowel category then control a voice source synthesis module, which synthesizes the voice source waveform; finally, the voice source waveform is output through an electrolarynx vibration output module comprising a front-end circuit and an electrolarynx vibrator; characterized in that the voice source synthesis module performs synthesis in the following steps:
1) synthesizing the glottal voice source waveform: glottal voice source model parameters are selected from the parameter library according to the individual phonation characteristics of the user, the voicing onset/offset times controlling the start and end of the voice source synthesis; the glottal voice source is synthesized with the LF model, expressed mathematically as:
u_g'(t) = E_0 · e^(α·t) · sin(ω_g·t),  for 0 ≤ t ≤ t_e

u_g'(t) = −(E_e / (ε·t_a)) · [e^(−ε·(t − t_e)) − e^(−ε·(t_c − t_e))],  for t_e ≤ t ≤ t_c

where E_e is the amplitude parameter and t_p, t_e, t_a, t_c are time parameters denoting, respectively, the instant of the maximum airflow peak, the instant of the maximum negative peak, the time constant of the exponential return phase, and the fundamental period; the remaining constants of the model (E_0, α, ε and ω_g) are derived from these parameters (original formula image not reproduced here);
2) selecting the vocal-tract shape parameters according to the vowel category being produced, and simulating sound propagation in the vocal tract with a waveguide model, the voice source waveform being computed from the following relations:

at the junction between the i-th and (i+1)-th tube sections:
u_{i+1}^+ = (1 − r_i) · u_i^+ − r_i · u_{i+1}^−
u_i^− = (1 + r_i) · u_{i+1}^− + r_i · u_i^+
r_i = (A_i − A_{i+1}) / (A_i + A_{i+1})

at the glottis: u_1^+ = u_g − r_g · u_1^−
at the lips: u_out = (1 − r_L) · u_N^+

where the vocal tract is represented by a cascade of sound tubes of uniform cross-sectional area; A_i and A_{i+1} are the area functions of the i-th and (i+1)-th sound tubes, u_i^+ and u_i^− are respectively the forward and backward sound pressures in the i-th sound tube, r_i is the reflection coefficient at the interface between the i-th and (i+1)-th sound tubes, u_g is the glottal source obtained in step 1), and r_g and r_L are the reflection coefficients at the glottal and lip ends.
2. The method for reconstructing electrolarynx speech according to claim 1, characterized in that the image analysis and processing module comprises the following steps:
Step 1: initializing parameters: presetting the range of the analysis rectangle, the area threshold and the neural-network weight coefficients, and then capturing one frame of the video image, the area threshold being a percentage of the area of the analysis rectangle;
Step 2: detecting the lip region with a skin-color-based detection method, i.e. computing the lip-color feature value over the analysis rectangle in the YUV color space according to the following formula and normalizing it to 0-255 gray levels:
Z = 0.493R − 0.589G + 0.026B
Step 3: computing the optimal segmentation threshold of the lip-color feature gray image with an improved maximum between-class variance (Otsu) method, and then binarizing the image with this threshold to obtain a preliminary lip segmentation image;
Step 4: removing, as noise, the regions of the preliminary segmentation image whose area is smaller than the area threshold, to obtain the final lip segmentation image;
Step 5: extracting the outer contour and center point of the lip region: the major axis of an ellipse is set at a zero-degree angle to the x-axis, the outer lip contour is matched with the ellipse model, and the lengths of the major and minor axes of the ellipse are obtained by one-dimensional Hough transform detection;
Step 6: taking the normalized semi-major axis, the normalized semi-minor axis, the major-to-minor-axis ratio and the normalized lip area as a set of parameters, and computing the voicing onset/offset times and the vowel category, wherein the normalized semi-major axis, the normalized semi-minor axis and the normalized lip area are values normalized with respect to the static semi-major axis, semi-minor axis and lip area measured when no sound is being produced.
3. The method for reconstructing electrolarynx speech according to claim 2, characterized in that in step 6 of the image analysis and processing module, an artificial neural network algorithm is used to compute the voicing onset/offset times and the vowel category.
4. The method for reconstructing electrolarynx speech according to claim 3, characterized in that the artificial neural network is a three-layer network comprising an input layer, a hidden layer and an output layer, wherein the input layer has four inputs, namely the normalized semi-major axis, the normalized semi-minor axis, the major-to-minor-axis ratio and the normalized lip area, and the output layer has six outputs, namely no voicing and five vowel categories.
5. The method for reconstructing electrolarynx speech according to claim 1 or 4, characterized in that in the voice source synthesis, the sound pressure waveform in the lower pharyngeal part of the vocal tract is used as the voice source waveform applied at the neck.
6. The method for reconstructing electrolarynx speech according to claim 1 or 4, characterized in that in the voice source synthesis, the sound pressure waveform at the oral-cavity position is used as the voice source waveform applied inside the mouth.
7. An electrolarynx speech system applying the method according to claim 1, characterized in that it comprises a CMOS image sensor, an FPGA chip connected to the output of the CMOS image sensor, a voice chip connected to the output of the FPGA chip, and an electrolarynx vibrator connected to the output of the voice chip.
PCT/CN2010/001022 2010-07-09 2010-07-09 Method for reconstructing electronic larynx speech and system thereof WO2012003602A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2010/001022 WO2012003602A1 (en) 2010-07-09 2010-07-09 Method for reconstructing electronic larynx speech and system thereof
US13/603,226 US8650027B2 (en) 2010-07-09 2012-09-04 Electrolaryngeal speech reconstruction method and system thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2010/001022 WO2012003602A1 (en) 2010-07-09 2010-07-09 Method for reconstructing electronic larynx speech and system thereof

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US13/603,226 Continuation US8650027B2 (en) 2010-07-09 2012-09-04 Electrolaryngeal speech reconstruction method and system thereof

Publications (1)

Publication Number Publication Date
WO2012003602A1 true WO2012003602A1 (en) 2012-01-12

Family

ID=45440743

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2010/001022 WO2012003602A1 (en) 2010-07-09 2010-07-09 Method for reconstructing electronic larynx speech and system thereof

Country Status (2)

Country Link
US (1) US8650027B2 (en)
WO (1) WO2012003602A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101475894B1 (en) * 2013-06-21 2014-12-23 서울대학교산학협력단 Method and apparatus for improving disordered voice
CN104835492A (en) * 2015-04-03 2015-08-12 西安交通大学 Electronic larynx fricative reconstruction method
CN108831472B (en) * 2018-06-27 2022-03-11 中山大学肿瘤防治中心 Artificial intelligent sounding system and sounding method based on lip language recognition
CN111192568B (en) * 2018-11-15 2022-12-13 华为技术有限公司 Speech synthesis method and speech synthesis device
US11757469B2 (en) * 2021-04-01 2023-09-12 Qualcomm Incorporated Compression technique for deep neural network weights
WO2023007509A1 (en) * 2021-07-27 2023-02-02 Indian Institute Of Technology Bombay Method and system for time-scaled audiovisual feedback of speech production efforts

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1776809A (en) * 2005-10-17 2006-05-24 西安交通大学 Method and system for reinforcing electronic guttural sound
CN101030384A (en) * 2007-03-27 2007-09-05 西安交通大学 Electronic throat speech reinforcing system and its controlling method
CN101474104A (en) * 2009-01-14 2009-07-08 西安交通大学 Self-adjusting pharyngeal cavity electronic larynx voice communication system and method
WO2010004397A1 (en) * 2008-07-11 2010-01-14 University Of Witwatersrand, Johannesburg An artificial larynx

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
SU681447A1 (en) * 1975-04-15 1979-08-25 Институт математики СО АН СССР Speech imitator
US4292472A (en) * 1979-08-29 1981-09-29 Lennox Thomas M Electronic artificial larynx
US4550427A (en) * 1981-03-30 1985-10-29 Thomas Jefferson University Artificial larynx
US4571739A (en) * 1981-11-06 1986-02-18 Resnick Joseph A Interoral Electrolarynx
US4672673A (en) * 1982-11-01 1987-06-09 Thomas Jefferson University Artificial larynx
US4547894A (en) * 1983-11-01 1985-10-15 Xomed, Inc. Replaceable battery pack for intra-oral larynx
US4821326A (en) * 1987-11-16 1989-04-11 Macrowave Technology Corporation Non-audible speech generation method and apparatus
FR2632725B1 (en) * 1988-06-14 1990-09-28 Centre Nat Rech Scient METHOD AND DEVICE FOR ANALYSIS, SYNTHESIS, SPEECH CODING
US5326349A (en) * 1992-07-09 1994-07-05 Baraff David R Artificial larynx
US6006175A (en) * 1996-02-06 1999-12-21 The Regents Of The University Of California Methods and apparatus for non-acoustic speech characterization and recognition
US6377919B1 (en) * 1996-02-06 2002-04-23 The Regents Of The University Of California System and method for characterizing voiced excitations of speech and acoustic signals, removing acoustic noise from speech, and synthesizing speech
FR2786908B1 (en) * 1998-12-04 2001-06-08 Thomson Csf PROCESS AND DEVICE FOR THE PROCESSING OF SOUNDS FOR THE HEARING DISEASE
US7676372B1 (en) * 1999-02-16 2010-03-09 Yugen Kaisha Gm&M Prosthetic hearing device that transforms a detected speech into a speech of a speech form assistive in understanding the semantic meaning in the detected speech
US6487531B1 (en) * 1999-07-06 2002-11-26 Carol A. Tosaya Signal injection coupling into the human vocal tract for robust audible and inaudible voice recognition
US7212639B1 (en) * 1999-12-30 2007-05-01 The Charles Stark Draper Laboratory Electro-larynx
US6856952B2 (en) * 2001-02-28 2005-02-15 Intel Corporation Detecting a characteristic of a resonating cavity responsible for speech
WO2002077972A1 (en) * 2001-03-27 2002-10-03 Rast Associates, Llc Head-worn, trimodal device to increase transcription accuracy in a voice recognition system and to process unvocalized speech
US20050281412A1 (en) * 2004-06-16 2005-12-22 Hillman Robert E Voice prosthesis with neural interface
US20110051944A1 (en) * 2009-09-01 2011-03-03 Lois Margaret Kirkpatrick Laryngophone combined with transmitter and power source

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1776809A (en) * 2005-10-17 2006-05-24 西安交通大学 Method and system for reinforcing electronic guttural sound
CN101030384A (en) * 2007-03-27 2007-09-05 西安交通大学 Electronic throat speech reinforcing system and its controlling method
WO2010004397A1 (en) * 2008-07-11 2010-01-14 University Of Witwatersrand, Johannesburg An artificial larynx
CN101474104A (en) * 2009-01-14 2009-07-08 西安交通大学 Self-adjusting pharyngeal cavity electronic larynx voice communication system and method

Also Published As

Publication number Publication date
US8650027B2 (en) 2014-02-11
US20130035940A1 (en) 2013-02-07

Similar Documents

Publication Publication Date Title
WO2012003602A1 (en) Method for reconstructing electronic larynx speech and system thereof
CN101916566B (en) Electronic larynx speech reconstructing method and system thereof
Story et al. Vocal tract area functions for an adult female speaker based on volumetric imaging
Zañartu et al. Subglottal impedance-based inverse filtering of voiced sounds using neck surface acceleration
Abushakra et al. Acoustic signal classification of breathing movements to virtually aid breath regulation
WO2019034184A1 (en) Method and system for articulation evaluation by fusing acoustic features and articulatory movement features
US20230154450A1 (en) Voice grafting using machine learning
CN106264839A (en) Intelligent snore stopping pillow
Lulich et al. The relation between tongue shape and pitch in clarinet playing using ultrasound measurements
WO2012112985A2 (en) System and methods for evaluating vocal function using an impedance-based inverse filtering of neck surface acceleration
Acker Vocal tract adjustments for the projected voice
Dang et al. A study on transvelar coupling for non-nasalized sounds
Al Kork et al. A multi-sensor helmet to capture rare singing, an intangible cultural heritage study
CN215349053U (en) Congenital heart disease intelligent screening robot
Zhang et al. Morphological characteristics of male and female hypopharynx: A magnetic resonance imaging-based study
Colton et al. Measuring vocal fold function
Apostol et al. 3D geometry of the vocal tract and interspeaker variability
Krecichwost et al. Multichannel speech acquisition and analysis for computer-aided sigmatism diagnosis in children
Narayanan Fricative consonants: An articulatory, acoustic, and systems study
Mehta et al. Use of aerodynamic measures in clinical voice assessment
Ma et al. The acoustical role of vocal tract in the horseshoe bat, Rhinolophus pusillus
CN1263423C (en) Method and device for determining respiratory system condition by using respiratory system produced sound
Sinescu et al. Quantitative parameters which describe speech sound distortions due to inadequate dental mounting
Story et al. The relation of velopharyngeal coupling area to the identification of stop versus nasal consonants in North American English based on speech generated by acoustically driven vocal tract modulations
Jesus et al. Ultrasonography applied to the description of voice quality settings in adult speakers of Brazilian Portuguese

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10854262

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 10854262

Country of ref document: EP

Kind code of ref document: A1