EP0071716B1 - Allophone vocoder - Google Patents
- Publication number
- EP0071716B1
- Authority
- EP
- European Patent Office
- Prior art keywords
- phoneme
- analog
- speech
- speech signal
- allophone
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/0018—Speech coding using phonetic or linguistical decoding of the source; Reconstruction using text-to-speech synthesis
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
Definitions
- FIGURE 1 illustrates in block diagram the capabilities of an embodiment of the invention.
- Analog speech 101 is picked up by the microphone 102 and transmitted in analog form to the analog to digital (A/D) converter 103. Once the signal has been translated into digital form, it is converted to a perceived phoneme via the conversion means 104. Each perceived phoneme is communicated to the comparator 105 and referenced to templates in the library 106 so that a match is obtained. Once a matched phoneme is determined, its code is communicated via the bus 107 to either the phoneme sequencer 108, the storage means 109, or the transmitter 110.
- A/D analog to digital
- the code sequence which matches to the phoneme sequence totally identifies the analog speech 101.
- This code sequence is more susceptible to being packed or for storage than the original analog speech 101 due to its digital nature.
- the phoneme sequencer 108 utilizes the code communicated via the bus 107 to obtain the appropriate phoneme from the library 106.
- This phoneme from the library 106 has associated with it a set of allophone characteristics which are communicated to the synthesizer 114.
- the synthesizer 114 communicates an analog signal to operate speaker 115 in the generation of speech 116.
- a more intelligible and higher quality speech 116 is generated. This translation ability permits the encoding of the data in a phoneme base so as to facilitate a lower bit per second transmission rate and thus requires less time and storage medium for the recordation of the original analog speech 101.
- the phoneme codes are stored via storage means 109 for later retrieval. This later retrieval is optionally used by the phoneme sequencer 108, synthesizer 114, and speaker 115 sequence to again synthesize the phoneme sequence in allophone form for generation of speech 116.
- the storage means 109 communicates the phoneme codes to the phoneme-to-alphabet converter 111 which translates the phonemes to their equivalent alphanumeric parts. Once the phonemes have been translated to the alphanumeric parts, such as in ASCII code, they are readily transmitted to the printer 112 so as to produce a paper copy 113 of the original analog speech 101.
- the storage means 109 allows the invention to generate printed text from a speech input so as to permit an automatic dictating device.
- Another alternative is for the phoneme codes from the bus 107 to be communicated to a transmitter 110.
- the transmitter generates signals 117 representative of the phoneme codes which are perceived by a remote unit 120 at its receiver 118.
- the remote unit 120 contains the same capabilities as the local unit 121. This entails the transmission of the phoneme code via a bus 119 from the receiver 118. Again, once the phoneme code is transmitted via the bus 119, it is available to the remote storage means 109' or the remote sequencer 108'. In another embodiment of the invention the phoneme codes transmitted via the bus 119 can also be communicated to a remote transmitter, not shown.
- the remote unit 120 utilizes the phoneme codes in the same manner as the local unit 121.
- the phoneme codes are utilized by the remote sequencer 108' in conjunction with the data in the remote library 106' to generate an analogous allophone sequence which is communicated to the remote synthesizer 114'.
- the remote synthesizer 114' controls the operation of the remote speaker 115' in generating the speech 116'.
- the remote unit 120 also has the option of storing the phoneme code at the remote storage means 109' for later use by the remote sequencer 108' or the phoneme to alphabet converter 111'.
- the phoneme-to-alphabet converter 111' translates the phoneme code to its equivalent alphanumeric symbols which are communicated to the printer 112' to generate a paper copy 113'.
- the analog speech is translated to a phoneme code which is more susceptible to storage or for manipulation as a data string.
- the phoneme code permits easy storage, transmission, generation of a printed copy or eventual synthesis by translation to an analogous allophone sequence.
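The phoneme-to-alphabet conversion performed by converter 111 can be sketched as a table lookup. This is a minimal illustration only: the code values and spellings in `PHONEME_TEXT` are hypothetical, since the text does not list the actual table, and rendering unknown codes as "?" is an assumption made here.

```python
# Hypothetical phoneme codes and spellings; the patent text does not give its table.
PHONEME_TEXT = {
    0: " ",   # pause or word boundary
    1: "SH",
    2: "IH",
    3: "P",
    4: "CH",
}


def phonemes_to_ascii(codes):
    """Translate a sequence of phoneme codes into ASCII bytes for a printer."""
    # Unknown codes become "?" (an assumption) so the draft keeps its length.
    text = "".join(PHONEME_TEXT.get(code, "?") for code in codes)
    return text.encode("ascii")


if __name__ == "__main__":
    print(phonemes_to_ascii([1, 2, 3]))
```

A misrecognized phoneme shows up directly in the printed draft, which is why the dictating-machine use is described as producing a first cut for later editing.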
- FIGURE 2a illustrates, in block form, an embodiment of the invention which receives the analog speech input and results in a speech output.
- the original analog speech signal input 201 is communicated to a phoneme recognizer 202 which generates a sequence of phonemes 203 via a communication channel 204.
- the sequence of phonemes 205 is communicated to a phoneme-to-allophone synthesizer 206 which translates the phoneme sequence into its analogous allophone sequence so as to generate the speech output 207.
- the phoneme recognizer 202 and the phoneme-to-allophone synthesizer 206 are alternatively in the same unit or are remote one from the other.
- the communication channel 204 is either a hard wired device, such as a bus or a telephone line, or a radio transmitter with receiver.
- FIGURE 2b illustrates an embodiment of the phoneme recognizer 202 illustrated in FIGURE 2a.
- the analog speech signal input 201 is communicated to an automatic gain control circuit (AGC) 208 so as to regulate the speech signal into a certain desirable balance.
- AGC automatic gain control circuit
- the formant tracker 209 breaks the analog signal into its formant components which are stored in a random access memory (RAM) 210.
- RAM random access memory
- the formants stored in RAM 210 are communicated to the phoneme boundary detection means 211 so as to group the formants into perceived phoneme components.
- Each perceived phoneme is communicated to the recognition algorithm 212 which utilizes the phoneme templates from the library 213 which is comprised of known phonemes. A best match is made between the perceived phoneme from the phoneme boundary detection means 211 and the templates found in the phoneme template library 213 by the recognition algorithm 212 so as to generate a recognized phoneme code 214.
- the recognition algorithm 212 provides a continuous sequence of phoneme codes so that a blank or non-recognized phoneme does not exist in the sequence. A blank for a non-recognition determination only results in an increase in the noise of the invention.
- FIGURE 2c illustrates an embodiment of a phoneme-to-allophone synthesizer 206.
- the sequence of phoneme codes 205 is communicated to the controller 215.
- the controller 215 utilizes these codes and its prompting of the read only memory (ROM) 217 to communicate to the speech synthesizer 216 the appropriate bit sequence indicative of the analogous allophone sequence.
- This data communicated from the ROM 217 to the speech synthesizer 216 establishes the parameters necessary for the modulation of the speaker 218 in the generation of the synthesized speech.
- the speech synthesizer is chosen from a wide variety of speech synthesis means, including, but not limited to, the use of a linear predictive filter.
- FIGURE 3 is a block diagram of an embodiment of the invention which generates indicia representative of the analog speech.
- the automatic gain control circuit (AGC) 301 communicates an analog speech signal to the pitch tracker 302 and the integration means 304, 314, and 324.
- the pitch tracker 302 generates a fundamental frequency F0.
- a respective set of integers is determined for which the fundamental frequency F0, when multiplied by the integer falls within the formant range.
- the respective sets of integers are broadened to include an overlap in the sets so that the entire formant is defined.
- the integer set for the first formant may contain (0, 1, 2, 3, 4); the second formant integer set contains (4, 5, 6, 7); the third formant integer set contains (7, 8, 9).
- the formant determiner 308 accepts the fundamental frequency F0 and utilizes it with an integer value from the integer set for n in the sinusoidal oscillator 303.
- the sinusoidal oscillator 303 generates a sinusoidal signal, s(t), which is centered at the product of n and the fundamental frequency.
- the sinusoidal signal is communicated to the integrator 304 which integrates the product of the sinusoidal signal s(t) and the analog speech signal, f(t) over the chosen frequency of the formant. This integration by the integrator 304 creates a convolution of the analog speech signal f(t).
- This operation involving the generation of a sinusoidal signal by the sinusoidal oscillator 303 and the communication thereof to the integrator 304 is continued for all integer values within the integer set by the incrementer 306.
- the value of n which generates the maximum amplitude from the integrator 304 is chosen by the determinator 305.
- This product additionally is determinative of the bandwidth BW1, of the first formant and the pair F1 and BW1 are communicated via channel 307.
- the formant determiners 318 and 328 generate a sinusoidal signal via the sinusoidal oscillators 313 and 323 respectively and subsequently integrate by the integrators 314 and 324 so as to obtain the optimal values M' and K', 315 and 325 respectively.
- the indicia BW1, F1, BW2, F2, BW3, F3, and F0 represent the perceived phoneme indicia from the analog speech from the AGC circuit 301. This perceived indicia is used to match the perceived phoneme to a phoneme template in a library so as to obtain a best match.
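The formant search of FIGURE 3 can be sketched as follows, under stated assumptions. The sketch works on a sampled signal rather than an analog one, replaces the integral with a sum, and correlates against both a sine and a cosine so the result does not depend on phase. The function names are hypothetical, and the formant limits are the 200-850 Hz, 850-2,500 Hz and 2,500-3,500 Hz ranges given in the description. Unlike the example integer sets in the text, the candidate set here starts at 1 rather than 0.

```python
import math

# Formant ranges in Hz, as given in the description.
FORMANT_RANGES = [(200.0, 850.0), (850.0, 2500.0), (2500.0, 3500.0)]


def harmonic_set(f0, lo, hi, overlap=1):
    """Integers n with lo <= n * f0 <= hi, widened by `overlap` on each side."""
    n_min = max(1, math.ceil(lo / f0) - overlap)
    n_max = math.floor(hi / f0) + overlap
    return list(range(n_min, n_max + 1))


def harmonic_amplitude(samples, sample_rate, freq):
    """Amplitude of the component of `samples` at `freq` (discrete correlation)."""
    re = 0.0
    im = 0.0
    for k, x in enumerate(samples):
        phase = 2.0 * math.pi * freq * k / sample_rate
        re += x * math.cos(phase)
        im += x * math.sin(phase)
    return 2.0 * math.hypot(re, im) / len(samples)


def best_harmonic(samples, sample_rate, f0, lo, hi):
    """Return (n, amplitude) for the candidate n with the largest amplitude."""
    scored = [
        (n, harmonic_amplitude(samples, sample_rate, n * f0))
        for n in harmonic_set(f0, lo, hi)
    ]
    return max(scored, key=lambda pair: pair[1])
```

Calling `best_harmonic` once per entry of `FORMANT_RANGES` would give the three optimal integers that the text calls N', M' and K'.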
- FIGURE 4 indicates the relationship of the bandwidth to the optimal formant.
- Once the optimal integer value N' is determined, its amplitude is plotted relative to the surrounding integers.
- the independent axis 402 contains the frequencies as dictated by the product of the integer value with the fundamental frequency.
- the dependent axis 403 contains the amplitude generated by the product in the convolution with the analog speech signal. As illustrated, the optimal value N' generates an amplitude 404.
- a bandwidth BW1 is determined for the appropriate optimal value N'.
- this bandwidth forms another indicia for determining the perceived phoneme relationship to the phoneme templates of the library. Similar analysis is done for each formant.
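The text does not say how the bandwidth is read off the plot of FIGURE 4. One possible reading, offered purely as an assumption, is to measure the span of neighbouring harmonics whose amplitude stays at or above a fraction of the peak. The function name and the 0.5 threshold below are hypothetical.

```python
def bandwidth_around_peak(amplitudes, f0, threshold=0.5):
    """Estimate (n_peak, bandwidth_hz) from a mapping of harmonic n -> amplitude.

    Assumption: the bandwidth is the frequency span of the contiguous run of
    harmonics around the peak whose amplitude is at least threshold * peak.
    """
    n_peak = max(amplitudes, key=amplitudes.get)
    floor = threshold * amplitudes[n_peak]
    lo = hi = n_peak
    # Walk outward from the peak while neighbours stay above the floor.
    while (lo - 1) in amplitudes and amplitudes[lo - 1] >= floor:
        lo -= 1
    while (hi + 1) in amplitudes and amplitudes[hi + 1] >= floor:
        hi += 1
    return n_peak, (hi - lo) * f0
```

Because the harmonics are spaced F0 apart, an estimate of this kind is only resolved in steps of F0.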
- FIGURE 5 is a flow chart of an embodiment for determining the optimal formant positions.
- the algorithm is started at 501 and a fundamental frequency, F0, 502 is determined. This fundamental frequency is utilized to optimize on N 503.
- the optimization on N 503 entails the initialization of the N value 504 followed by the sinusoidal oscillation at the product of N and F0 505.
- the frequency convolver 506 generates the convolution of the fundamental frequency F0 and the inputted analog speech signal over the chosen frequency of the formant.
- the convolution is optimized at 507 wherein if it is not the optimal value, the N value is incremented at 508 and the process is repeated until an optimal N value is determined.
- the algorithm proceeds to optimize on the value of M 513 and then to optimize on the value K 523.
- the optimization on N 503, the optimization of M 513, and the optimization of K 523 are identical in structure and performance.
- three formant frequency ranges are utilized to define the human language. It has been found that three ranges accurately described the human speech, but this methodology is either extendable or contractable at the will of the designer. No loss in generality is encountered when the algorithm is extended to apply to a single formant or similarly to extend to more than three formants.
- FIGURE 6 graphically illustrates another methodology for the encoding of the analog speech signal in the formants.
- the analog speech signal 608 is plotted over the independent axis 601 of frequency.
- the dependent axis 602 is the amplitude.
- the first formant lies in the frequency range between 200 and 700 Hz.
- the second formant 604 has a frequency range of 850 to 2500 Hz; and the third formant 605 has a frequency range of 2700 to 3500 Hz.
- a method similar to the methodology discussed in FIGURE 3 and FIGURE 5 is used to determine the location of the maximum amplitude within the formant range. These maxima yield a distance between maxima, 606 and 607 respectively.
- the distance, d1, between the optimal first and second formants is used to characterize the perceived phoneme for matching to a phoneme template. This methodology allows two integer values d1 and d2 to describe what previously necessitated the use of three integer values (for the first, second and third formants).
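As a small worked example of this encoding, the two distances can be computed from the three optimal harmonic integers. Treating the second distance as the gap between the second and third formants is an assumption, as is the function name.

```python
def formant_distances(n1, n2, n3):
    """Return (d1, d2): gaps between the optimal harmonic integers of
    formants 1 and 2, and (assumed) formants 2 and 3."""
    if not n1 <= n2 <= n3:
        raise ValueError("formant positions must be in ascending order")
    return n2 - n1, n3 - n2
```

With optimal integers 5, 15 and 30, for instance, the perceived phoneme would be described by the pair (10, 15) instead of three separate positions.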
- FIGURE 7 is an embodiment of the encoding scheme for establishing a word for matching to the phoneme template.
- the data word 701 in this example is an 8 bit word but any length of word which is capable of adequately describing the perceived phoneme is acceptable.
- the 8 bits are broken up into four basic components, 702, 703, 704, and 705.
- the first component 702 is indicative of a pause or no pause situation.
- If b0 is set to a value of 1, a pause has been perceived and the appropriate steps will therefore be taken; similarly a 0 at b0 indicates lack of a pause.
- The second component is bit b1, 703, which indicates a voiced or unvoiced phoneme.
- Bits b2-b3, 704, indicate the contour of the analog speech signal; its assigned value indicates a level slope, a positive slope or a negative slope.
- Bits b4-b7, 705, indicate a mixture of the relative energy, relative pitch, first distance, and second distance.
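The layout of the data word 701 can be sketched as bit packing. Several details are assumptions not found in the text: b0 is taken as the least significant bit, the contour values 0, 1 and 2 stand for level, positive and negative slope, and the four bits b4-b7 are carried as one opaque `features` field because the text does not say how the mixture is formed.

```python
from typing import NamedTuple

# Assumed contour encoding; the text only names the three slope types.
CONTOUR_LEVEL, CONTOUR_POSITIVE, CONTOUR_NEGATIVE = 0, 1, 2


class PhonemeWord(NamedTuple):
    pause: bool     # b0
    voiced: bool    # b1
    contour: int    # b2-b3
    features: int   # b4-b7, mixture of energy, pitch and the two distances


def pack_word(word):
    """Pack the four components into one 8-bit integer (b0 = least significant)."""
    if not 0 <= word.contour <= 3:
        raise ValueError("contour must fit in 2 bits")
    if not 0 <= word.features <= 15:
        raise ValueError("features must fit in 4 bits")
    return (
        int(word.pause)
        | (int(word.voiced) << 1)
        | (word.contour << 2)
        | (word.features << 4)
    )


def unpack_word(byte):
    """Split an 8-bit integer back into its four components."""
    return PhonemeWord(
        pause=bool(byte & 1),
        voiced=bool((byte >> 1) & 1),
        contour=(byte >> 2) & 0b11,
        features=(byte >> 4) & 0b1111,
    )
```

A longer word, which the text allows, would only change the field widths and shifts.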
- FIGURE 8 illustrates the translation of the phoneme code sequence into its appropriate allophone sequence or alternately its alphanumeric counterpart.
- the phoneme sequence 801 is broken into its phoneme codes such as phoneme code 802.
- the phoneme code 802 distinctly describes a particular phoneme 807.
- This phoneme 807 is either printed as at 805 in its ASCII alphanumeric character or it is translated to its analogous allophone sequence when it is taken in conjunction with the surrounding phoneme codes 803 and 804.
- the allophone sequence 806 is generated through the knowledge of the target phoneme 807 and its relationship to its surrounding phonemes.
- the phonemes which precede, 803, and follow, 804, the target phoneme 802 are retained in memory so as to generate the appropriate allophone sequence 806.
- FIGURE 9 illustrates the characteristics of an embodiment of a decisional tree which determines the best approximation of the phoneme template in matching the perceived phoneme.
- the decisional tree is broken up into multiple stages 901, 902, etc. Each stage of the tree breaks the perceived phoneme into feasible and infeasible matches. As the perceived phoneme is further broken into feasible and infeasible states, the infeasible state becomes absorbing and the feasible state decreases so that eventually a single phoneme template is the only possible choice. Hence, the final stage of the tree must consist of as many nodes as there are templates.
- the original decision 903 is made on whether the first bit, b0, is either set or not set. If the first bit is set, transition is made to node 905; the nodes which follow node 904, B1, are ignored. This determination on the b0 level results in translating the available phoneme templates into an infeasible set, those lying exclusively behind node 904, and a feasible set, those lying behind node B2, 905. A similar determination is made for each component part of the indicia. In this example, another separation is made on b1 and then on the value of b2-b3. This separation into nodes is continued until a final or terminating node is encountered which uniquely identifies the phoneme template chosen.
- Movement is acceptable laterally between nodes such as between nodes E1, 908, and E2, 909 via the ray 907. This movement is permissible so long as a cycle is not thereby created.
- ray 910 indicates a cycle between D1 and C1. For example, a sequence containing C1-D1-C1-D1-C1 is not acceptable since it is a cycle. This sequence causes a never ending cycle which results in a decision never being made.
- the one qualification of the tree illustrated in this embodiment is that a decision must eventually be reached.
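The staged narrowing of FIGURE 9 can be sketched as repeated filtering of a feasible set. This is an illustration under assumptions: each template is a tuple of component values in stage order, a stage that would leave no feasible template is skipped so that a decision is always reached, and a remaining tie is broken by the lowest code. Lateral movement between nodes is not modelled.

```python
def classify_by_stages(perceived, templates):
    """Return the code of the template chosen for `perceived`.

    perceived: tuple of component values, one per stage of the tree.
    templates: dict mapping phoneme code -> tuple of the same length.
    """
    if not templates:
        raise ValueError("template library is empty")
    feasible = dict(templates)
    for stage, value in enumerate(perceived):
        matching = {c: t for c, t in feasible.items() if t[stage] == value}
        # Assumption: if nothing matches at this stage, keep the current
        # feasible set rather than fail, so a decision is always reached.
        if matching:
            feasible = matching
        if len(feasible) == 1:
            break
    # Assumed tie-break: lowest code among the remaining templates.
    return min(feasible)
```

Each stage only discards templates, which mirrors the description of the infeasible state as absorbing.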
- the algorithm illustrated in FIGURE 9 is but one embodiment to identify the best match between the perceived phoneme and the phoneme template. Another approach is to generate a comparison value for each phoneme template relative to the perceived phoneme and then choose the optimal value accordingly. This approach requires more computation and a longer time for its operation.
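The alternative approach, scoring every template against the perceived phoneme, can be sketched as below. The text does not define the comparison value; the Hamming distance between 8-bit words is used here purely as an assumption, and the function names are hypothetical.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two integers differ."""
    return bin(a ^ b).count("1")


def best_match(perceived_word, template_words):
    """Return the code whose template word is closest to `perceived_word`.

    template_words: dict mapping phoneme code -> 8-bit template word.
    Ties are broken by the lowest code (an assumption).
    """
    if not template_words:
        raise ValueError("template library is empty")
    return min(
        template_words,
        key=lambda code: (hamming_distance(perceived_word, template_words[code]), code),
    )
```

Every template is visited on every call, which is the extra computation the text refers to; in return, a code is always produced, so no blank appears in the sequence.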
- FIGURES 10a and 10b illustrate a phoneme to allophone transformation wherein a phoneme is translated to its analogous allophone sequence.
- In FIGURE 10a, a list of the rules for defining the allophone is set forth.
- 1001 illustrates a blank or a word boundary.
- the different symbols illustrated indicate different allophonic characteristics which are attachable to a phoneme.
- the syllables are broken by a period ".", 1002.
- These allophonic rules are combined with the phonemes to generate the appropriate allophone sequence.
- FIGURE 10b illustrates how the phoneme "CH", 1003, translates into an appropriate allophone sequence.
- the phoneme "CH" is either a "b CH", 1004, as in "chain" or lies within a word as illustrated by "CH", 1005, as in "bewitching".
- Each phoneme maps into a unique allophone sequence. This allophone sequence is determined through knowledge of the preceding phoneme and the following phoneme within the phoneme sequence.
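The context-dependent translation can be sketched with the "CH" example. The allophone labels `CH1` and `CH2`, the rule table, and the use of a space as the word-boundary symbol are all hypothetical; the text only states that "CH" after a blank (as in "chain") and "CH" within a word (as in "bewitching") map to different allophones. The following phoneme is passed in for completeness but is not used by this one rule.

```python
WORD_BOUNDARY = " "  # assumed stand-in for the blank symbol 1001

# Hypothetical allophone labels for word-initial and word-internal positions.
ALLOPHONE_RULES = {
    "CH": {"initial": "CH1", "medial": "CH2"},
}


def select_allophone(previous, target, following):
    """Pick the allophone for `target` given its neighbouring phonemes."""
    rules = ALLOPHONE_RULES.get(target)
    if rules is None:
        return target  # no rule listed: pass the phoneme through unchanged
    position = "initial" if previous == WORD_BOUNDARY else "medial"
    return rules[position]


def to_allophones(phonemes):
    """Translate a phoneme sequence, keeping one phoneme of context each side."""
    padded = [WORD_BOUNDARY, *phonemes, WORD_BOUNDARY]
    return [
        select_allophone(padded[i - 1], padded[i], padded[i + 1])
        for i in range(1, len(padded) - 1)
        if padded[i] != WORD_BOUNDARY
    ]
```

Only the preceding and following phonemes need to be retained in memory, which matches the three-phoneme window described for FIGURE 8.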
- the invention as described herein details the use of a voice recognition system which translates the analog speech signal into a phoneme sequence which is more susceptible to compaction, storage, transmission, or translation to an analogous allophone sequence for speech synthesis.
- the phoneme perception allows for an unlimited vocabulary to be used and also for a best match to be generated.
- the use of a best match is acceptable since the human ear acts as a filtering mechanism and the human brain ignores random noise so as to also filter the synthesized speech.
- the synthesized speech is enhanced dramatically through the translation of the phoneme sequence to an analogous allophone sequence.
- the stored phoneme sequence is susceptible to being translated to an alphanumeric sequence or for transmission via the radio or telephone lines.
- This invention makes it possible for a direct speech to text dictating machine to be implemented and also can be advantageously employed to produce a highly efficient speech data transmission rate.
Description
- This invention relates to a vocoder system as defined in the precharacterizing part of claim 1. Such a system is known from IEEE International Conference on Acoustics, Speech, and Signal Processing, April 10-12, 1978, V.N. Gupta et al.: "Speaker-independent vowel identification in continuous speech", pages 546-548. Further, the invention relates to a method of analyzing a speech signal and producing audible synthesized speech as defined in the precharacterizing part of claim 4.
- It has long been recognized that analog speech signals contain numerous redundant sounds so as to make such signals not suitable for efficient data transmission. In a direct human interaction situation this inefficiency is tolerable. The technical requirements for coping with inefficient speech transmission, though, become infeasible because of the cost, time, and increased memory storage that the inefficiency makes necessary. A need exists for a system and a method which can take an analog speech signal and translate it into a digital form which is reconstructable after transmission or storage. This type of device is generally referred to as a "vocoder".
- Intrinsic to the recognition of an analog speech is the use of a methodology which breaks the analog speech into its component parts which may be compared to some library for identification. Numerous methods and apparatuses have evolved so as to approximate the human speech and to model it. These modeling techniques include the vocoder, linear predictive filters, and other devices.
- One such method of analyzing the analog speech was discussed by James L. Flanagan in the article "Automatic Extraction of Formant Frequencies from Continuous Speech" first printed in J. Acoust. Soc. Am., Vol. 28, pp. 110-118, January 1956, incorporated hereinto by reference.
- In the article, Flanagan discusses two electronic devices which automatically extract the first three formant frequencies from continuous speech. These devices yield continuous DC output voltages whose magnitudes as functions of time represent the formant frequencies of the speech. Although the formant frequencies are in an analog form, use of an analog-digital (AD) converter readily transforms these formant frequencies into digital form which is more suitable for use in an electronic environment.
- Another method was discussed by H. K. Dunn in his article "Methods of Measuring Vowel Formant Bandwidths" J. Acoust. Soc. Am., Vol. 33, pp. 1737-1746, December 1961, incorporated hereinto by reference. In the article, Dunn discloses the use of spectrums of real speech and the use of an artificial larynx in an application to real subjects.
- It is clear therefore that an efficient methodology and apparatus for transformation of an analog speech signal to an approximating digital form does not exist. The mere recognition of formants or the use of diphones in the synthesis of the perceived speech is inaccurate and does not allow for quality recordation and transmission of data representative of the original speech signal.
- The problems described above can be solved by use of the features contained in the characterizing parts of the claims.
- Since the human ear acts as a filtering mechanism and also due to the inherent redundancy of the spoken language, any errors which are generated in the selection of the optimal phoneme match are minimized. For example, assume that the phoneme recognizer incorrectly matched the spoken phoneme "SH" to the phoneme "CH" in the phrase "We will be taking a cruise on the ship". This results in the phrase becoming "We will be taking a cruise on the chip". Although the transmitted phoneme sequence is not a perfect match, the total phrase is still intelligible to the listener since the human ear and the mental process filter out this incorrect phoneme. The human ear and mental process have developed over the years to compensate for variations in pronunciations and the incorrect usage of words.
- Some applications of this allophone vocoder device are found in a digital dictating machine, a store and play telephone, voice memos, multichannel voice communications, voice recorded exams, etc. In the situation of a dictating machine, the erroneous matching of the phonemes is more visible than in the synthesized speech situation; but it provides a rough draft or first cut to the document so as to be edited later.
- This invention uses a phoneme-to-allophone matching algorithm, such that the quality of synthesized speech is vastly improved since allophones more closely map the human utterances.
- This vocoder accepts the analog speech input and matches it to a set of phoneme templates; the phonemes each contain a phoneme code which is compressed into a sequence of phoneme codes and communicated via a channel. This channel should be as noise free as possible so as to provide accurate transmission. The sequence of phonemes is received and then translated to an analogous allophone sequence and synthesized through known electronic synthesis means.
- One such means is discussed in U.S. Patent No. 4,209,836 issued to Wiggins, Jr., et al on June 24, 1980, incorporated hereinto by reference. This speech synthesis integrated circuit device uses a linear predictive filter in its generation of the synthesized speech.
- The control of the data within the synthesizer is well known in the art. One such method for communicating digital speech data and control of the memory for storing the data is disclosed in U.S. Patent No. 4,234,761 issued to Wiggins, Jr., et al on November 18, 1980.
- In an embodiment of the invention as defined in claim 3, the phoneme recognizer contains an automatic gain control (AGC). The phoneme recognizer receives the voice input and automatically controls the gain of the voice and sends a signal to the formant-determining means for analysis and formant extraction. The algorithm operates on the formants and features of the utterance requiring the detection of the phoneme boundary within the speech. The detected phoneme is matched to a phoneme in a library of phoneme templates. Each phoneme template has a corresponding identification code. The selected identification code is sequentially packed and transmitted via a transmission channel to a receiver.
- The transmission channel may be either a wired or wireless communication medium. Ideally the transmission channel is as noiseless as possible so as to reduce errors.
- The phoneme-to-allophone synthesizer receives the phoneme codes from the channel. The algorithm converts the phoneme sequence into an analogous allophone sequence and thereby produces quality speech. In the phoneme-to-allophone synthesizer, a control means sequentially directs a library of allophone characteristics to be communicated to a speech synthesizer.
- The use of an efficient formant-determining means is beneficial. A formant is a frequency component in the spectrum of speech which has large amplitude energy. It also has a resonant frequency of the pitch and a voiced sound. This resonant frequency is a multiple of a fundamental frequency. The first formant occurs between 200 and 850 Hertz (Hz), the second formant occurs between 850 and 2,500 Hz, and the third formant occurs between 2,500 and 3,500 Hz.
- The invention together with its particular embodiments and ramifications will be more fully explained by the following drawings and their accompanying descriptions.
- FIGURE 1 is a block diagram of an embodiment of the invention illustrating the data compression and transmission capabilities of the invention.
- FIGURE 2a is a block diagram of the communication relationship of the invention.
- FIGURES 2b and 2c illustrate the recognition side and the synthesis side respectively of the embodiment illustrated in FIGURE 2a.
- FIGURE 3 is an embodiment of the invention utilized to generate indicia representative of the analog speech signal.
- FIGURE 4 is illustrative of the determination of the bandwidth associated with a particular formant.
- FIGURE 5 is a flow chart of an embodiment determining the formant of the analog speech signal.
- FIGURE 6 illustrates a method of determining indicia so as to define a particular formant structure of an analog speech signal.
- FIGURE 7 illustrates an encoding scheme for the indicia.
- FIGURE 8 illustrates a translational operation of a phoneme to either an allophone or alphanumeric characters.
- FIGURE 9 is an example of a decisional tree operating upon the encoded indicia as represented in FIGURE 7.
- FIGURES 10a and 10b illustrate the translation of phonemes-to-allophones.
- FIGURE 1 illustrates in block diagram the capabilities of an embodiment of the invention.
- Analog speech 101 is picked up by the microphone 102 and transmitted in analog form to the analog to digital (A/D) converter 103. Once the signal has been translated into digital form, it is converted to a perceived phoneme via the conversion means 104. Each perceived phoneme is communicated to the comparator 105 and referenced to templates in the library 106 so that a match is obtained. Once a matched phoneme is determined, its code is communicated via the bus 107 to either the phoneme sequencer 108, the storage means 109, or the transmitter 110.
- The code sequence which matches to the phoneme sequence totally identifies the analog speech 101. This code sequence is more susceptible to being packed or for storage than the original analog speech 101 due to its digital nature.
- The phoneme sequencer 108 utilizes the code communicated via the bus 107 to obtain the appropriate phoneme from the library 106. This phoneme from the library 106 has associated with it a set of allophone characteristics which are communicated to the synthesizer 114. The synthesizer 114 communicates an analog signal to operate speaker 115 in the generation of speech 116. Through the use of the phoneme-to-allophone translation as effectuated by the phoneme sequencer 108, with the aid of library 106, a more intelligible and higher quality speech 116 is generated. This translation ability permits the encoding of the data in a phoneme base so as to facilitate a lower bit per second transmission rate and thus requires less time and storage medium for the recordation of the original analog speech 101.
- Alternatively, the phoneme codes are stored via storage means 109 for later retrieval. This later retrieval is optionally used by the phoneme sequencer 108, synthesizer 114, and speaker 115 sequence to again synthesize the phoneme sequence in allophone form for generation of speech 116. Optionally, the storage means 109 communicates the phoneme codes to the phoneme-to-alphabet converter 111 which translates the phonemes to their equivalent alphanumeric parts. Once the phonemes have been translated to the alphanumeric parts, such as in ASCII code, they are readily transmitted to the printer 112 so as to produce a paper copy 113 of the original analog speech 101.
- This branch of the operation, the storage means 109, phoneme-to-alphabet converter 111, and printer 112, allows the invention to generate printed text from a speech input so as to permit an automatic dictating device.
- Another alternative is for the phoneme codes from the bus 107 to be communicated to a transmitter 110. The transmitter generates signals 117 representative of the phoneme codes which are perceived by a remote unit 120 at its receiver 118.
- The remote unit 120 contains the same capabilities as the local unit 121. This entails the transmission of the phoneme code via a bus 119 from the receiver 118. Again, once the phoneme code is transmitted via the bus 119, it is susceptible for the remote storage means 109' or the remote sequencer 108'. In another embodiment of the invention the phoneme codes transmitted via the bus 119 are also communicatable to a remote transmitter, not shown.
- The remote unit 120 utilizes the phoneme codes in the same manner as the local unit 121. The phoneme codes are utilized by the remote sequencer 108' in conjunction with the data in the remote library 106' to generate an analogous allophone sequence which is communicated to the remote synthesizer 114'. The remote synthesizer 114' controls the operation of the remote speaker 115' in generating the speech 116'. The remote unit 120 also has the option of storing the phoneme code at the remote storage means 109' for later use by the remote sequencer 108' or the phoneme-to-alphabet converter 111'. The phoneme-to-alphabet converter 111' translates the phoneme code to its equivalent alphanumeric symbols which are communicated to the printer 112' to generate a paper copy 113'.
- It is clear from this embodiment of the invention that the analog speech is translated to a phoneme code which is more susceptible to storage or for manipulation as a data string. The phoneme code permits easy storage, transmission, generation of a printed copy or eventual synthesis by translation to an analogous allophone sequence.
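- The storage and printing branches described for FIGURE 1 can be sketched roughly as follows. This is only an illustration: the code values, the `PHONEME_TO_TEXT` spellings, and the function names are hypothetical and do not come from the specification, which does not list actual phoneme codes.

```python
# Hypothetical one-byte phoneme codes (the specification gives no code values).
PHONEME_CODES = {"SH": 0x01, "IH": 0x02, "P": 0x03, "CH": 0x04}
CODE_TO_PHONEME = {code: name for name, code in PHONEME_CODES.items()}

# Hypothetical phoneme-to-alphabet table standing in for converter 111.
PHONEME_TO_TEXT = {"SH": "sh", "IH": "i", "P": "p", "CH": "ch"}


def pack_codes(phonemes):
    """Pack a phoneme sequence into a compact byte string for storage or transmission."""
    return bytes(PHONEME_CODES[p] for p in phonemes)


def unpack_codes(packed):
    """Recover the phoneme sequence from a packed byte string."""
    return [CODE_TO_PHONEME[b] for b in packed]


def to_ascii(phonemes):
    """Translate phonemes to ASCII text, as a printer branch might require."""
    return "".join(PHONEME_TO_TEXT[p] for p in phonemes).encode("ascii")


stored = pack_codes(["SH", "IH", "P"])   # three bytes for three phonemes
printed = to_ascii(unpack_codes(stored))
```

Under these assumptions the same stored byte string serves both branches: it can be unpacked for the sequencer or translated to text for the printer.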
- FIGURE 2a illustrates, in block form, an embodiment of the invention which receives the analog speech input and results in a speech output.
- In the embodiment of FIGURE 2a, the original analog speech signal input 201 is communicated to a phoneme recognizer 202 which generates a sequence of phonemes 203 via a communication channel 204. The sequence of phonemes 205 is communicated to a phoneme-to-allophone synthesizer 206 which translates the phoneme sequence into its analogous allophone sequence so as to generate the speech output 207. It should be noted that the phoneme recognizer 202 and the phoneme-to-allophone synthesizer 206 are alternatively in the same unit or are remote one from the other. In this context the communication channel 204 is either a hard wired device such as bus or a telephone line or a radio transmitter with receiver.
- FIGURE 2b illustrates an embodiment of the phoneme recognizer 202 illustrated in FIGURE 2a. The analog speech signal input 201 is communicated to an automatic gain control circuit (AGC) 208 so as to regulate the speech signal into a certain desirable balance. The formant tracker 209 breaks the analog signal into its formant components which are stored in a random access memory (RAM) 210. Although in this embodiment the use of a RAM 210 is illustrated, it is contemplated that any suitable storage means could be employed. The formants stored in RAM 210 are communicated to the phoneme boundary detection means 211 so as to group the formants into perceived phoneme components. Each perceived phoneme is communicated to the recognition algorithm 212 which utilizes the phoneme templates from the library 213 which is comprised of known phonemes. A best match is made between the perceived phoneme from the phoneme boundary detection means 211 and the templates found in the phoneme template library 213 by the recognition algorithm 212 so as to generate a recognized phoneme code 214.
- As noted earlier, a best match is obtained, even if not a perfect recognition, since the natural filtering of the human ear and the error correction of the mental processes of the listener minimize any errors generated by the recognition algorithm 212. The recognition algorithm 212 provides a continuous sequence of phoneme codes so that a blank or non-recognized phoneme does not exist in the sequence. A blank for a non-recognition determination only results in an increase in the noise of the invention.
- FIGURE 2c illustrates an embodiment of a phoneme-to-allophone synthesizer 206.
- The sequence of phoneme codes 205 is communicated to the controller 215. The controller 215 utilizes these codes and its prompting of the read only memory (ROM) 217 to communicate to the speech synthesizer 216 the appropriate bit sequence indicative of the analogous allophone sequence. This data communicated from the ROM 217 to the speech synthesizer 216 establishes the parameters necessary for the modulation of the speaker 218 in the generation of the synthesized speech.
- The speech synthesizer is chosen from a wide variety of speech synthesis means, including, but not limited to, the use of a linear predictive filter.
- FIGURE 3 is a block diagram of an embodiment of the invention which generates indicia representative of the analog speech.
- This indicia is representative of the perceived phoneme and is used in finding a best match or optimal match with the template in the library. The automatic gain control circuit (AGC) 301 communicates an analog speech signal to the pitch tracker 302 and the integration means 304, 314, and 324. The pitch tracker 302 generates a fundamental frequency F0.
- For each formant determiner
- The formant determiner 308 accepts the fundamental frequency F0 and utilizes it with an integer value from the integer set for n in the sinusoidal oscillator 303. The sinusoidal oscillator 303 generates a sinusoidal signal, s(t), which is centered at the product of n and the fundamental frequency. The sinusoidal signal is communicated to the integrator 304 which integrates the product of the sinusoidal signal s(t) and the analog speech signal f(t) over the chosen frequency of the formant. This integration by the integrator 304 creates a convolution of the analog speech signal f(t).
- This operation involving the generation of a sinusoidal signal by the sinusoidal oscillator 303 and the communication thereof to the integrator 304 is continued for all integer values within the integer set by the incrementer 306. The value of n which generates the maximum amplitude from the integrator 304 is chosen by the determinator 305. This optimal value, N', is used to generate the first formant F1 defined by F1 = N' x F0. This product additionally is determinative of the bandwidth BW1 of the first formant, and the pair F1 and BW1 are communicated via channel 307.
- In like fashion the formant determiners sinusoidal oscillators integrators
- The indicia BW1, F1, BW2, F2, BW3, F3, and F0 represent the perceived phoneme indicia from the analog speech from the AGC circuit 301. This perceived indicia is used to match the perceived phoneme to a phoneme template in a library so as to obtain a best match.
- FIGURE 4 indicates the relationship of the bandwidth to the optimal formant.
- Once the optimal integer value N' is determined, its amplitude is plotted relative to the surrounding integers. The independent axis 402 contains the frequencies as dictated by the product of the integer value with the fundamental frequency. The dependent axis 403 contains the amplitude generated by the product in the convolution with the analog speech signal. As illustrated, the optimal value N' generates an amplitude 404. By utilizing the surrounding data points
- The use of this bandwidth forms another indicia for determining the perceived phoneme relationship to the phoneme templates of the library. Similar analysis is done for each formant.
- FIGURE 5 is a flow chart of an embodiment for determining the optimal formant positions.
- The algorithm is started at 501 and a fundamental frequency, F0, 502 is determined. This fundamental frequency is utilized to optimize on N 503. The optimization on N 503 entails the initialization of the N value 504 followed by the sinusoidal oscillation based at the product of N and F0 505. The frequency convolver 506 generates the convolution of the fundamental frequency F0 and the inputted analog speech signal over the chosen frequency of the formant. The convolution is optimized at 507 wherein if it is not the optimal value, the N value is incremented at 508 and the process is repeated until an optimal N value is determined. At the optimization of N, the algorithm proceeds to optimize on the value of M 513 and then to optimize on the value K 523. The optimization on N 503, the optimization of M 513, and the optimization of K 523 are identical in structure and performance.
- In this embodiment three formant frequency ranges are utilized to define the human language. It has been found that three ranges accurately describe human speech, but this methodology is either extendable or contractable at the will of the designer. No loss in generality is encountered when the algorithm is extended to apply to a single formant or similarly to extend to more than three formants.
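- The search over integer multiples of F0 described for FIGURES 3 and 5 can be sketched as below. This is a minimal sketch under stated assumptions, not the circuit of the specification: the function names, the sampled-signal representation, and the use of both a cosine and a sine reference (so that the result does not depend on the phase of the input) are assumptions made for illustration. The integral of the product s(t)·f(t) is approximated by a sum over samples.

```python
import math


def harmonic_amplitude(samples, sample_rate, freq):
    """Approximate the integral of f(t) * s(t) for a sinusoid s(t) at `freq`.

    Cosine and sine references are combined so the measured amplitude
    is independent of the phase of the input (an assumption of this sketch).
    """
    re = sum(x * math.cos(2 * math.pi * freq * i / sample_rate)
             for i, x in enumerate(samples))
    im = sum(x * math.sin(2 * math.pi * freq * i / sample_rate)
             for i, x in enumerate(samples))
    return math.hypot(re, im) / len(samples)


def find_formant(samples, sample_rate, f0, low_hz, high_hz):
    """Return N' * F0, where N' is the integer n maximizing the amplitude at n * F0.

    Only integers n with low_hz <= n * f0 <= high_hz are tried, mirroring
    the incrementer stepping through the integer set for one formant range.
    """
    n_values = range(math.ceil(low_hz / f0), math.floor(high_hz / f0) + 1)
    best_n = max(n_values,
                 key=lambda n: harmonic_amplitude(samples, sample_rate, n * f0))
    return best_n * f0
```

Calling `find_formant` once per formant range, with the same F0, corresponds to the successive optimizations on N, M, and K in the flow chart.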
- FIGURE 6 graphically illustrates another methodology for the encoding of the analog speech signal in the formants.
- The analog speech signal 608 is plotted over the independent axis 601 of frequency. The dependent axis 602 is the amplitude. Within the first formant 603, the frequency range lies between 200 and 700 Hz. The second formant 604 has a frequency range of 850 to 2500 Hz; and the third formant 605 has a frequency range of 2700 to 3500 Hz. A method similar to the methodology discussed in FIGURE 3 and FIGURE 5 is used to determine the location of the maximum amplitude within the formant range. These maxima yield a distance between maxima, 606 and 607 respectively. The distance, d1, between the optimal first and second formants is used to characterize the perceived phoneme for matching to a phoneme template. This methodology allows two integer values d1 and d2 to describe what previously necessitated the use of three integer values (for the first, second and third formants).
- FIGURE 7 is an embodiment of the encoding scheme for establishing a word for matching to the phoneme template.
- The data word 701 in this example is an 8 bit word but any length of word which is capable of adequately describing the perceived phoneme is acceptable. In this embodiment the 8 bits are broken up into four basic components, 702, 703, 704, and 705.
- The first component 702 is indicative of a pause or no pause situation. Hence if b0 is set to a value of 1, a pause has been perceived and the appropriate steps will therefore be taken; similarly a 0 at b0 indicates lack of a pause. A similar relationship exists at bit b1, 703, which indicates a voiced or unvoiced phoneme. Bits b2-b3, 704, indicate the contour of the analog speech signal; its assigned value indicates a level slope, a positive slope or a negative slope. Bits b4-b7, 705, indicate a mixture of the relative energy, relative pitch, first distance, and second distance. Bits b4-b7, 705, are encoded so that their value indicates the characteristics of the perceived phoneme relating to the formant distances. Bits b4-b7 are encoded to communicate the distances between the maximums within each formant as illustrated in FIGURE 6. From table 706, each value within the range of bits b4-b7 absolutely defines the two distances.
- FIGURE 8 illustrates the translation of the phoneme code sequence into its appropriate allophone sequence or alternately its alphanumeric counterpart.
- The phoneme sequence 801 is broken into its phoneme codes such as phoneme code 802. The phoneme code 802 distinctly describes a particular phoneme 807. This phoneme 807 is either printed as at 805 in its ASCII alphanumeric character or it is translated to its analogous allophone sequence when it is taken in conjunction with the surrounding phoneme codes
- The allophone sequence 806 is generated through the knowledge of the target phoneme 807 and its relationship to its surrounding phonemes. In this context, the phonemes which precede, 803, and follow, 804, the target phoneme 802 are retained in memory so as to generate the appropriate allophone sequence 806.
- FIGURE 9 illustrates the characteristics of an embodiment of a decisional tree which determines the best approximation of the phoneme template in matching the perceived phoneme.
- The decisional tree is broken up into multiple stages 901, 902, etc. Each stage of the tree breaks the perceived phoneme into feasible and infeasible matches. As the perceived phoneme is further broken into feasible and infeasible states, the infeasible state becomes absorbing and the feasible state decreases so that eventually a single phoneme template is the only possible choice. Hence, the final stage of the tree must consist of as many nodes as there are templates.
- The original decision 903 is made on whether the first bit, b0, is either set or not set. If the first bit is set, transition is made to node 905; the nodes which follow node 904, B1, are ignored. This determination on the b0 level results in translating the available phoneme templates into an infeasible set, those lying exclusively behind node 904, and a feasible set, those lying behind node B2, 905. A similar determination is made for each component part of the indicia. In this example, another separation is made on b1 and then on the value of b2-b3. This separation into nodes is continued until a final or terminating node is encountered which uniquely identifies the phoneme template chosen.
- Movement is acceptable laterally between nodes such as between nodes E1, 908, and E2, 909 via the ray 907. This movement is permissible so long as a cycle is not thereby created. In this context ray 910 indicates a cycle between D1 and C1. For example, a sequence containing C1-D1-C1-D1-C1 is not acceptable since it is a cycle. This sequence causes a never ending cycle which results in a decision never being made. The one qualification of the tree illustrated in this embodiment is that a decision must eventually be reached.
- The algorithm illustrated in FIGURE 9 is but one embodiment to identify the best match between the perceived phoneme and the phoneme template. Another approach is to generate a comparison value for each phoneme template relative to the perceived phoneme and then choose the optimal value accordingly. This approach requires more computation and a longer time for its operation.
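- The 8 bit data word of FIGURE 7 and the staged narrowing of FIGURE 9 can be sketched together as below. Several points are assumptions made only for illustration: b0 is taken to be the least significant bit, the template values in `TEMPLATES` are hypothetical, and each stage keeps the templates whose field is nearest to the perceived value so that the feasible set never becomes empty and a decision is always reached. The specification does not prescribe any of these details.

```python
def pack_word(pause, voiced, contour, distance_code):
    """Build the 8 bit word: b0 pause, b1 voiced, b2-b3 contour, b4-b7 distances."""
    return ((pause & 1)
            | ((voiced & 1) << 1)
            | ((contour & 3) << 2)
            | ((distance_code & 15) << 4))


def unpack_word(word):
    """Split the word back into (pause, voiced, contour, distance_code)."""
    return (word & 1, (word >> 1) & 1, (word >> 2) & 3, (word >> 4) & 15)


# Hypothetical templates: name -> (pause, voiced, contour, distance_code).
TEMPLATES = {
    "PAUSE": (1, 0, 0, 0),
    "SH": (0, 0, 1, 9),
    "CH": (0, 0, 1, 11),
    "IY": (0, 1, 0, 4),
}


def match_template(word, templates):
    """Narrow the feasible set one field at a time, in the order b0, b1, b2-b3, b4-b7."""
    fields = unpack_word(word)
    feasible = sorted(templates)
    for stage, value in enumerate(fields):
        # Keep only the templates nearest to the perceived value at this stage.
        best = min(abs(templates[name][stage] - value) for name in feasible)
        feasible = [name for name in feasible
                    if abs(templates[name][stage] - value) == best]
    return feasible[0]  # remaining ties are broken alphabetically
```

Because every stage only shrinks the feasible set and the stages are visited once each, this sketch cannot cycle, which is the one qualification the text places on the tree.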
- FIGURES 10a and 10b illustrate a phoneme to allophone transformation wherein a phoneme is translated to its analogous allophone sequence.
- In FIGURE 10a, a list of the rules in defining the allophone is set forth. As illustrated "b", 1001 illustrates a blank or a word boundary. The different symbols illustrated indicate different allophonic characteristics which are attachable to a phoneme. The syllables are broken by a period ".", 1002. These allophonic rules are combined with the phonemes to generate the appropriate allophone sequence.
- FIGURE 10b illustrates how the phoneme "CH", 1003, translates into an appropriate allophone sequence. Depending upon the preceding and the following phoneme, the phoneme "CH" is either a "b CH", 1004, as in "chain" or lies within a word as illustrated by "CH", 1005, as in "bewitching".
- Each phoneme maps into a unique allophone sequence. This allophone sequence is determined through knowledge of the preceding phoneme and the following phoneme within the phoneme sequence.
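- The context-dependent translation can be sketched as below. Only the single rule shown in FIGURE 10b is implemented, namely that a phoneme following a word boundary "b" takes its word-initial form. The function names and the phoneme spellings used for "chain" are hypothetical, and the following phoneme is passed in but unused because the text gives no rule that depends on it.

```python
WORD_BOUNDARY = "b"  # blank or word boundary, as in FIGURE 10a


def select_allophone(preceding, target, following):
    """Choose the allophone for `target` from its neighbours.

    `following` is retained because the text says both neighbours are
    consulted, but the one rule sketched here only needs `preceding`.
    """
    if preceding == WORD_BOUNDARY:
        return WORD_BOUNDARY + " " + target  # word-initial form, e.g. "b CH"
    return target                            # form used within a word


def to_allophones(phonemes):
    """Translate a phoneme sequence into its allophone sequence."""
    allophones = []
    for i, target in enumerate(phonemes):
        if target == WORD_BOUNDARY:
            continue
        preceding = phonemes[i - 1] if i > 0 else WORD_BOUNDARY
        following = phonemes[i + 1] if i + 1 < len(phonemes) else WORD_BOUNDARY
        allophones.append(select_allophone(preceding, target, following))
    return allophones


chain = to_allophones(["b", "CH", "EY", "N", "b"])
```

Here `chain` begins with the word-initial "b CH", whereas a "CH" in the middle of a word is passed through unchanged.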
- The invention as described herein details the use of a voice recognition system which translates the analog speech signal into a phoneme sequence which is more susceptible to compaction, storage, transmission, or translation to an analogous allophone sequence for speech synthesis. The phoneme perception allows for an unlimitable vocabulary to be used and also for a best match to be generated. The use of a best match is acceptable since the human ear acts as a filtering mechanism and the human brain ignores random noise so as to also filter the synthesized speech. The synthesized speech is enhanced dramatically through the translation of the phoneme sequence to an analogous allophone sequence. The stored phoneme sequence is susceptible to being translated to an alphanumeric sequence or for transmission via the radio or telephone lines.
- This invention makes it possible for a direct speech to text dictating machine to be implemented and also can be advantageously employed to produce a highly efficient speech data transmission rate.
Claims (8)
Applications Claiming Priority (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US28969081A | 1981-08-03 | 1981-08-03 | |
US289604 | 1981-08-03 | ||
US06/289,604 US4661915A (en) | 1981-08-03 | 1981-08-03 | Allophone vocoder |
US06/289,603 US4424415A (en) | 1981-08-03 | 1981-08-03 | Formant tracker |
US289690 | 1981-08-03 | ||
US289603 | 2002-11-07 |
Publications (3)
Publication Number | Publication Date |
---|---|
EP0071716A2 (en) | 1983-02-16 |
EP0071716A3 (en) | 1983-05-11 |
EP0071716B1 (en) | 1987-08-26 |
Family
ID=27403910
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP19820105168 Expired EP0071716B1 (en) | 1981-08-03 | 1982-06-14 | Allophone vocoder |
Country Status (3)
Country | Link |
---|---|
EP (1) | EP0071716B1 (en) |
JP (1) | JPS5827200A (en) |
DE (1) | DE3277095D1 (en) |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4707858A (en) * | 1983-05-02 | 1987-11-17 | Motorola, Inc. | Utilizing word-to-digital conversion |
FR2547146B1 (en) * | 1983-06-02 | 1987-03-20 | Texas Instruments France | METHOD AND DEVICE FOR HEARING SYNTHETIC SPOKEN MESSAGES AND FOR VIEWING CORRESPONDING GRAPHIC MESSAGES |
DE3513243A1 (en) * | 1985-04-13 | 1986-10-16 | Telefonbau Und Normalzeit Gmbh, 6000 Frankfurt | Method for speech transmission and speech storage |
JPS62231300A (en) * | 1986-03-31 | 1987-10-09 | 郵政省通信総合研究所長 | Automatic zoning of voice processing unit and processing therefor |
FR2642882B1 (en) * | 1989-02-07 | 1991-08-02 | Ripoll Jean Louis | SPEECH PROCESSING APPARATUS |
AU684872B2 (en) * | 1994-03-10 | 1998-01-08 | Cable And Wireless Plc | Communication system |
EP0706172A1 (en) * | 1994-10-04 | 1996-04-10 | Hughes Aircraft Company | Low bit rate speech encoder and decoder |
US5680512A (en) * | 1994-12-21 | 1997-10-21 | Hughes Aircraft Company | Personalized low bit rate audio encoder and decoder using special libraries |
CN1120469C (en) * | 1998-02-03 | 2003-09-03 | 西门子公司 | Method for voice data transmission |
US7353173B2 (en) * | 2002-07-11 | 2008-04-01 | Sony Corporation | System and method for Mandarin Chinese speech recognition using an optimized phone set |
US7353172B2 (en) * | 2003-03-24 | 2008-04-01 | Sony Corporation | System and method for cantonese speech recognition using an optimized phone set |
US7353174B2 (en) * | 2003-03-31 | 2008-04-01 | Sony Corporation | System and method for effectively implementing a Mandarin Chinese speech recognition dictionary |
CN111147444B (en) * | 2019-11-20 | 2021-08-06 | 维沃移动通信有限公司 | Interaction method and electronic equipment |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS5326761A (en) * | 1976-08-26 | 1978-03-13 | Babcock Hitachi Kk | Injecting device for reducing agent for nox |
-
1982
- 1982-06-14 EP EP19820105168 patent/EP0071716B1/en not_active Expired
- 1982-06-14 DE DE8282105168T patent/DE3277095D1/en not_active Expired
- 1982-08-02 JP JP57135070A patent/JPS5827200A/en active Granted
Also Published As
Publication number | Publication date |
---|---|
EP0071716A3 (en) | 1983-05-11 |
DE3277095D1 (en) | 1987-10-01 |
EP0071716A2 (en) | 1983-02-16 |
JPH0576040B2 (en) | 1993-10-21 |
JPS5827200A (en) | 1983-02-17 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
AK | Designated contracting states |
Designated state(s): DE FR GB NL |
|
PUAL | Search report despatched |
Free format text: ORIGINAL CODE: 0009013 |
|
AK | Designated contracting states |
Designated state(s): DE FR GB NL |
|
17P | Request for examination filed |
Effective date: 19831104 |
|
GRAA | (expected) grant |
Free format text: ORIGINAL CODE: 0009210 |
|
AK | Designated contracting states |
Kind code of ref document: B1 Designated state(s): DE FR GB NL |
|
REF | Corresponds to: |
Ref document number: 3277095 Country of ref document: DE Date of ref document: 19871001 |
|
ET | Fr: translation filed | ||
PLBE | No opposition filed within time limit |
Free format text: ORIGINAL CODE: 0009261 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT |
|
26N | No opposition filed | ||
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: NL Payment date: 20010319 Year of fee payment: 20 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: GB Payment date: 20010502 Year of fee payment: 20 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: FR Payment date: 20010531 Year of fee payment: 20 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: DE Payment date: 20010627 Year of fee payment: 20 |
|
REG | Reference to a national code |
Ref country code: GB Ref legal event code: IF02 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: GB Free format text: LAPSE BECAUSE OF EXPIRATION OF PROTECTION Effective date: 20020613 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: NL Free format text: LAPSE BECAUSE OF EXPIRATION OF PROTECTION Effective date: 20020614 |
|
NLV7 | Nl: ceased due to reaching the maximum lifetime of a patent |
Effective date: 20020614 |