GB2250405A - Speech analysis and image synthesis - Google Patents
- Publication number
- GB2250405A GB2250405A GB9119492A GB9119492A GB2250405A GB 2250405 A GB2250405 A GB 2250405A GB 9119492 A GB9119492 A GB 9119492A GB 9119492 A GB9119492 A GB 9119492A GB 2250405 A GB2250405 A GB 2250405A
- Authority
- GB
- United Kingdom
- Prior art keywords
- speech
- mouth
- store
- data
- frame
- Prior art date
- Legal status
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/205—3D [Three Dimensional] animation driven by audio data
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
- G10L2021/105—Synthesis of the lips movements from speech, e.g. for talking heads
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/20—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding
Abstract
Successive frames of speech (at 1) are analysed 2, 3, 4 to produce a sequence of codewords identifying the character of the frame. A store 6 stores probability values Pei indicating the probability that any codeword was produced by one of a set of standard 'mouth shapes', whilst a second store 7 stores values tij indicating the probability of one mouth shape following another. A Viterbi decoder examines the sequence of codewords and, using the probabilities, estimates the most likely sequence of mouth shapes to correspond to the speech. This can be used to generate a synthetic "talking face" moving image, e.g. for videophone or audio conferencing applications.
Description
SPEECH ANALYSIS AND IMAGE SYNTHESIS
The present application relates to the analysis of speech, and more particularly to the analysis of speech to estimate the visual appearance of a mouth by which the speech is uttered. One specific application of such analysis is the synthesis, on the basis of an input speech signal, of a moving image of a human face for display to accompany such speech. Such synthesis may be desired for a video terminal in low bit rate transmission systems, or for enhanced audio-conferencing facilities.
In our European patent application no. 86308732.6 (Publication No. 0225729A) we describe an apparatus for synthesis of a moving image which has a store for an image of a face, and a store for storing a set of data blocks each corresponding to the mouth area of the face and representing a respective different mouth shape. In operation, an input audio signal is analysed to produce sequences of spectral parameters which are then used to access a table relating these parameters to codewords identifying mouth data blocks, the codewords obtained being employed to select the corresponding mouth data blocks for output to a display device.
The present invention is defined in the claims.
Some embodiments of the invention will now be described with reference to the accompanying drawings in which:
Figure 1 is a block diagram of one form of speech analysis apparatus in accordance with the invention;
Figure 2 is a schematic diagram illustrating the operation of the apparatus; and
Figure 3 shows an apparatus for synthesising a moving image, incorporating the speech analysis apparatus of Figure 1.
The purpose of the speech analysis apparatus shown in Figure 1 is to receive an input speech signal at an input 1, analyse it, and estimate at intervals the mouth shape which was most likely to have produced that portion of speech. The output of the apparatus is a sequence of codewords each of which identifies an entry in a codebook of mouth shapes. In this example, the codebook is assumed to contain 16 entries. The actual mouth shapes are not stored in the apparatus of Figure 1.
In the embodiment shown, the speech is firstly sampled at 8kHz and converted into digital form in an analogue-to-digital converter 2 and processed by an LPC analysis unit 3 to produce for successive frames of speech (e.g. of 20ms duration) a set of eight LPC coefficients defining a filter having a spectral response similar to that of the speech frame. Any of the conventional LPC analysis methods commonly used for LPC speech coders may be employed for this purpose. The coefficients are vector quantised in a VQ unit 4, which matches each set of coefficients to the nearest entry in a codebook of (e.g. sixty-four) coefficient sets stored in a speech codebook store 5. This process is again conventional; for example the entry chosen may be that for which the City Block distance (viz the sum of the moduli of the intercoefficient differences) between the actual set and the stored set is a minimum.
The use of vector quantised LPC coefficients is one possible example; LPC cepstral coefficients, or extraction of other speech features, as is common in speech recognition systems, may alternatively be employed.
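By way of illustration, the nearest-entry matching performed by the VQ unit 4 can be sketched as follows. This is only a minimal sketch, assuming the speech codebook is held as a 64 x 8 array; the function name and the use of NumPy are illustrative and not taken from the patent.

```python
import numpy as np

def quantise_lpc(coeffs, speech_codebook):
    """Match one frame's LPC coefficient set to its nearest codebook entry.

    coeffs          : array of shape (8,), LPC coefficients for one 20ms frame
    speech_codebook : array of shape (64, 8), the stored reference coefficient sets
    Returns the index (the speech codeword e) of the entry whose City Block
    distance - the sum of the moduli of the coefficient differences - is smallest.
    """
    distances = np.abs(speech_codebook - coeffs).sum(axis=1)
    return int(np.argmin(distances))
```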
The apparatus also includes a store 6 containing probability values Pei each of which indicates the relative probability that the speech represented by codeword e originated from a mouth having the shape represented by codeword i. In this example, the store has 16 x 64 = 1024 entries.
A further store 7 contains transition probability values tij each of which indicates the relative probability that mouth shape i is followed by mouth shape j; thus it has 16 x 16 = 256 entries.
The mouth codebook, the speech codebook, and the probability values are generated by a training process, analysing a video signal and its accompanying speech. In a test, fifty sentences totalling 200 seconds were used thereby generating 10,000 frames of speech data and (at 25 frames/second) 5,000 frames of video data.
The speech codebook is generated by selecting that set of 64 coefficient sets which substantially minimises the average distance between a training frame and the nearest entry in the set. Similarly, the mouth codebook was generated by selecting that set of sixteen mouth shapes which substantially minimises the average distance between mouth portions of a training video frame and the nearest entry in the set. The distance measure used was simply the City Block distance between two mouths as represented by the height and width of the mouth opening, but naturally more sophisticated measures could be used if desired.
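The patent does not prescribe how the minimising set of mouth shapes is found. One way of approximating it is a simple iterative clustering of the training mouths, sketched below under the assumption that each mouth is described by the height and width of its opening; the k-means-style loop and all names here are illustrative only.

```python
import numpy as np

def train_mouth_codebook(training_mouths, n_entries=16, n_iter=20, seed=0):
    """Choose n_entries mouth shapes that roughly minimise the average City Block
    distance between each training mouth and its nearest codebook entry.

    training_mouths : array of shape (F, 2) giving (height, width) of the mouth
                      opening for each training video frame.
    """
    mouths = np.asarray(training_mouths, dtype=float)
    rng = np.random.default_rng(seed)
    codebook = mouths[rng.choice(len(mouths), n_entries, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each training mouth to its nearest entry under the City Block distance.
        d = np.abs(mouths[:, None, :] - codebook[None, :, :]).sum(axis=2)
        nearest = d.argmin(axis=1)
        # The component-wise median minimises the City Block distance within a cluster.
        for k in range(n_entries):
            members = mouths[nearest == k]
            if len(members):
                codebook[k] = np.median(members, axis=0)
    return codebook
```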
The mouth/speech probability values Pei are generated by, for each frame, matching the video and speech frames to the nearest codebook entry; the number of occurrences of each pair is recorded. For this part of the training, each video frame was repeated once. Likewise the transitional mouth probability values tij are obtained by counting the number of occurrences of mouths i and j appearing consecutively.
If any particular event did not occur in the training sequence, the corresponding probability value was set to a small value rather than zero.
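A minimal sketch of how the two probability stores might be filled from the aligned training sequences follows. The normalisation of the counts and the exact small value substituted for unseen events are not specified in the text, so those choices are assumptions, as are all names.

```python
import numpy as np

N_MOUTHS, N_SPEECH = 16, 64
EPSILON = 1e-6  # small value substituted for events that never occur in training

def train_probabilities(speech_codes, mouth_codes):
    """Build the Pei table (store 6) and the tij table (store 7) by counting.

    speech_codes : list of speech codewords e(n), one per 20ms frame
    mouth_codes  : list of mouth-shape indices, aligned frame for frame
                   (each 25 frames/second video frame repeated once)
    """
    P = np.zeros((N_SPEECH, N_MOUTHS))   # P[e, i]: codeword e produced by mouth i
    t = np.zeros((N_MOUTHS, N_MOUTHS))   # t[i, j]: mouth i followed by mouth j

    for e, i in zip(speech_codes, mouth_codes):
        P[e, i] += 1
    for i, j in zip(mouth_codes, mouth_codes[1:]):
        t[i, j] += 1

    # Convert counts to relative probabilities; zero counts become a small value.
    P = np.where(P > 0, P / max(P.sum(), 1), EPSILON)
    row_sums = np.maximum(t.sum(axis=1, keepdims=True), 1)
    t = np.where(t > 0, t / row_sums, EPSILON)
    return P, t
```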
Returning to Figure 1, the apparatus includes a Viterbi algorithm unit 8. The purpose of this is, for a passage of speech containing N frames, to determine the most likely sequence of mouth shapes having regard to (1) the observed speech as represented by the codewords e(n) (n = 1 ... N) output by the VQ unit 4, (2) the speech/mouth shape probability values Pei stored in the store 6, and (3) the transitional probability values tij stored in the store 7. This process is illustrated schematically in Figure 2.
The application of the Viterbi algorithm to the analysis of the speech information will now be described.
For an utterance (or other portion) of speech a sequence of speech codewords has been produced, one for each frame of speech. The basic procedure is that one calculates, for each frame in succession, the probability that that frame resulted from each of the permitted mouth shapes, taking into account the speech codeword for that frame, the calculated probability for the preceding frame, and the stored probability values. When the end of the sentence is reached, the mouth shape associated with the largest of the calculated probabilities is chosen for that speech frame, whereupon one then re-visits successive preceding frames and makes a similar decision taking into account the previous decision (in respect of the following frame).
We recall that the probability that a particular speech frame having codeword e was generated by mouth shape i is Pei, and that the probability of mouth shape i being followed by shape j is tij - these values being stored in the stores 6, 7. We define Pi(n) for the nth frame as being the calculated probability that that frame resulted from mouth i.
We commence by finding Pi(1) (i = 0 ... 15) for the first frame. There is no previous frame, so we estimate these probabilities on the basis of the codeword e(1) for that frame. Thus: Pi(1) = Pe(1)i (i = 0 ... 15).
For the second frame, we first apply the stored transitional probability values tij to the calculated probabilities Pi(1) from frame 1. For each candidate second-frame mouth shape j, we multiply Pi(1) by the corresponding transitional probability:

Tij(2) = Pi(1) tij (i = 0 ... 15)

We then select the largest of these, Tmax j(2), noting also the value of i associated with it, imax(1,j). The significance of this is that if mouth shape j is the shape chosen for frame 2, then shape imax(1,j) is the most likely one to precede it in frame 1, having regard to the calculated probabilities Pi(1) and the transitional probability values tij.
Having found all these maxima Tmax j(2) (j = 0 ... 15), we then use the frame 2 codeword e(2) to obtain the probability values Pe(2)j from the store and multiply the two to obtain

Pj(2) = Tmax j(2) Pe(2)j

- the calculated probabilities for the second frame.
This process is repeated for successive frames, until we have found Pi(N) for the last frame of the utterance. At this point the first actual decision is made: the mouth shape associated with the largest of the set Pi(N) is chosen for that frame, I(N) denoting the associated value of i.
Recalling that each time we selected the maximum value Tmax j(n) we recorded the previous-frame mouth shape imax(n-1,j) associated with that choice, we can now go back to the penultimate frame N-1 and deduce that the choice of I(N) for the last frame implies selection of shape imax(N-1, I(N)) for frame N-1. The traceback continues in the same way through each preceding frame until a mouth shape has been chosen for every frame of the utterance.
In the above description, the calculated probability value for the first frame was estimated simply on the basis of the received speech, since there was no previous frame. In a modification, a further store 9 is included which contains, for each mouth shape, the probability of its occurring at the beginning of an utterance. These values are then used for the first frame just as the maxima Tmax i(n) are used for later frames.
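Expressed as code, the full-utterance decoding described above might look like the sketch below, reusing the P and t arrays from the training sketch earlier. The optional initial array corresponds to the further store 9; all names are illustrative and this is only one way of arranging the computation.

```python
import numpy as np

def viterbi_mouth_sequence(e, P, t, initial=None):
    """Most likely mouth-shape sequence for one utterance (illustrative sketch).

    e       : list of speech codewords e(1)..e(N), one per frame
    P       : array (64, 16), P[e, i] = Pei from store 6
    t       : array (16, 16), t[i, j] = tij from store 7
    initial : optional array (16,) of start-of-utterance probabilities (store 9);
              if omitted, the first frame is scored from its codeword alone.
    Returns the list of chosen mouth-shape indices, one per frame.
    """
    N, M = len(e), t.shape[0]
    prob = np.zeros((N, M))              # prob[n, i] = Pi(n)
    back = np.zeros((N, M), dtype=int)   # back[n, j] = imax(n-1, j)

    prob[0] = P[e[0]] * (initial if initial is not None else 1.0)
    for n in range(1, N):
        T = prob[n - 1][:, None] * t         # Tij(n) = Pi(n-1) * tij
        back[n] = T.argmax(axis=0)           # best predecessor i for each j
        prob[n] = T.max(axis=0) * P[e[n]]    # Pj(n) = Tmax j(n) * Pe(n)j

    # Traceback: choose the best final shape, then follow the recorded predecessors.
    shapes = [int(prob[N - 1].argmax())]
    for n in range(N - 1, 0, -1):
        shapes.append(int(back[n, shapes[-1]]))
    return shapes[::-1]
```

In a practical implementation the products would normally be accumulated as log-probabilities to avoid numerical underflow over long utterances.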
This description assumes that the application of the Viterbi algorithm is performed after the whole of an utterance (e.g. a sentence) has been received. Where it is desired to perform the analysis in real time this may involve an undesirable delay, and a modified approach may be preferred. Since the algorithm relies upon the history of the speech over a period, some delay is inherent; however, tests indicate that traceback over a period greater than 200ms is of little value.
In the modified method, assuming operation over a window of m frames in length, suppose we start analysis at frame n, the mouth shape for frame n-1 having already been fixed (unless n = 1, in which case that frame is dealt with as already discussed for the first frame). As frame n-1 is fixed, the previous-frame probability Pi(n-1) is unity for the selected entry and zero for all other values of i. The above procedure is then followed up to frame n+m, the speech codeword for which has just been generated by the VQ unit 4, with traceback to frame n. The mouth shape for frame n is then fixed, and the decisions for later frames discarded. When frame n+m+1 is available, the process is then repeated, starting at frame n+1 and extending up to frame n+m+1.
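A corresponding sketch of the modified, limited-traceback method is given below. The window length of ten 20ms frames reflects the 200ms figure mentioned above; the treatment of the final m frames (here simply left undecided) and all names are assumptions made for the sake of the example.

```python
import numpy as np

def decode_realtime(codewords, P, t, m=10):
    """Fix the mouth shape for frame n once frame n+m has been received.

    codewords : speech codewords; in a live system these arrive one at a time
                from the VQ unit, here a list is used for simplicity.
    m         : traceback window in frames (10 x 20ms = 200ms).
    Returns the mouth-shape indices fixed so far (the last m frames remain open).
    """
    M = t.shape[0]
    fixed = []
    prev = None                                   # shape already fixed for frame n-1
    for n in range(len(codewords) - m):
        L = m + 1
        prob = np.zeros((L, M))
        back = np.zeros((L, M), dtype=int)
        if prev is None:                          # very first frame of the speech
            prob[0] = P[codewords[n]]
        else:                                     # previous frame certain: apply tij from it
            prob[0] = t[prev] * P[codewords[n]]
        for k in range(1, L):
            T = prob[k - 1][:, None] * t
            back[k] = T.argmax(axis=0)
            prob[k] = T.max(axis=0) * P[codewords[n + k]]
        # Trace back from the best shape at frame n+m; keep only the decision for frame n.
        j = int(prob[L - 1].argmax())
        for k in range(L - 1, 0, -1):
            j = int(back[k, j])
        fixed.append(j)
        prev = j
    return fixed
```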
Figure 3 shows an apparatus for synthesis of a moving image including a human face having a mouth which moves in correspondence with a speech signal received at an input 1. The input 1 is connected to the input of an analysis apparatus 100 which is of the structure described above and shown in Figure 1.
A store 101 contains data representing a stored image of a face; this may be for example a stored digital representation of a raster-scan television picture, the store being a conventional video frame store.
A second store 102 stores sixteen data blocks each being a digital representation of a mouth having a respective one of the mouth shapes discussed above.
For the purposes of the present description it is assumed that each block is stored in pixel map form - i.e. a mouth could be superimposed into the face stored in the store 101 simply by writing each picture element from the block into the appropriate location in the store 101. However, other, parametric, methods of mouth representation could be used.
The analysis apparatus 100 produces at its output, every 20ms, a codeword identifying one of the data blocks in the store 102. An output unit 103 reads the face data from the store 101 every 20ms to form a raster scan television signal; when however it requires picture data in respect of a portion of the picture area corresponding to the mouth, it reads the data instead from the portion of the mouth data store 102 identified by the codeword supplied by the analysis apparatus 100, so that the desired mouth shape is incorporated into the image. The video signal output on an output line 104 can be displayed on a video monitor 105.
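The read-out step performed by the output unit 103 amounts to substituting the selected mouth block for the mouth region of the stored face. A minimal pixel-map sketch follows; the (row, column) origin of the mouth region and all names are assumptions, since the patent only requires that pixels in that region be read from the selected block instead of the face store.

```python
import numpy as np

def compose_frame(face, mouth_blocks, codeword, mouth_origin):
    """Write the selected mouth data block into the face image (pixel-map form).

    face         : 2-D array holding the stored face image (store 101)
    mouth_blocks : sequence of 16 mouth images (store 102)
    codeword     : mouth-shape index output by the analysis apparatus every 20ms
    mouth_origin : (row, col) of the top-left corner of the mouth region in the face
    """
    frame = face.copy()
    mouth = np.asarray(mouth_blocks[codeword])
    r, c = mouth_origin
    h, w = mouth.shape[:2]
    frame[r:r + h, c:c + w] = mouth   # mouth pixels replace the face pixels
    return frame
```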
This description assumes a video rate of 50 frames per second; in order to generate a signal at the UK Standard (System I) rate of 25 frames per second one could use a 40ms speech frame period, but if it is preferred to retain the 20ms analysis period then the frame rate could be reduced by omitting alternate mouth shapes, or (more preferably, to avoid aliasing) by temporally filtering the mouth image sequence and then discarding alternate images. Obviously other video frame rates, such as the 30 frames/second used in System M, can be achieved by similar adjustments.
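As a sketch of the second, preferred option - temporal filtering followed by discarding alternate images - a simple two-tap averaging filter could be used; the patent does not prescribe any particular filter, so the choice below is an assumption.

```python
import numpy as np

def halve_frame_rate(mouth_images):
    """Reduce a 50 frame/s mouth-image sequence to 25 frame/s.

    mouth_images : sequence of mouth images (arrays of identical shape).
    Averaging consecutive images acts as a temporal low-pass filter, after
    which alternate filtered images are discarded.
    """
    imgs = np.asarray(mouth_images, dtype=float)
    filtered = 0.5 * (imgs[:-1] + imgs[1:])   # temporal low-pass
    return filtered[::2]                       # keep every other filtered image
```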
Claims (9)
1. A speech analysis apparatus comprising means for analysing a speech signal to generate at intervals items of speech information each representative of the speech during that interval;
first store means storing, for each possible item of speech information, data representing the probabilities that a particular portion of speech corresponding to that item of information has been generated by a mouth having each of a predetermined plurality of mouth shapes;
second store means storing, for each possible one of the plurality of mouth shapes, data representing the probabilities that that shape is followed by that shape and by each other one of those shapes; and
decoding means responsive to the generated items of speech information, to the probability data stored in said first store means in respect of those generated items, and to the probability data stored in the second store means to determine a sequence of mouth shapes deemed to be substantially most likely to correspond to the said generated items.
2. An apparatus according to claim 1 in which the speech analysis means comprises
(a) means to analyse each of successive frames of speech to produce therefor a set of parameters representing the spectral content thereof;
(b) a fourth store storing a plurality of reference sets of parameters; and
(c) means to determine, for each produced parameter set, which of the stored sets it most closely resembles;
the said items of speech information being codewords identifying the determined reference sets.
3. An apparatus according to claim 2 in which the parameters are the coefficients of a linear prediction filter.
4. An apparatus according to claim 1, 2 or 3 in which the decoding means is a Viterbi decoder.
5. An apparatus according to any one of the preceding claims including further store means storing for each of the plurality of mouth shapes data representing the probability of its occurrence at the commencement of an utterance, the decoding means being responsive also to the data stored in the further store means.
6. An apparatus for synthesis of a moving image, comprising
(a) means for storage and output of data representing an image of a face;
(b) means for storage and output of a set of mouth data blocks each representing an image of a mouth having a respective one of the said shapes;
(c) a speech signal input;
(d) a speech analysis apparatus according to any one of the preceding claims to receive the speech input; and
(e) control means responsive to the output of the speech analysis apparatus to select for output from the mouth data storage means the data blocks corresponding to the determined sequence.
7. An apparatus according to claim 6 further including video signal generating means operable to generate video frames each representing a face image corresponding to the stored face data having superimposed thereon a mouth image represented by a said selected mouth data block.
8. An apparatus for speech analysis substantially as herein described with reference to the accompanying drawings.
9. An apparatus for synthesis of a moving image substantially as herein described with reference to the accompanying drawings.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB909019829A GB9019829D0 (en) | 1990-09-11 | 1990-09-11 | Speech analysis and image synthesis |
Publications (2)
Publication Number | Publication Date |
---|---|
GB9119492D0 GB9119492D0 (en) | 1991-10-23 |
GB2250405A true GB2250405A (en) | 1992-06-03 |
Family
ID=10682008
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB909019829A Pending GB9019829D0 (en) | 1990-09-11 | 1990-09-11 | Speech analysis and image synthesis |
GB9119492A Withdrawn GB2250405A (en) | 1990-09-11 | 1991-09-11 | Speech analysis and image synthesis |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB909019829A Pending GB9019829D0 (en) | 1990-09-11 | 1990-09-11 | Speech analysis and image synthesis |
Country Status (1)
Country | Link |
---|---|
GB (2) | GB9019829D0 (en) |
-
1990
- 1990-09-11 GB GB909019829A patent/GB9019829D0/en active Pending
-
1991
- 1991-09-11 GB GB9119492A patent/GB2250405A/en not_active Withdrawn
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0056507A1 (en) * | 1981-01-19 | 1982-07-28 | Richard Welcher Bloomstein | Apparatus and method for creating visual images of lip movements |
EP0179701A1 (en) * | 1984-10-02 | 1986-04-30 | Yves Guinet | Television method for multilingual programmes |
EP0225729A1 (en) * | 1985-11-14 | 1987-06-16 | BRITISH TELECOMMUNICATIONS public limited company | Image encoding and synthesis |
GB2231246A (en) * | 1989-03-08 | 1990-11-07 | Kokusai Denshin Denwa Co Ltd | Converting text input into moving-face picture |
Cited By (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0603809A3 (en) * | 1992-12-21 | 1994-08-17 | Casio Computer Co Ltd | Object image display devices. |
EP0603809A2 (en) * | 1992-12-21 | 1994-06-29 | Casio Computer Co., Ltd. | Object image display devices |
US5608839A (en) * | 1994-03-18 | 1997-03-04 | Lucent Technologies Inc. | Sound-synchronized video system |
EP0673170A2 (en) * | 1994-03-18 | 1995-09-20 | AT&T Corp. | Video signal processing systems and methods utilizing automated speech analysis |
EP0674315A1 (en) * | 1994-03-18 | 1995-09-27 | AT&T Corp. | Audio visual dubbing system and method |
EP0673170A3 (en) * | 1994-03-18 | 1996-06-26 | At & T Corp | Video signal processing systems and methods utilizing automated speech analysis. |
EP0689362A2 (en) * | 1994-06-21 | 1995-12-27 | AT&T Corp. | Sound-synchronised video system |
EP0689362A3 (en) * | 1994-06-21 | 1996-06-26 | At & T Corp | Sound-synchronised video system |
WO1996002898A1 (en) * | 1994-07-18 | 1996-02-01 | 477250 B.C. Ltd. | Process of producing personalized video cartoons |
EP0710929A2 (en) * | 1994-11-07 | 1996-05-08 | AT&T Corp. | Acoustic-assisted image processing |
EP0710929A3 (en) * | 1994-11-07 | 1996-07-03 | At & T Corp | Acoustic-assisted image processing |
EP0734162A2 (en) * | 1995-03-24 | 1996-09-25 | Deutsche Thomson-Brandt Gmbh | Communication terminal |
EP0734162A3 (en) * | 1995-03-24 | 1997-05-21 | Thomson Brandt Gmbh | Communication terminal |
WO1997046974A1 (en) * | 1996-06-03 | 1997-12-11 | Pronier Jean Luc | Device and method for transmitting animated and sound images |
FR2749420A1 (en) * | 1996-06-03 | 1997-12-05 | Alfonsi Philippe | METHOD AND DEVICE FOR FORMING MOVING IMAGES OF A CONTACT PERSON |
EP0860811A3 (en) * | 1997-02-24 | 1999-02-10 | Digital Equipment Corporation | Automated speech alignment for image synthesis |
EP0860811A2 (en) * | 1997-02-24 | 1998-08-26 | Digital Equipment Corporation | Automated speech alignment for image synthesis |
US6385580B1 (en) | 1997-03-25 | 2002-05-07 | Telia Ab | Method of speech synthesis |
WO1998043236A3 (en) * | 1997-03-25 | 1998-12-23 | Telia Ab | Method of speech synthesis |
WO1998043235A3 (en) * | 1997-03-25 | 1998-12-23 | Telia Ab | Device and method for prosody generation at visual synthesis |
WO1998043236A2 (en) * | 1997-03-25 | 1998-10-01 | Telia Ab (Publ) | Method of speech synthesis |
WO1998043235A2 (en) * | 1997-03-25 | 1998-10-01 | Telia Ab (Publ) | Device and method for prosody generation at visual synthesis |
US6389396B1 (en) | 1997-03-25 | 2002-05-14 | Telia Ab | Device and method for prosody generation at visual synthesis |
EP0893923A1 (en) * | 1997-07-23 | 1999-01-27 | Texas Instruments France | Video communication system |
WO2000045380A1 (en) * | 1999-01-27 | 2000-08-03 | Bright Spark Technologies (Proprietary) Limited | Voice driven mouth animation system |
US7369992B1 (en) * | 2002-05-10 | 2008-05-06 | At&T Corp. | System and method for triphone-based unit selection for visual speech synthesis |
US7933772B1 (en) | 2002-05-10 | 2011-04-26 | At&T Intellectual Property Ii, L.P. | System and method for triphone-based unit selection for visual speech synthesis |
US9583098B1 (en) | 2002-05-10 | 2017-02-28 | At&T Intellectual Property Ii, L.P. | System and method for triphone-based unit selection for visual speech synthesis |
Also Published As
Publication number | Publication date |
---|---|
GB9119492D0 (en) | 1991-10-23 |
GB9019829D0 (en) | 1990-10-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6330023B1 (en) | Video signal processing systems and methods utilizing automated speech analysis | |
GB2250405A (en) | Speech analysis and image synthesis | |
EP0225729B1 (en) | Image encoding and synthesis | |
US5890120A (en) | Matching, synchronization, and superposition on original speaking subject images of modified signs from sign language database corresponding to recognized speech segments | |
CN111816158B (en) | Speech synthesis method and device and storage medium | |
EP1141939B1 (en) | System and method for segmentation of speech signals | |
EP0453649B1 (en) | Method and apparatus for modeling words with composite Markov models | |
KR880700387A (en) | Speech processing system and voice processing method | |
Chen et al. | Lip synchronization using speech-assisted video processing | |
CN108648745B (en) | Method for converting lip image sequence into voice coding parameter | |
EP0731348A3 (en) | Voice storage and retrieval system | |
SE470577B (en) | Method and apparatus for encoding and / or decoding background noise | |
JPS59101700A (en) | Method and apparatus for spoken voice recognition | |
JPH0527792A (en) | Voice emphasizing device | |
Chen et al. | Speech-assisted video processing: Interpolation and low-bitrate coding | |
JP2611728B2 (en) | Video encoding / decoding system | |
JPH09198082A (en) | Speech recognition device | |
FR2692070A1 (en) | Variable speed voice synthesis method and device. | |
KR20040037099A (en) | Viseme based video coding | |
Wrench | A realtime implementation of a text independent speaker recognition system | |
JP3254696B2 (en) | Audio encoding device, audio decoding device, and sound source generation method | |
JPS59111699A (en) | Speaker recognition system | |
JP2709198B2 (en) | Voice synthesis method | |
Shah et al. | An image/speech relational database and its application | |
Shah et al. | Lip synchronization through alignment of speech and image data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WAP | Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1) |