USRE42000E1 - System for synchronization between moving picture and a text-to-speech converter - Google Patents

Info

Publication number
USRE42000E1
Authority
US
United States
Prior art keywords
information
lip
synchronization
text
down motion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
US10/038,153
Inventor
Jae Woo Yang
Jung Chul Lee
Min Soo Hahn
Hang Seop Lee
YoungJik Lee
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Electronics and Telecommunications Research Institute ETRI
Original Assignee
Electronics and Telecommunications Research Institute ETRI
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Electronics and Telecommunications Research Institute ETRI
Priority to US10/038,153
Application granted
Publication of USRE42000E1
Anticipated expiration
Expired - Lifetime

Classifications

    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B20/00 Signal processing not specific to the method of recording or reproducing; Circuits therefor
    • G11B20/02 Analogue recording or reproducing
    • G11B20/04 Direct recording or reproducing
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43 Querying
    • G06F16/438 Presentation of query results
    • G06F16/4387 Presentation of query results by the use of playlists
    • G06F16/4393 Multimedia presentations, e.g. slide shows, multimedia albums

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Processing Or Creating Images (AREA)
  • Machine Translation (AREA)

Abstract

A method of formatting and normalizing continuous lip motions to events in a moving picture besides text in a Text-To-Speech converter is provided. A synthesized speech is synchronized with a moving picture by using the method wherein the real speech data and the shape of a lip in the moving picture are analyzed, and information on the estimated lip shape and text information are directly used in generating the synthesized speech.

Description

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a system for synchronization between a moving picture and a text-to-speech (TTS) converter, and more particularly to a system that synchronizes a moving picture with synthesized speech by using the timing of lip movements and the duration information of the speech.
2. Description of the Related Art
In general, a speech synthesizer provides a user with various types of information in an audible form. For this purpose, the speech synthesizer should provide the user with a high-quality speech synthesis service from the input text. In addition, for the speech synthesizer to be operatively coupled to a database constructed in a multi-media environment, or to various media provided by a counterpart involved in a conversation, it should be able to generate synthesized speech that is synchronized with these media. In particular, synchronization between a moving picture and the TTS is essential to providing a user with a high-quality service.
FIG. 1 shows a block diagram of a conventional text-to-speech converter which generally consists of three steps in generating a synthesized speech from the input text.
At step 1, a language processing unit 1 converts an input text to a phoneme string, estimates prosodic information, and symbolizes it. The symbols of the prosodic information are estimated from the phrase boundary, clause boundary, accent position, sentence pattern, etc. by analyzing the syntactic structure. At step 2, a prosody processing unit 2 calculates the values of the prosody control parameters from the symbolized prosodic information by using rules and tables. The prosody control parameters include phoneme duration and pause interval information. Finally, at step 3, a signal processing unit 3 generates a synthesized speech by using a synthesis unit DB 4 and the prosody control parameters. That is, the conventional synthesizer must estimate the prosodic information related to naturalness and speaking rate from the input text alone, in the language processing unit 1 and the prosody processing unit 2.
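The three steps can be pictured as a chain of functions. The following is a minimal sketch only: the function names, the one-letter-per-phoneme conversion, the duration values, and the dictionary standing in for the synthesis unit DB are all assumptions made for illustration and are not taken from the specification.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Phoneme:
    symbol: str
    duration_ms: float = 0.0  # filled in by the prosody stage


def language_processing(text: str) -> List[Phoneme]:
    """Step 1 (toy): treat each letter as one phoneme; a space becomes a pause '_'."""
    return [Phoneme("_" if ch == " " else ch.lower())
            for ch in text if ch.isalpha() or ch == " "]


VOWELS = set("aeiou")  # hypothetical rule table: vowels last longer than consonants


def prosody_processing(phonemes: List[Phoneme]) -> List[Phoneme]:
    """Step 2 (toy): assign each phoneme a duration from fixed rules."""
    for p in phonemes:
        if p.symbol == "_":
            p.duration_ms = 150.0
        elif p.symbol in VOWELS:
            p.duration_ms = 90.0
        else:
            p.duration_ms = 60.0
    return phonemes


def signal_processing(phonemes: List[Phoneme],
                      unit_db: Dict[str, str]) -> List[Tuple[str, float]]:
    """Step 3 (toy): look up one synthesis unit per phoneme and pair it with its duration."""
    return [(unit_db.get(p.symbol, "<unk>"), p.duration_ms) for p in phonemes]


unit_db = {ch: "unit:" + ch for ch in "abcdefghijklmnopqrstuvwxyz_"}
speech = signal_processing(prosody_processing(language_processing("Hi all")), unit_db)
```

Note that every duration here is derived from the text alone, which is exactly the limitation discussed in the following paragraphs.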
At present, much research on TTS has been conducted throughout the world for application to native languages, and some countries have already started commercial services. However, the conventional synthesizer is aimed at synthesizing speech from an input text, and thus there is no research activity on a synthesizing method that can be used in connection with multi-media. In addition, when dubbing is performed on a moving picture or animation by using the conventional TTS method, the information required to synchronize the media with the synthesized speech cannot be estimated from the text alone. Thus, it is not possible to generate, from text information alone, a synthesized speech that is smoothly and operatively coupled to moving pictures.
If the synchronization between a moving picture and a synthesized speech is regarded as a kind of dubbing, there are three possible implementation methods. The first synchronizes the moving picture with the synthesized speech on a sentence basis. This method regulates the time duration of the synthesized speech by using information on the start point and end point of the sentence. It has the advantage that it is easy to implement and requires minimal additional effort. However, smooth synchronization cannot be achieved with this method. As an alternative, there is a method wherein the start point, end point, and phoneme symbol of every phoneme are transcribed for the interval of the moving picture related to a speech signal, and are used in generating the synthesized speech. Since the moving picture can be synchronized with the synthesized speech for each phoneme with this method, the accuracy is enhanced. However, this method has the disadvantage that additional effort is required to detect and record time duration information for every phoneme in a speech interval of the moving picture.
As another alternative, there is a method wherein synchronization information is recorded based on patterns having the characteristic by which a lip motion can be easily distinguished, such as the start and end points of the speech, the opening and closing of the lip, protrusion of the lip, etc. This method can enhance the efficiency of synchronization while minimizing the additional efforts exerted to make information for synchronization.
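For comparison, the sentence-based method described first amounts to a single uniform scaling of all phoneme durations. The sketch below assumes durations and sentence boundaries are expressed in milliseconds; the function name `fit_to_sentence` and the sample values are hypothetical.

```python
def fit_to_sentence(durations_ms, start_ms, end_ms):
    """Scale every phoneme duration by one common factor so that the
    synthesized sentence fills the interval [start_ms, end_ms]."""
    target = end_ms - start_ms
    total = sum(durations_ms)
    if target <= 0 or total <= 0:
        raise ValueError("sentence interval and total duration must be positive")
    return [d * target / total for d in durations_ms]


estimated = [60.0, 90.0, 150.0, 100.0]  # durations estimated from text alone
fitted = fit_to_sentence(estimated, start_ms=1000.0, end_ms=1600.0)
```

Because one factor is applied to the whole sentence, the relative timing inside the sentence is unchanged, which is why lip movements within the sentence can still drift out of step with the speech.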
SUMMARY OF THE INVENTION
It is therefore an object of the present invention to provide a method of formatting and normalizing continuous lip motions into events in a moving picture, in addition to a text, for use in a text-to-speech converter.
It is another object of the invention to provide a system for synchronization between moving picture and a synthesized speech by defining an interface between event information and the TTS and using it in generating the synthesized speech.
In accordance with one aspect of the present invention, a system for synchronization between moving picture and a text-to-speech converter is provided which comprises distributing means for receiving multi-media input information, transforming it into the respective data structures, and distributing it to each medium; image output means for receiving image information of the multi-media information from said distributing means; language processing means for receiving language texts of the multi-media information from said distributing means, transforming the text into a phoneme string, and estimating and symbolizing prosodic information; prosody processing means for receiving the processing result from said language processing means and calculating the values of prosodic control parameters; synchronization adjusting means for receiving the processing results from said prosody processing means, adjusting time durations for every phoneme for synchronization with image signals by using synchronization information of the multi-media information from said distributing means, and inserting the adjusted time durations into the results of said prosody processing means; signal processing means for receiving the processing results from said synchronization adjusting means to generate a synthesized speech; and a synthesis unit database block for selecting required units for synthesis in accordance with a request from said signal processing means, and transferring the required data.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention will become more apparent upon a detailed description of the preferred embodiments for carrying out the invention as rendered below. In the description to follow, references will be made to the accompanying drawings, where like reference numerals are used to identify like or similar elements in the various drawings and in which:
FIG. 1 shows a block diagram of a conventional text-to-speech converter;
FIG. 2 shows a block diagram of a synchronization system in accordance with the present invention;
FIG. 3 shows a detailed block diagram to illustrate a method of synchronizing a text-to-speech converter; and
FIG. 4 shows a flow chart to illustrate a method of synchronizing a text-to-speech converter.
DETAILED DESCRIPTION OF THE INVENTION
FIG. 2 shows a block diagram of a synchronization system in accordance with the present invention. In FIG. 2, reference numerals 5, 6, 7, 8 and 9 indicate a multi-data input unit, a central processing unit, a synthesized database, a digital/analog (D/A) converter, and an image output unit, respectively.
Data comprising multi-media such as an image, text, etc. is inputted to the multi-data input unit 5, which outputs the input data to the central processing unit 6. The algorithm in accordance with the present invention is embedded in the central processing unit 6. The synthesized database 7, a synthesized DB for use in the synthesis algorithm, is stored in a storage device and transmits necessary data to the central processing unit 6. The digital/analog converter 8 converts the synthesized digital data into an analog signal and outputs it to the exterior. The image output unit 9 displays the input image information on the screen.
Table 1 as shown below illustrates one example of structured multi-media input information to be used in connection with the present invention. The structured information includes a text, moving picture, lip shape, information on positions in the moving picture, and information on the time duration. The lip shape can be transformed into numerical values based on a degree of a down motion of a lower lip, up and down motion at the left edge of an upper lip, up and down motion at the right edge of an upper lip, up and down motion at the left edge of a lower lip, up and down motion at the right edge of a lower lip, up and down motion at the center portion of an upper lip, up and down motion at the center portion of a lower lip, degree of protrusion of an upper lip, degree of protrusion of a lower lip, distance from the center of a lip to the right edge of a lip, and distance from the center of a lip to the left edge of a lip. The lip shape can also be defined in a quantified and normalized pattern in accordance with the position and manner of articulation for each phoneme. The information on positions is defined by the position of a scene in a moving picture, and the time duration is defined by the number of the scenes in which the same lip shape is maintained.
TABLE 1
Example of Synchronization Information

Input Information | Parameter | Parameter Value
text | sentence |
moving picture | scene |
synchronization information | lip shape | degree of a down motion of a lower lip, up and down motion at the left edge of an upper lip, up and down motion at the right edge of an upper lip, up and down motion at the left edge of a lower lip, up and down motion at the right edge of a lower lip, up and down motion at the center portion of an upper lip, up and down motion at the center portion of a lower lip, degree of protrusion of an upper lip, degree of protrusion of a lower lip, distance from the center of a lip to the right edge of a lip, and distance from the center of a lip to the left edge of a lip
synchronization information | information on position | position of scene in moving picture
synchronization information | time duration | number of continuous scenes
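The structure in Table 1 could be represented in code roughly as follows. This is a sketch under stated assumptions: the class and field names are hypothetical, a "scene" is treated as one video frame, and the scenes-per-second rate used to convert a scene count into milliseconds is an assumed parameter that the specification does not give.

```python
from dataclasses import dataclass
from typing import List, Tuple

# One name per lip-shape measurement, in the order listed in Table 1.
LIP_FEATURES = (
    "lower_lip_down",
    "upper_left_edge_up_down",
    "upper_right_edge_up_down",
    "lower_left_edge_up_down",
    "lower_right_edge_up_down",
    "upper_center_up_down",
    "lower_center_up_down",
    "upper_protrusion",
    "lower_protrusion",
    "center_to_right_edge",
    "center_to_left_edge",
)


@dataclass
class SyncEvent:
    lip_shape: Tuple[float, ...]  # one numerical value per entry in LIP_FEATURES
    position: int                 # position of the scene in the moving picture
    duration: int                 # number of continuous scenes with this lip shape

    def __post_init__(self):
        if len(self.lip_shape) != len(LIP_FEATURES):
            raise ValueError("lip_shape needs one value per lip feature")

    def duration_ms(self, scenes_per_second: float) -> float:
        """Convert the scene count to milliseconds (the rate is an assumption)."""
        return 1000.0 * self.duration / scenes_per_second


@dataclass
class MultimediaInput:
    text: str              # the sentence
    scenes: list           # the moving picture, one entry per scene
    sync: List[SyncEvent]  # the synchronization information


event = SyncEvent(lip_shape=(0.0,) * 11, position=12, duration=6)
```

With an assumed rate of 30 scenes per second, `event.duration_ms(30.0)` evaluates to 200.0, that is, six scenes held for 200 ms.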
FIG. 3 shows a detailed block diagram to illustrate a method of synchronizing a text-to-speech converter and FIG. 4 shows a flow chart to illustrate a method of synchronizing a text-to-speech converter. In FIG. 3, reference numerals 10, 11, 12, 13, 14, 15, 16 and 17 indicate a multi-media information input unit, a multi-media distributor, a standardized language processing unit, a prosody processing unit, a synchronization adjusting unit, a signal processing unit, a synthesis unit database, and an image output unit, respectively.
The multi-media information in the multi-media information input unit 10 is structured in the format shown above in Table 1, and comprises a text, moving picture, lip shape, information on positions in the moving picture, and information on time durations. The multi-media distributor 11 receives the multi-media information from the multi-media information input unit 10, and transfers images and texts of the multi-media information to the image output unit 17 and the language processing unit 12, respectively. When the synchronization information is transferred, it is converted into a data structure which can be used in the synchronization adjusting unit 14.
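The routing performed by the multi-media distributor 11 can be sketched as a function that splits one structured input into per-unit payloads. The dictionary keys, the string lip-shape labels, and the choice to hand the synchronization adjusting unit a position-ordered list of (lip shape, scene count) pairs are all assumptions made for this illustration.

```python
def distribute(info):
    """Route each part of the structured input to the unit that consumes it."""
    # Order the synchronization events by their position in the moving picture.
    sync = sorted(info["synchronization"], key=lambda e: e["position"])
    return {
        "image_output_unit": info["moving_picture"],
        "language_processing_unit": info["text"],
        # Converted structure for the synchronization adjusting unit.
        "synchronization_adjusting_unit": [
            (e["lip_shape"], e["duration"]) for e in sync
        ],
    }


routed = distribute({
    "text": "hello",
    "moving_picture": ["scene0", "scene1", "scene2"],
    "synchronization": [
        {"lip_shape": "open", "position": 2, "duration": 1},
        {"lip_shape": "closed", "position": 0, "duration": 2},
    ],
})
```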
The language processing unit 12 converts the texts received from the multi-media distributor 11 into a phoneme string, and estimates and symbolizes prosodic information, which it transfers to the prosody processing unit 13. The symbols for the prosodic information are estimated from the phrase boundary, clause boundary, accent position, sentence pattern, etc. by using the results of the analysis of the syntactic structure.
The prosody processing unit 13 receives the processing results from the language processing unit 12, and calculates the values of the prosodic control parameters. The prosodic control parameters include the time duration of phonemes, contour of pitch, contour of energy, and the position and length of pauses. The calculated results are transferred to the synchronization adjusting unit 14.
The synchronization adjusting unit 14 receives the processing results from the prosody processing unit 13, and adjusts the time durations for every phoneme to synchronize with the image signal by using the synchronization information received from the multi-media distributor 11. To adjust the time durations of the phonemes, a lip shape is allocated to each phoneme in accordance with the position and manner of articulation of that phoneme, and the series of phonemes is divided into small groups corresponding to the number of lip shapes recorded in the synchronization information, by comparing the lip shape allocated to each phoneme with the lip shape in the synchronization information.
The time durations of the phonemes in each small group are then recalculated by using the information on the time durations of the lip shapes which is included in the synchronization information. The adjusted time duration information is inserted into the results of the prosody processing unit 13, and is transferred to the signal processing unit 15.
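The grouping and recalculation performed by the synchronization adjusting unit 14 can be sketched as follows. This is one possible reading, not the patented procedure itself: the phoneme-to-lip-shape table, the three coarse lip-shape classes, the greedy left-to-right grouping rule, and the proportional rescaling within each group are all assumptions chosen to keep the example small.

```python
# Hypothetical lip-shape classes assigned by place and manner of articulation.
PHONEME_LIP = {"m": "closed", "b": "closed", "p": "closed",
               "o": "round", "u": "round", "w": "round"}


def lip_shape_of(phoneme):
    return PHONEME_LIP.get(phoneme, "open")


def group_phonemes(phonemes, events):
    """Split the phoneme series into one group of indices per recorded lip shape.
    Move on to the next event when a phoneme's lip shape matches it."""
    if not events:
        return []
    groups = [[] for _ in events]
    k = 0
    for i, ph in enumerate(phonemes):
        has_next = k + 1 < len(events)
        if has_next and groups[k] and lip_shape_of(ph) == events[k + 1][0]:
            k += 1
        groups[k].append(i)
    return groups


def adjust_durations(phonemes, durations_ms, events):
    """Rescale the durations in each group so the group lasts as long as its lip shape."""
    adjusted = list(durations_ms)
    for (_, target_ms), idxs in zip(events, group_phonemes(phonemes, events)):
        total = sum(durations_ms[i] for i in idxs)
        if total <= 0:
            continue  # no phonemes fell into this group; nothing to rescale
        for i in idxs:
            adjusted[i] = durations_ms[i] * target_ms / total
    return adjusted


phonemes = ["b", "a", "t", "o"]
estimated = [60.0, 90.0, 60.0, 90.0]  # from the prosody stage
events = [("closed", 80.0), ("open", 300.0), ("round", 120.0)]  # (lip shape, ms)
adjusted = adjust_durations(phonemes, estimated, events)
```

In this example the phonemes fall into the groups [b], [a, t], and [o], and the middle group is stretched from 150 ms to the 300 ms during which the open lip shape is held, while the ratio between "a" and "t" is preserved.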
The signal processing unit 15 receives the processing results from the synchronization adjusting unit 14, and generates a synthesized speech by using the synthesis unit DB 16 to output it. The synthesis unit DB 16 selects the synthesis units required for synthesis in accordance with the request from the signal processing unit 15, and transfers required data to the signal processing unit 15.
In accordance with the present invention, a synthesized speech can be synchronized with a moving picture by using the method wherein the real speech data and the shape of a lip in the moving picture are analyzed, and information on the estimated lip shape and text information are directly used in generating the synthesized speech. Accordingly, foreign-language movies can be dubbed into a target language. Further, the present invention can be used in various applications such as communication services, office automation, education, etc., since it makes synchronization of image information with the TTS possible in a multi-media environment.
The present invention has been described with reference to a particular embodiment in connection with a particular application. Those having ordinary skill in the art and access to the teachings of the present invention will recognize additional modifications and applications within the scope thereof.
It is therefore intended by the appended claims to cover any and all such applications, modifications, and embodiments within the scope of the present invention.

Claims (5)

1. A system for synchronization between a moving picture and a text-to-speech converter, comprising:
distributing means for receiving multi-media input information, transforming said multi-media input information into respective data structures, and distributing the respective data structures for further processing;
image output means for receiving image information of the distributed multi-media information and displaying the image information;
language processing means for receiving language texts of the distributed multi-media information, transforming the language texts into phoneme strings, and estimating and symbolizing prosodic information from the language texts;
prosody processing means for receiving the prosodic information from said language processing means, and calculating values of prosodic control parameters;
synchronization adjusting means for receiving the prosodic control parameters from said prosody processing means, adjusting time durations for every phoneme for synchronization with the image information by using synchronization information of the distributed multi-media information, and inserting adjusted time durations into the prosodic control parameters;
signal processing means for receiving the processing results from said synchronization adjusting means and generating a synthesized speech; and
a synthesis unit database block for selecting required units for synthesis in accordance with a request from said signal processing means, and transmitting the required data to said signal processing means.
2. The system according to claim 1, wherein the multi-media information comprises:
the language texts, image information on moving picture, and synchronization information,
and wherein the synchronization information includes:
a text, information on a lip shape, information on image positions in the moving picture, and information on time durations.
3. The system according to claim 2, wherein the information on the lip shape can be transformed into numerical values based on a degree of a down motion of a lower lip, up and down motion at a left edge of an upper lip, up and down motion at a right edge of the upper lip, up and down motion at a left edge of the lower lip, up and down motion at a right edge of the lower lip, up and down motion at a center portion of the upper lip, up and down motion at a center portion of the lower lip, a degree of protrusion of the upper lip, a degree of protrusion of the lower lip, a distance from the center of the lip to the right edge of the lip, and a distance from the center of the lip to the left edge of the lip,
and wherein the information on the lip shape is definable in a quantified and normalized pattern in accordance with the position and manner of articulation for each phoneme.
4. The system according to claim 1, wherein said synchronization adjusting means comprises means for calculating time durations of phonemes within a text by using the synchronization information in accordance with a predicted lip shape determined by a position and manner of articulation for each phoneme within a text, a lip shape within the synchronization information, and time durations.
5. The system of claim 2, wherein said synchronization information further includes text.
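The eleven lip-shape quantities listed in claim 3 can be pictured as a fixed-length numeric vector. The sketch below is illustrative only: the field names, the zero defaults, and the use of Euclidean distance to compare a predicted lip shape with one taken from the synchronization information are assumptions for this example and are not specified by the claims.

```python
import math
from dataclasses import dataclass, astuple


@dataclass
class LipShape:
    """Eleven lip-shape values following the order in claim 3.

    Field names are hypothetical; values are assumed to be normalized.
    """
    lower_lip_down: float = 0.0
    upper_left_edge_updown: float = 0.0
    upper_right_edge_updown: float = 0.0
    lower_left_edge_updown: float = 0.0
    lower_right_edge_updown: float = 0.0
    upper_center_updown: float = 0.0
    lower_center_updown: float = 0.0
    upper_protrusion: float = 0.0
    lower_protrusion: float = 0.0
    center_to_right_edge: float = 0.0
    center_to_left_edge: float = 0.0

    def as_vector(self) -> tuple:
        """Return the eleven values as a plain tuple."""
        return astuple(self)


def lip_distance(predicted: LipShape, observed: LipShape) -> float:
    """Euclidean distance between two lip shapes (assumed similarity measure)."""
    return math.dist(predicted.as_vector(), observed.as_vector())


# Usage: compare a neutral predicted shape with an observed open, protruded one.
neutral = LipShape()
observed = LipShape(lower_lip_down=3.0, upper_protrusion=4.0)
difference = lip_distance(neutral, observed)
```

A fixed field order keeps the predicted and observed shapes directly comparable, so any vector distance could be substituted for the one chosen here.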
US10/038,153 1996-12-13 2001-10-19 System for synchronization between moving picture and a text-to-speech converter Expired - Lifetime USRE42000E1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US10/038,153 USRE42000E1 (en) 1996-12-13 2001-10-19 System for synchronization between moving picture and a text-to-speech converter

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
KR96-65445 1996-12-13
KR1019960065445A KR100236974B1 (en) 1996-12-13 1996-12-13 Sync. system between motion picture and text/voice converter
US08/970,224 US5970459A (en) 1996-12-13 1997-11-14 System for synchronization between moving picture and a text-to-speech converter
US10/038,153 USRE42000E1 (en) 1996-12-13 2001-10-19 System for synchronization between moving picture and a text-to-speech converter

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US08/970,224 Reissue US5970459A (en) 1996-12-13 1997-11-14 System for synchronization between moving picture and a text-to-speech converter

Publications (1)

Publication Number Publication Date
USRE42000E1 true USRE42000E1 (en) 2010-12-14

Family

ID=19487716

Family Applications (2)

Application Number Title Priority Date Filing Date
US08/970,224 Expired - Lifetime US5970459A (en) 1996-12-13 1997-11-14 System for synchronization between moving picture and a text-to-speech converter
US10/038,153 Expired - Lifetime USRE42000E1 (en) 1996-12-13 2001-10-19 System for synchronization between moving picture and a text-to-speech converter

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US08/970,224 Expired - Lifetime US5970459A (en) 1996-12-13 1997-11-14 System for synchronization between moving picture and a text-to-speech converter

Country Status (4)

Country Link
US (2) US5970459A (en)
JP (1) JP3599538B2 (en)
KR (1) KR100236974B1 (en)
DE (1) DE19753453B4 (en)

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100240637B1 (en) * 1997-05-08 2000-01-15 정선종 Syntax for tts input data to synchronize with multimedia
US6567779B1 (en) 1997-08-05 2003-05-20 At&T Corp. Method and system for aligning natural and synthetic video to speech synthesis
US7366670B1 (en) * 1997-08-05 2008-04-29 At&T Corp. Method and system for aligning natural and synthetic video to speech synthesis
US7076426B1 (en) * 1998-01-30 2006-07-11 At&T Corp. Advance TTS for facial animation
US6539354B1 (en) 2000-03-24 2003-03-25 Fluent Speech Technologies, Inc. Methods and devices for producing and using synthetic visual speech based on natural coarticulation
US6975988B1 (en) 2000-11-10 2005-12-13 Adam Roth Electronic mail method and system using associated audio and visual techniques
MXPA03010750A (en) * 2001-05-25 2004-07-01 Dolby Lab Licensing Corp High quality time-scaling and pitch-scaling of audio signals.
US20020198716A1 (en) * 2001-06-25 2002-12-26 Kurt Zimmerman System and method of improved communication
CA2393014A1 (en) * 2001-07-11 2003-01-11 Genlyte Thomas Group Llc Switch/power drop unit for modular wiring system
US7694325B2 (en) * 2002-01-31 2010-04-06 Innovative Electronic Designs, Llc Information broadcasting system
JP4127668B2 (en) * 2003-08-15 2008-07-30 株式会社東芝 Information processing apparatus, information processing method, and program
KR100678938B1 (en) * 2004-08-28 2007-02-07 삼성전자주식회사 Apparatus and method for synchronization between moving picture and caption
KR100710600B1 (en) * 2005-01-25 2007-04-24 우종식 The method and apparatus that createdplayback auto synchronization of image, text, lip's shape using TTS
FR2899714B1 (en) * 2006-04-11 2008-07-04 Chinkel Sa FILM DUBBING SYSTEM.
CN101359473A (en) 2007-07-30 2009-02-04 国际商业机器公司 Auto speech conversion method and apparatus
DE102007039603A1 (en) * 2007-08-22 2009-02-26 Siemens Ag Method for synchronizing media data streams
US8451907B2 (en) * 2008-09-02 2013-05-28 At&T Intellectual Property I, L.P. Methods and apparatus to detect transport faults in media presentation systems
FR2969361A1 (en) * 2010-12-16 2012-06-22 France Telecom ENRICHMENT OF THE AUDIO CONTENT OF AN AUDIOVISUAL PROGRAM BY VOICE SYNTHESIS
CN107705784B (en) * 2017-09-28 2020-09-29 百度在线网络技术(北京)有限公司 Text regularization model training method and device, and text regularization method and device
CN109168067B (en) * 2018-11-02 2022-04-22 深圳Tcl新技术有限公司 Video time sequence correction method, correction terminal and computer readable storage medium
US20220392439A1 (en) * 2019-11-18 2022-12-08 Google Llc Rescoring Automatic Speech Recognition Hypotheses Using Audio-Visual Matching
KR102215256B1 (en) 2019-11-18 2021-02-15 주식회사 인공지능연구원 multimedia authoring apparatus with synchronized motion and voice feature and method for the same
CN111741231B (en) * 2020-07-23 2022-02-22 北京字节跳动网络技术有限公司 Video dubbing method, device, equipment and storage medium
KR102479031B1 (en) * 2021-10-25 2022-12-19 주식회사 클레온 A Method and an apparatus for generating mouth shape using deep learning network
CN115278382B (en) * 2022-06-29 2024-06-18 北京捷通华声科技股份有限公司 Video clip determining method and device based on audio clip

Patent Citations (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AT72083B (en) 1912-12-18 1916-07-10 S J Arnheim Attachment for easily interchangeable locks.
US4260229A (en) 1978-01-23 1981-04-07 Bloomstein Richard W Creating visual images of lip movements
US4305131A (en) 1979-02-05 1981-12-08 Best Robert M Dialog between TV movies and human viewers
WO1985004747A1 (en) 1984-04-10 1985-10-24 First Byte Real-time text-to-speech conversion system
EP0225729A1 (en) 1985-11-14 1987-06-16 BRITISH TELECOMMUNICATIONS public limited company Image encoding and synthesis
US4841575A (en) 1985-11-14 1989-06-20 British Telecommunications Public Limited Company Image encoding and synthesis
JPH02234285A (en) 1989-03-08 1990-09-17 Kokusai Denshin Denwa Co Ltd <Kdd> Method and device for synthesizing picture
US5386581A (en) 1989-03-28 1995-01-31 Matsushita Electric Industrial Co., Ltd. Multimedia data editing apparatus including visual graphic display of time information
US5111409A (en) 1989-07-21 1992-05-05 Elon Gasper Authoring and use systems for sound synchronized animation
JPH03241399A (en) 1990-02-20 1991-10-28 Canon Inc Voice transmitting/receiving equipment
DE4101022A1 (en) 1991-01-16 1992-07-23 Medav Digitale Signalverarbeit Variable speed reproduction of audio signal without spectral change - dividing digitised audio signal into blocks, performing transformation, and adding or omitting blocks before reverse transformation
US5630017A (en) 1991-02-19 1997-05-13 Bright Star Technology, Inc. Advanced tools for speech synchronized animation
US5689618A (en) 1991-02-19 1997-11-18 Bright Star Technology, Inc. Advanced tools for speech synchronized animation
JPH04285769A (en) 1991-03-14 1992-10-09 Nec Home Electron Ltd Multi-media data editing method
JPH04359299A (en) 1991-06-06 1992-12-11 Sony Corp Image deformation method based on voice signal
US5313522A (en) 1991-08-23 1994-05-17 Slager Robert P Apparatus for generating from an audio signal a moving visual lip image from which a speech content of the signal can be comprehended by a lipreader
JPH0564171A (en) 1991-09-03 1993-03-12 Hitachi Ltd Digital video/audio signal transmission system and digital audio signal reproduction method
JPH05188985A (en) 1992-01-13 1993-07-30 Hitachi Ltd Speech compression system, communication system, and radio communication device
JPH05313686A (en) 1992-04-02 1993-11-26 Sony Corp Display controller
US5615300A (en) 1992-05-28 1997-03-25 Toshiba Corporation Text-to-speech synthesis with controllable processing time and speech quality
US5677993A (en) 1992-08-31 1997-10-14 Hitachi, Ltd. Information processing apparatus using pointing input and speech input
US5636325A (en) 1992-11-13 1997-06-03 International Business Machines Corporation Speech synthesis and analysis of dialects
US5500919A (en) 1992-11-18 1996-03-19 Canon Information Systems, Inc. Graphics user interface for controlling text-to-speech conversion
US5751906A (en) 1993-03-19 1998-05-12 Nynex Science & Technology Method for synthesizing speech from text and for spelling all or portions of the text by analogy
JPH06326967A (en) 1993-05-12 1994-11-25 Matsushita Electric Ind Co Ltd Data transmission method
US5860064A (en) 1993-05-13 1999-01-12 Apple Computer, Inc. Method and apparatus for automatic generation of vocal emotion in a synthetic text-to-speech system
JPH06348811A (en) 1993-06-07 1994-12-22 Sharp Corp Moving image display device
JPH0738857A (en) 1993-07-16 1995-02-07 Pioneer Electron Corp Synchronization system for time-division video and audio signals
US5557661A (en) 1993-11-02 1996-09-17 Nec Corporation System for coding and decoding moving pictures based on the result of speech analysis
US5608839A (en) 1994-03-18 1997-03-04 Lucent Technologies Inc. Sound-synchronized video system
US5657426A (en) 1994-06-10 1997-08-12 Digital Equipment Corporation Method and apparatus for producing audio-visual synthetic speech
EP0689362A2 (en) 1994-06-21 1995-12-27 AT&T Corp. Sound-synchronised video system
US5650629A (en) 1994-06-28 1997-07-22 The United States Of America As Represented By The Secretary Of The Air Force Field-symmetric beam detector for semiconductors
US5774854A (en) 1994-07-19 1998-06-30 International Business Machines Corporation Text to speech system
EP0706170A2 (en) 1994-09-29 1996-04-10 CSELT Centro Studi e Laboratori Telecomunicazioni S.p.A. Method of speech synthesis by means of concatenation and partial overlapping of waveforms
US5677739A (en) 1995-03-02 1997-10-14 National Captioning Institute System and method for providing described television services
US5777612A (en) 1995-03-20 1998-07-07 Fujitsu Limited Multimedia dynamic synchronization system
US5729694A (en) 1996-02-06 1998-03-17 The Regents Of The University Of California Speech coding, reconstruction and recognition using acoustics and electromagnetic waves
JP4359299B2 (en) 2006-09-13 2009-11-04 Tdk株式会社 Manufacturing method of multilayer ceramic electronic component

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Nakumura et al. "Speech Recognition and Lip Movement Synthesis"; HMM based Audio-Visual Integration; pp. 93-98.
Yamamoto et al. pp. 245-246 Nara Institute of Science and Technology.

Also Published As

Publication number Publication date
DE19753453A1 (en) 1998-06-18
DE19753453B4 (en) 2004-11-18
JP3599538B2 (en) 2004-12-08
KR100236974B1 (en) 2000-02-01
US5970459A (en) 1999-10-19
KR19980047008A (en) 1998-09-15
JPH10171486A (en) 1998-06-26

Legal Events

Date Code Title Description
FPAY Fee payment

Year of fee payment: 12