KR20010038772A

KR20010038772A - Automatic and adaptive synchronization method of image frame using speech duration time in the system integrated with speech and face animation

Info

Publication number: KR20010038772A
Application number: KR1019990046888A
Authority: KR
Inventors: 최창석
Original assignee: 최창석
Priority date: 1999-10-27
Filing date: 1999-10-27
Publication date: 2001-05-15
Also published as: KR100317036B1

Abstract

PURPOSE: An automatic adaptive synchronization method for video frames using speech duration in an integrated speech and facial animation system is provided. It adaptively synchronizes the video of a synthesized face with the speech by calculating the average time of all frames to determine the number of frames to generate in each half-syllable and, when frame generation leaves a surplus or deficit relative to the original duration, carrying it over to the next half-syllable to adjust that duration. CONSTITUTION: When Hangul text is input, the mouth shape parameters of the initial consonant, medial vowel, and final consonant of each syllable are taken as key frame parameters, and the number of video frames per syllable is calculated by dividing the speech duration between the initial and medial (D1) and between the medial and final (D2) by the frame generation time (T0). The parameters for each frame are obtained by selecting parameters from the consonant/vowel mouth shape parameter database, the expression change parameter database, and the head motion parameter database according to the consonant and vowel codes, and interpolating (6-10) them according to the number of video frames. The face image for each frame is synthesized by deforming the shape model of the individual face with the interpolated parameters and then texture mapping (6-11) in real time.

Description

Automatic and adaptive synchronization method of image frame using speech duration time in the system integrated with speech and face animation

The present invention relates to a method of synchronizing speech and facial animation in an integrated system that synthesizes both at the same time. More specifically, it relates to an automatic adaptive synchronization method for video frames using speech duration, in which the number of facial animation frames is adjusted automatically according to the speech duration so that speech and facial animation stay adaptively synchronized.

In general, a system integrating synthesized speech and synthesized facial video is built with the hardware configuration shown in FIG. 1. An image or photograph is input to the computer (1-3) through a camera (1-1) or scanner (1-2). Synthesized speech and a synthesized facial video are then generated, and the video is shown on the display (1-4) while the speech is played through the speaker (1-5) at the same time. Various facial animations are produced through this process.

FIG. 2 is a flowchart of a system integrating synthesized speech and synthesized facial video.

The facial video synthesis unit (20) fits (2-3) a shape model from the three-dimensional face shape model DB (2-2) to a face image from the face DB (2-1), obtaining a three-dimensional shape model (2-4) of the individual face.

Meanwhile, when Hangul text (2-5) is input, the sentence analysis unit (30) detects sentences and word phrases (2-7) via the phonological converter (2-6), converts the consonants and vowels of each syllable into codes, and assigns the durations between those consonants and vowels (2-8).

The adaptive synchronization unit (40) selects parameters (2-10) from the parameter DB (2-9) of the facial video synthesis unit (20) according to the consonant and vowel codes.

The parameter DB (2-9) for facial animation consists of a consonant/vowel mouth shape parameter DB, an expression change parameter DB, and a head motion parameter DB. Each of these holds several kinds of parameters corresponding to movements of the facial muscles. The mouth shape for a consonant or vowel is expressed by selecting a combination of several parameters. Parameters for expressions and head motions are selected in a similar way, but separately from the consonants and vowels, so that the expressions and head motions fit the meaning of the Hangul text.

The selected parameters are interpolated according to the number of video frames (2-12) to obtain the parameters for each frame. The number of frames is calculated (2-11) from the duration between the consonants and vowels of the speech and the frame generation time, which depends on the computer model. Furthermore, the actual synthesis time is monitored frame by frame, and the number of frames per syllable is calculated adaptively according to the time taken. The shape model of the individual face is deformed with the interpolated parameters (2-13), and the face image for each frame is synthesized by real-time texture mapping (2-14) and displayed (2-16).

Meanwhile, the speech is synthesized in real time (2-17) by the real-time speech synthesizer of the speech synthesis unit (50) according to the consonant and vowel codes, and is output together with the facial video (2-18).

FIGS. 3A to 3C illustrate the basic concept of synchronizing synthesized speech with synthesized facial video. The mouth shapes for a Hangul syllable consist either of an initial consonant, a medial vowel, and a final consonant, or of an initial consonant and a medial vowel only. Because synchronization is performed in half-syllable units, the two cases can be treated alike. The basic concept is therefore explained below using a syllable composed of an initial, a medial, and a final as the example.

The mouth shape for a syllable starts from the mouth shape of the initial consonant, changes to the mouth shape of the medial vowel, and ends in the mouth shape of the final consonant. The mouth shapes for the initial, medial, and final of each syllable are determined by selecting the consonant and vowel parameters from a mouth shape parameter DB built in advance.
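The split of a syllable into initial, medial, and final can be illustrated with the standard arithmetic on precomposed Hangul code points. This is only a sketch: the text does not say how its phonological converter performs this step, and the function name `decompose` is hypothetical.

```python
# Jamo tables in Unicode order for precomposed Hangul syllables (U+AC00..U+D7A3).
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"          # 19 initial consonants
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"      # 21 medial vowels
FINALS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 27 finals + "none"


def decompose(syllable):
    """Return (initial, medial, final) for one precomposed Hangul syllable.

    The final is the empty string when the syllable has no final consonant.
    """
    index = ord(syllable) - 0xAC00
    if not 0 <= index < 19 * 21 * 28:
        raise ValueError("not a precomposed Hangul syllable")
    initial, rest = divmod(index, 21 * 28)
    medial, final = divmod(rest, 28)
    return INITIALS[initial], MEDIALS[medial], FINALS[final]


print(decompose("한"))  # three-part syllable: initial, medial, final
print(decompose("가"))  # two-part syllable: the final is empty
```

A syllable with an empty final has only the initial-to-medial transition, which is why treating synchronization per half-syllable covers both cases uniformly.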

In the example of FIGS. 3A to 3C, parameter 2 is assumed to be selected for the initial, parameters 1 and 2 for the medial, and parameters 2 and 3 for the final. The intensity of each parameter is the value stored in the mouth shape parameter DB. With these initial, medial, and final mouth shape parameters as key frame parameters, the speech duration between the initial and medial (D₁) and the duration between the medial and final (D₂) are each divided by the generation time of one frame (T₀) to obtain the number of video frames, and the parameter intensities are then interpolated linearly. The generation time of one frame (T₀) depends on the computer's CPU speed and on the fast texture mapping method. The numbers in FIG. 3A are the video frame numbers.
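The non-adaptive baseline described above (fixed frame count, linear interpolation between key frames) might look like the following sketch. The parameter names `p1`, `p2`, the intensities, and the durations are made-up illustration values, and `interpolate_keyframes` is a hypothetical helper, not something named in the text.

```python
import math


def interpolate_keyframes(start, end, duration, frame_time):
    """Linearly interpolate parameter intensities between two key frames.

    start, end : dicts mapping parameter name -> intensity (missing means 0.0)
    duration   : speech duration between the two key frames (D1 or D2)
    frame_time : expected generation time of one frame (T0)
    """
    # Round half up, matching the rounding described for the frame count.
    n_frames = max(1, math.floor(duration / frame_time + 0.5))
    names = sorted(set(start) | set(end))
    frames = []
    for k in range(1, n_frames + 1):
        w = k / n_frames  # interpolation weight, reaches 1.0 on the last frame
        frames.append({
            name: (1 - w) * start.get(name, 0.0) + w * end.get(name, 0.0)
            for name in names
        })
    return frames


# Initial uses parameter 2 only; medial uses parameters 1 and 2.
initial = {"p2": 0.5}
medial = {"p1": 1.0, "p2": 1.0}
for frame in interpolate_keyframes(initial, medial, duration=90.0, frame_time=30.0):
    print(frame)
```

Because the frame count is fixed up front from T₀, any frame that takes longer or shorter than expected shifts the video against the speech, which is the weakness the adaptive method addresses.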

FIG. 4 shows an example of a video synthesized by interpolating the mouth shape parameters of the initial, medial, and final.

In fast texture mapping, the face shape model is usually a set of triangles, so texture mapping is performed triangle by triangle. Depending on the mouth shape change, expression change, and head motion, a given triangle may or may not have changed since the previous frame. An unchanged triangle is identical to the previous frame, so for speed it is not texture-mapped again and the previous frame's result is reused. As a result the generation time differs from frame to frame, which makes it difficult to predict accurately. This is one of the major reasons why speech and facial video drift out of synchronization over long sentences.
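The idea of re-mapping only the triangles that changed can be sketched as below. This is an illustration under assumptions: the mesh layout (a vertex list plus index triples), the tolerance `eps`, and the function name `changed_triangles` are hypothetical, and the text does not describe how change is actually detected.

```python
from math import dist


def changed_triangles(prev_vertices, curr_vertices, triangles, eps=1e-9):
    """Return indices of triangles with at least one vertex that moved.

    Only these triangles would need texture mapping again; the others can
    reuse the previous frame's result.
    """
    changed = []
    for index, triangle in enumerate(triangles):
        if any(dist(prev_vertices[v], curr_vertices[v]) > eps for v in triangle):
            changed.append(index)
    return changed


# Two triangles sharing an edge; only vertex 3 moves between frames.
prev = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
curr = list(prev)
curr[3] = (1.0, 1.2, 0.0)
mesh = [(0, 1, 2), (1, 2, 3)]
print(changed_triangles(prev, curr, mesh))
```

The number of indices returned varies from frame to frame, so the per-frame cost varies with it. That variation is what makes a fixed frame time estimate unreliable.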

When synchronizing speech and video, synchronization drifts over long sentences for four reasons: differences in frame generation speed between computer CPU models, rounding error when calculating the number of frames, differences in per-frame generation speed depending on mouth shape, expression, and head motion, and irregularity in the computer's clock readings. The difference between CPU models arises because the system is not restricted to a single type of computer and may run on a wide variety of computers or processors. A different model means a different CPU, and a different CPU means a different generation speed. The rounding error arises because a syllable duration is not in general an integer multiple of the generation time of one frame, so generating a whole number of frames leaves a difference from the original duration. The variation in per-frame generation time is, as explained above, one of the biggest causes of drift. Finally, the computer's clock is not uniform at the microsecond (㎲) level and is slightly irregular, so the current time cannot be known exactly. For these reasons, synchronizing speech and video over long sentences requires adaptive synchronization that can absorb these factors.

The present invention was made in view of the above circumstances. Its object is to provide an automatic adaptive synchronization method for video frames using speech duration in an integrated speech and facial animation system. The method keeps speech and facial video synchronized by adaptively correcting the synchronization errors caused by differences in frame generation time between computer models and real-time texture mapping methods, by irregularity in the computer's clock readings, and by half-syllable durations that are not integer multiples of the average frame generation time.

FIG. 1 is a hardware configuration diagram of a system integrating synthesized speech and synthesized video;

FIG. 2 is a flowchart of a real-time system integrating synthesized speech and synthesized facial video;

FIG. 3 is a diagram of the basic concept of synchronizing speech and video for one syllable;

FIG. 4 shows an example of mouth shape changes in the video for the initial, medial, and final of a syllable;

FIG. 5 is a conceptual diagram of adaptive synchronization of speech and video;

FIG. 6 is a flowchart of adaptive synchronization of video generation according to speech duration.

* Description of reference numerals for the main parts of the drawings *

20 -- facial video synthesis unit, 30 -- sentence analysis unit,

40 -- adaptive synchronization unit, 50 -- speech synthesis unit.

When Hangul text is input, the present invention comprises the following steps: calculating the number of video frames for each syllable by taking the mouth shape parameters of the initial, medial, and final as key frame parameters and dividing the speech duration between the initial and medial (D₁) and between the medial and final (D₂) by the generation time of one frame (T₀); selecting parameters from the consonant/vowel mouth shape parameter DB, the expression change parameter DB, and the head motion parameter DB according to the consonant and vowel codes, and interpolating them according to the number of video frames to obtain the parameters for each frame; and deforming the shape model of the individual face with the interpolated parameters and then synthesizing the face image for each frame by real-time texture mapping.

The configuration and operation of an embodiment of the present invention are described in detail below with reference to the accompanying drawings.

FIGS. 5A to 5D are conceptual diagrams explaining the adaptive synchronization adopted in the present invention. In FIG. 5A, D₁ is the duration between the initial and medial, and D₂ is the duration between the medial and final. In FIG. 5A this duration is divided by the average frame generation time T₀ to obtain the initial number of frames N₁:

$$N_1 = \left[ D_1 / T_0 \right] \tag{1}$$

Here [ ] denotes rounding to the nearest integer, and T₀ is the average frame generation time calculated in advance. R₁ is the surplus or deficit time left after a whole number of frames has been generated. It arises because the number of frames is rounded.

FIG. 5B shows the number of frames still to be generated, recalculated after the first frame has been generated:

$$N_2 = \left[ (D_1 - T_0') / T_1 \right] \tag{2}$$

Here T₀' is the time actually taken to generate the first frame, and T₁, the generation time of the first frame, is equal to T₀'. If T₀' is shorter than expected, the frame count N₂ can increase. If it is longer than expected, N₂ can decrease.
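Equations (1) and (2) can be checked with a few lines of arithmetic. The sketch below uses made-up durations in milliseconds, and the residual formula `d1 - n1 * t0` is an assumed reading of R₁ (the time left over after a whole number of frames), which the text describes only in words.

```python
import math


def round_half_up(x):
    """Round to the nearest integer, with halves rounded up."""
    return math.floor(x + 0.5)


def initial_frame_count(d1, t0):
    """Equation (1): frames planned before any frame has been generated."""
    return round_half_up(d1 / t0)


def remaining_frame_count(d1, first_frame_time):
    """Equation (2): frames still to generate once the first frame is done.

    After one frame, the average frame time T1 equals the measured time T0'.
    """
    return round_half_up((d1 - first_frame_time) / first_frame_time)


d1, t0 = 100.0, 30.0
n1 = initial_frame_count(d1, t0)
print(n1, d1 - n1 * t0)                    # planned frames and leftover time
print(remaining_frame_count(d1, 20.0))     # first frame faster than expected
print(remaining_frame_count(d1, 40.0))     # first frame slower than expected
```

With a planned count of 3, two more frames would be expected after the first. A fast first frame raises the remaining count and a slow one keeps or lowers it, which is the behavior described for N₂.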

FIG. 5C shows the frame count adjusted after the second frame has been generated. For adaptive synchronization, each time a frame is generated, the average time of all frames generated since system start is recalculated:

$$T_N = \frac{1}{N} \sum_{k=0}^{N-1} T_k' \tag{3}$$

where T_k' is the time actually taken to generate the (k+1)-th frame.

N is the number of frames generated from system start to the present. The surplus or deficit time R_N left after a whole number of frames has been generated is then calculated. It is not calculated in advance but, as in FIG. 5D, after the last frame of the half-syllable has been generated:

$$R_N = D_1 - Q_1 \tag{4}$$

where Q₁ is the time actually taken to synthesize the frames of the first half-syllable.

When frame generation for the duration D₁ between the initial and medial is finished, D₂ is adjusted:

$$D_2' = D_2 + R_N \tag{5}$$

D₂' is the adjusted D₂ and is used as the duration between the medial and final. In general, whenever generating the frames of a half-syllable leaves a surplus or deficit relative to its original duration, that time is carried over to the next half-syllable and its duration is adjusted slightly. Because the number of frames is adjusted adaptively, the parameters are interpolated using the generation time elapsed from the start of the half-syllable to the present:

$$A_N = (A_1 - A_0)\,\frac{Q_i + T_N}{D_i'} + A_0 \tag{6}$$

Here A_N is the parameter intensity of the frame to be generated now. A₀ and A₁ are the parameter intensities at the start and end of the i-th half-syllable. Q_i is the synthesis time from the start of the i-th half-syllable to the most recently generated frame. D_i' is the adjusted duration of the i-th half-syllable.
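The time-based interpolation can be sketched as follows, using the rule stated in claim 3: the intensity difference is scaled by (Q_i + T_N) / D_i' and added to the starting intensity. The function name `adaptive_intensity` and the clamp to the end value are assumptions added for illustration. The text does not say what happens if the elapsed time overshoots the adjusted duration.

```python
def adaptive_intensity(a_start, a_end, q_i, t_avg, d_adj):
    """Parameter intensity for the frame about to be generated.

    a_start, a_end : intensities at the start and end of the half-syllable
    q_i            : synthesis time spent so far in this half-syllable
    t_avg          : running average generation time per frame (T_N)
    d_adj          : adjusted duration of this half-syllable (D_i')
    """
    # The frame being generated will be shown about one average frame time
    # after the time already spent, hence q_i + t_avg.
    weight = (q_i + t_avg) / d_adj
    weight = min(1.0, weight)  # assumption: never overshoot the end key frame
    return a_start + (a_end - a_start) * weight


# A 120 ms half-syllable rendered at about 30 ms per frame.
for elapsed in (0.0, 30.0, 60.0, 90.0):
    print(elapsed, adaptive_intensity(0.0, 1.0, elapsed, 30.0, 120.0))
```

Because the weight depends on measured time rather than on a frame index, a slow frame automatically advances the mouth shape further on the next frame instead of letting the video lag.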

The adaptive synchronization proceeds in the order shown in FIG. 6. With adaptive synchronization, the difference between the speech duration and the video frame generation time does not exceed T_N/2 for any half-syllable.

(1) Start the system (step 6-1).

(2) Read from disk the average frame generation time T₀ from the most recent use of the program (step 6-2). When the program has just been ported to a computer, this is the average frame generation time of the computer it ran on before porting. As the program is used, T₀ adapts automatically to the frame generation period T_N of the computer in use.

(3) Read the number of half-syllables M in the input text (step 6-2).

(4) Initialize the total number of synthesized frames N, the half-syllable index i, the cumulative half-syllable duration D, and the synthesis surplus or deficit time R₀ (step 6-3).

(5) Record the synthesis start time P₀ (step 6-4).

(6) Read the duration D_i of the i-th half-syllable (step 6-5).

(7) Accumulate the duration D_i (step 6-6): D = D + D_i.

(8) Adjust the duration D_i of the i-th half-syllable by the surplus or deficit time from the (i-1)-th half-syllable (step 6-7): D_i' = D_i + R_{i-1}.

(9) Initialize the frame index j within the half-syllable (step 6-8).

(10) Record the start time P_i of the i-th half-syllable (step 6-9).

(11) Interpolate the parameters using Equation (6) (step 6-10).

(12) Deform the face shape model according to the interpolated parameters and texture-map it in real time (step 6-11).

(13) Record the time P_i' after synthesis (step 6-12).

(14) Calculate the time Q elapsed from the start of synthesis up to the current frame of the i-th half-syllable (step 6-13): Q = P_i' - P₀.

(15) Compare the cumulative duration D up to the i-th half-syllable with the elapsed synthesis time Q (step 6-14). If the comparison shows that synthesis of the current half-syllable should continue, go on to the next step. Otherwise, go to step 6-18 to move on to the next half-syllable.

(16) Increment the total number of generated frames N and the frame index j by 1 (step 6-15).

(17) Divide the frame generation time Q accumulated since system start by the total number of frames N to obtain the average generation time per frame T_N (step 6-16): T_N = Q / N. Because the average time per frame is recalculated at every frame, the method adapts automatically to the speed of the computer's CPU and of the real-time texture mapping.

(18) Calculate the synthesis time Q_i of the frames generated from the start of the i-th half-syllable to the present (step 6-17): Q_i = P_i' - P_i.

(19) Compare the number of half-syllables M in the input text with the current half-syllable number i (step 6-18). If half-syllables remain to be synthesized, go on to the next step. Otherwise, go to step 6-21.

(20) Calculate the remaining time of the i-th half-syllable, then proceed to step (5) (step 6-19).

(21) Increment the half-syllable number i by 1 (step 6-20).

(22) Save the average frame generation time T_N (step 6-21).

(23) Shut down the system (step 6-22).
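The steps above can be sketched as a small simulation. This is a minimal sketch, not the patent's implementation. The function name `run_adaptive_sync` is hypothetical, per-frame costs come from an iterator instead of real rendering and clock readings, and the stopping rule (finish a half-syllable once less than half an average frame time remains) is an assumption chosen to match the stated T_N/2 bound.

```python
from itertools import cycle


def run_adaptive_sync(durations, frame_costs, t_avg):
    """Simulate adaptive frame generation over a sequence of half-syllables.

    durations   : speech duration of each half-syllable (D_i)
    frame_costs : iterator yielding the simulated time each frame takes
    t_avg       : stored average frame time from an earlier run (T_0)
    Returns (frames per half-syllable, final carry-over, final average).
    """
    clock = 0.0         # time since synthesis start (Q)
    total_frames = 0    # frames generated so far (N)
    carry = 0.0         # surplus or deficit from the previous half-syllable
    frames_per_unit = []

    for duration in durations:
        d_adj = duration + carry   # adjusted duration (D_i')
        q_i = 0.0                  # time spent inside this half-syllable (Q_i)
        count = 0
        while True:
            cost = next(frame_costs)   # stands in for deform + texture map
            clock += cost
            q_i += cost
            total_frames += 1
            count += 1
            t_avg = clock / total_frames   # running average (T_N)
            # Assumed rule: stop when the time left rounds to zero frames.
            if d_adj - q_i < t_avg / 2:
                break
        carry = d_adj - q_i        # carried into the next half-syllable
        frames_per_unit.append(count)

    return frames_per_unit, carry, t_avg


frames, carry, average = run_adaptive_sync([100.0, 100.0], cycle([30.0]), 25.0)
print(frames, carry, average)
```

In this run the first half-syllable ends 10 ms early, so the second is given 110 ms and receives one more frame. The total drift stays within half a frame time instead of accumulating from one half-syllable to the next.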

As described above, the present invention calculates the average time of all frames to determine the number of frames to generate in each half-syllable and, when frame generation leaves a surplus or deficit relative to the half-syllable's original duration, carries it over to the next half-syllable to adjust that duration. This has the effect of adaptively synchronizing the video of the synthesized face with the speech.

Claims

An automatic adaptive synchronization method for video frames using speech duration in an integrated speech and facial animation system, comprising: when Hangul text is input, calculating the number of video frames for each syllable by taking the mouth shape parameters of the initial, medial, and final as key frame parameters and dividing the speech duration between the initial and medial (D₁) and the speech duration between the medial and final (D₂) by the generation time of one frame (T₀); selecting parameters from a consonant/vowel mouth shape parameter DB, an expression change parameter DB, and a head motion parameter DB according to the consonant and vowel codes, and interpolating them according to the number of video frames to obtain the parameters for each frame; and deforming the shape model of the individual face using the interpolated parameters and then synthesizing the face image for each frame by real-time texture mapping.

The method of claim 1, wherein calculating the number of video frames for each syllable comprises: calculating an initial number of frames (N₁) by dividing the duration D₁ between the initial and medial by the precalculated average frame generation time T₀ and rounding the result; recalculating the number of frames N₂ by dividing the duration remaining after generation of the first frame (D₁ - T₀') by the actual generation time T₁' of the first frame; adjusting the number of frames again after the second frame is generated and, for adaptive synchronization, repeating at each frame generation the recalculation of the average time T_N of all frames generated since system start and of the number of frames, until the last frame of the half-syllable is generated; calculating the surplus or deficit time (R_N) after the frames for the duration (D₁) between the initial and medial have been generated; and adding that surplus or deficit time (R_N) to the duration between the medial and final (D₂) to obtain the adjusted duration (D₂'), and then performing the same steps as for the frames between the initial and medial, so that when generating the frames of a half-syllable leaves a surplus or deficit relative to its original duration, it is carried over to the next half-syllable to adjust that duration.

The method of claim 1, wherein, for adaptive synchronization, the parameters are interpolated by: taking the difference (A₁ - A₀) between the parameter intensity A₁ at the end of the i-th half-syllable and the parameter intensity A₀ at its start; multiplying it by the sum (Q_i + T_N) of the synthesis time (Q_i) from the start of the i-th half-syllable to the most recently generated frame and the average time (T_N) of all frames generated since system start; dividing the result by the adjusted duration (D_i') of the i-th half-syllable; and adding the parameter intensity (A₀) at the start of the i-th half-syllable, the result being used as the parameter intensity of the frame to be generated now.