MXPA00010044A - Wavelet-based facial motion capture for avatar animation - Google Patents

Wavelet-based facial motion capture for avatar animation

Info

Publication number
MXPA00010044A
MXPA00010044A MXPA/A/2000/010044A
Authority
MX
Mexico
Prior art keywords
image
facial
detecting
node
features
Prior art date
Application number
MXPA/A/2000/010044A
Other languages
Spanish (es)
Inventor
Thomas Maurer
Egor Valerievich Elagin
Luciano Pasquale Agostino Nocera
Johannes Bernhard Steffens
Hartmut Neven
Original Assignee
Egor Valerievich Elagin
Eyematic Interfaces Inc
Thomas Maurer
Hartmut Neven
Luciano Pasquale Agostino Nocera
Johannes Bernhard Steffens
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Egor Valerievich Elagin, Eyematic Interfaces Inc, Thomas Maurer, Hartmut Neven, Luciano Pasquale Agostino Nocera, Johannes Bernhard Steffens filed Critical Egor Valerievich Elagin
Publication of MXPA00010044A publication Critical patent/MXPA00010044A/en


Abstract

The present invention is embodied in an apparatus, and related method, for sensing a person's facial movements, features and characteristics and the like to generate and animate an avatar image based on facial sensing. The avatar apparatus uses an image processing technique based on model graphs and bunch graphs that efficiently represent image features as jets. The jets are composed of wavelet transforms processed at node or landmark locations on an image corresponding to readily identifiable features. The nodes are acquired and tracked to animate an avatar image in accordance with the person's facial movements. Also, the facial sensing may use jet similarity to determine the person's facial features and characteristics, thus allowing tracking of a person's natural characteristics without any unnatural elements that may interfere with or inhibit the person's natural characteristics.

Description

WAVELET-BASED FACIAL MOTION CAPTURE FOR AVATAR ANIMATION
FIELD OF THE INVENTION The present invention relates to dynamic facial feature sensing and, more particularly, to a vision-based motion capture system that allows real-time finding, tracking and classification of facial features for input into a graphics program that animates an avatar.
BACKGROUND OF THE INVENTION Virtual spaces filled with avatars are an attractive way to allow for the experience of a shared environment. However, existing shared environments generally lack facial feature sensing of sufficient quality to allow the incarnation of a user, i.e., providing the avatar with the look, expressions or gestures of the user. Quality facial feature sensing is a significant advantage because facial gestures are a primary means of communication. Accordingly, the incarnation of a user increases the attractiveness of virtual spaces. Existing methods of facial feature sensing typically use markers that are attached to a person's face. The use of markers for facial motion capture is cumbersome and has generally restricted the use of facial motion capture to high-cost applications such as movie production.
Accordingly, there is a significant need for vision-based motion capture systems that implement convenient and efficient facial feature sensing. The present invention satisfies this need.
BRIEF DESCRIPTION OF THE INVENTION The present invention is embodied in an apparatus, and related method, for sensing a person's facial movements, features and characteristics. The results of the facial sensing may be used to animate an avatar image. The avatar apparatus uses an image processing technique based on model graphs and bunch graphs that efficiently represent image features as jets composed of wavelet transforms at landmarks on a facial image corresponding to readily identifiable features. The sensing system allows tracking of a person's natural characteristics without any unnatural elements interfering with the person's natural characteristics. The feature sensing process operates on a sequence of image frames, transforming each image frame using a wavelet transformation to generate a transformed image frame. Node locations associated with wavelet jets of a model graph are initialized on the transformed image frame by moving the model graph across the transformed image frame and placing the model graph at a position in the transformed image frame of maximum jet similarity between the wavelet jets at the node locations and the transformed image frame. The location of one or more node locations of the model graph is tracked between image frames. A tracked node is reinitialized if the node's position deviates beyond a predetermined position constraint between image frames. In one embodiment of the invention, the facial feature finding may be based on elastic bunch graph matching for individualizing a head model. Also, the model graph for facial image analysis may include a plurality of location nodes (e.g., 18) associated with distinguishing features on a human face. Other features and advantages of the present invention will become apparent from the following description of the preferred embodiments, taken in conjunction with the accompanying drawings, which illustrate, by way of example, the principles of the invention.
BRIEF DESCRIPTION OF THE DRAWINGS Figure 1 is a block diagram of an avatar animation system and process, according to the invention. Figure 2 is a block diagram of an apparatus and process for facial feature sensing, according to the invention, for the avatar animation system and process of Figure 1. Figure 3 is a block diagram of a video image processor for implementing the facial feature sensing apparatus of Figure 2. Figure 4 is a flow diagram, with accompanying photographs, illustrating a landmark finding technique of the facial feature sensing apparatus and system of Figure 2. Figure 5 is a series of images showing the processing of a facial image using Gabor wavelets, according to the invention.
Figure 6 is a series of graphs showing the construction of a jet, an image graph and a bunch graph using the wavelet processing technique of Figure 5, according to the invention. Figure 7 is a diagram of a model graph, according to the invention, for processing facial images. Figure 8 includes two diagrams showing the use of wavelet processing to locate a facial feature. Figure 9 is a flow chart showing a tracking technique for tracking landmarks found by the landmark finding technique of Figure 4. Figure 10 is a diagram of a Gaussian image pyramid technique, illustrating landmark tracking in one dimension. Figure 11 is a series of two facial images, with accompanying graphs of pose angle versus frame number, showing tracking of facial features over a sequence of 50 image frames. Figure 12 is a flow diagram, with accompanying photographs, illustrating a pose estimation technique of the facial feature sensing apparatus and system of Figure 2. Figure 13 is a schematic diagram of a face with extracted eye and mouth regions, illustrating a coarse-to-fine landmark finding technique. Figure 14 includes photographs showing the extraction of the profile and facial features using the elastic bunch graph technique of Figure 6. Figure 15 is a flow diagram showing the generation of a tagged personalized bunch graph together with a corresponding gallery of image patches encompassing a variety of a person's expressions, for avatar animation, in accordance with the invention.
Figure 16 is a flow diagram showing a technique for animating an avatar using image patches that are transmitted to a remote site and that are selected at the remote site based on transmitted tags derived from the facial sensing associated with a person's current facial expressions. Figure 17 is a flow diagram showing the generation of a three-dimensional head image, based on facial feature positions and tags, using volume morphing integrated with dynamic texture generation. Figure 18 is a block diagram of an avatar animation system, according to the invention, that includes audio analysis for animating an avatar.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS The present invention is embodied in an apparatus, and related method, for sensing a person's facial movements, features and characteristics and the like to generate and animate an avatar image based on the facial sensing. The avatar apparatus uses an image processing technique based on model graphs and bunch graphs that efficiently represent image features as jets. The jets are composed of wavelet transforms processed at node or landmark locations on an image corresponding to readily identifiable features. The nodes are acquired and tracked to animate an avatar image in accordance with the person's facial movements. Also, jet similarity may be used to determine the person's facial features and characteristics, thus allowing tracking of the person's natural characteristics without any unnatural elements that may interfere with the person's natural characteristics. As shown in Figure 1, the avatar animation system 10 of the invention includes an imaging system 12, a facial sensing process 14, a data communication network 16, a facial animation process 18 and an avatar display 20. The imaging system acquires and digitizes a live video image signal of a person, thus generating a stream of digitized video data organized into image frames. The digitized video image data is provided to the facial sensing process, which locates the person's face and the corresponding facial features in each frame. The facial sensing process also tracks the positions and characteristics of the facial features from frame to frame. The tracking information may be transmitted via the network to one or more remote sites, which receive the information and generate, using a graphics program, an animated facial image on the avatar display. The animated facial image may be based on a photorealistic model of the person, a cartoon character, or a face completely unrelated to the user. The imaging system 12 and the facial sensing process 14 are shown in more detail in Figures 2 and 3. The imaging system captures the person's image using a digital video camera 22, which generates a stream of video image frames. The video image frames are transferred to a video random-access memory (VRAM) 24 for processing. A satisfactory imaging system is the Matrox Meteor II, available from Matrox®, which digitizes the images produced by a conventional CCD camera and transfers the images in real time into the memory at a frame rate of 30 Hz. The image frame is processed by an image processor 26 having a central processing unit (CPU) coupled to the VRAM and to a random-access memory (RAM) 30. The RAM stores program code and data for implementing the facial sensing and avatar animation processes. The facial feature process operates on the digitized images to find the person's facial features (block 32), track the features (block 34), and reinitialize feature tracking as needed. The facial features may also be classified (block 36). The facial feature process generates data associated with the position and classification of the facial features, which is provided to an interface with the facial animation process (block 38). The facial features may be located using elastic graph matching, shown in Figure 4. In the elastic graph matching technique, a captured image (block 40) is transformed into Gabor space using a wavelet transformation
(block 42), which is described below in more detail with respect to Figure 5. The transformed image (block 44) is represented by 40 complex values, representing the wavelet components, for each pixel of the original image. Next, a rigid copy of a model graph, described in more detail below with respect to Figure 7, is positioned over the transformed image at varying model node positions to locate a position of optimum similarity (block 46). The search for optimum similarity may be performed by positioning the model graph in the upper-left corner of the image, extracting the jets at the nodes, and determining the similarity between the image and the model graph. The search continues by sliding the model graph from left to right, starting from the upper-left corner of the image (block 48).
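The following is a minimal sketch of this rigid scanning step, not an implementation from the patent. The helpers `extract_jet_at` and `jet_similarity` are hypothetical placeholders (variants are sketched further below), and taking the mean of the node similarities as the graph similarity is an assumption.

```python
import numpy as np

def scan_model_graph(extract_jet_at, jet_similarity, model_jets, node_offsets,
                     image_shape, step=4):
    """Slide a rigid model graph over the transformed image and return the
    top-left placement with maximum graph similarity (here: mean jet similarity).

    extract_jet_at(x, y) -> jet at that image point (hypothetical helper)
    jet_similarity(a, b) -> scalar similarity between two jets
    model_jets           -> one reference jet per model node
    node_offsets         -> (dx, dy) of each node relative to the graph origin
    """
    h, w = image_shape
    max_dx = max(dx for dx, _ in node_offsets)
    max_dy = max(dy for _, dy in node_offsets)
    best_pos, best_sim = None, -np.inf
    for y0 in range(0, h - max_dy, step):            # scan top to bottom
        for x0 in range(0, w - max_dx, step):        # ... and left to right
            sims = [jet_similarity(extract_jet_at(x0 + dx, y0 + dy), mj)
                    for (dx, dy), mj in zip(node_offsets, model_jets)]
            sim = float(np.mean(sims))
            if sim > best_sim:
                best_sim, best_pos = sim, (x0, y0)
    return best_pos, best_sim
```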
When a rough position of the face is found (block 50), the nodes are allowed to move individually, introducing elastic graph distortions (block 52). A phase-insensitive similarity function is used in order to locate a good match (block 54). A phase-sensitive similarity function is then used to position a jet accurately, because the phase is very sensitive to small jet displacements. The phase-insensitive and phase-sensitive similarity functions are described below with respect to Figures 5-8. Note that although the graphs are shown in Figure 4 with respect to the original image, the model graph movements and matching are actually performed on the transformed image. The wavelet transform is described with reference to Figure 5. An original image is processed using a Gabor wavelet to generate a convolution result. The Gabor-based wavelet consists of a two-dimensional complex wave field modulated by a Gaussian envelope.
The Gabor wavelet has the form

$$\psi_{\mathbf{k}}(\mathbf{x}) = \frac{k^2}{\sigma^2}\exp\!\left(-\frac{k^2 x^2}{2\sigma^2}\right)\left[\exp(i\,\mathbf{k}\cdot\mathbf{x}) - \exp\!\left(-\frac{\sigma^2}{2}\right)\right] \tag{1}$$

i.e., a plane wave with wave vector k, restricted by a Gaussian envelope whose size relative to the wavelength is parameterized by σ. The term in the brackets removes the DC component. The magnitude of the wave vector k may be chosen as follows, where ν is related to the desired spatial resolutions:

$$k_\nu = 2^{-\frac{\nu+2}{2}}\,\pi, \qquad \nu = 1, 2, \ldots \tag{2}$$
A wavelet centered at image position x is used to extract the wavelet component J from the image with gray-level distribution I(x):

$$J_j(\mathbf{x}) = \int I(\mathbf{x}')\,\psi_{\mathbf{k}_j}(\mathbf{x}-\mathbf{x}')\,d^2x' \tag{3}$$

The space of wave vectors k is typically sampled in a discrete hierarchy of 5 resolution levels (differing by half-octaves) and 8 orientations at each resolution level (see, e.g., Figure 8), thus generating 40 complex values for each sampled image point (the real and imaginary components referring to the cosine and sine phases of the plane wave). The samples in k-space are designated by the index j = 1, ..., 40, and all wavelet components centered on a single image point are considered as a vector, which is called a jet 60, shown in Figure 6. Each jet describes the local features of the area surrounding x. If sampled with sufficient density, the image may be reconstructed from jets within the bandpass covered by the sampled frequencies. Thus, each component of a jet is the filter response of a Gabor wavelet extracted at a point (x, y) of the image. A labeled image graph 62, as shown in Figure 6, is used to describe the appearance of an object (in this context, a face). The nodes 64 of the labeled graph refer to points on the object and are labeled by jets 60. The edges 66 of the graph are labeled with distance vectors between the nodes. Nodes and edges define the graph topology. Graphs with the same topology can be compared. The normalized dot product of the absolute components of two jets defines the jet similarity. This value is independent of contrast changes. To compute the similarity between two graphs, the sum is taken over the similarities of corresponding jets between the graphs. A model graph 68, shown in Figure 7, is designed particularly to find a human face in an image. The numbered nodes of the graph have the following locations:

0 pupil of the right eye
1 pupil of the left eye
2 top of the nose
3 right corner of the right eyebrow
4 left corner of the right eyebrow
5 right corner of the left eyebrow
6 left corner of the left eyebrow
7 right nostril
8 tip of the nose
9 left nostril
10 right corner of the mouth
11 center of the upper lip
12 left corner of the mouth
13 middle of the lower lip
14 bottom of the right ear
15 top of the right ear
16 top of the left ear
17 bottom of the left ear

To represent a face, a data structure called a bunch graph 70 is used (Figure 6). It is similar to the graph described above, but instead of attaching only a single jet to each node, a whole bunch of jets 72 (a bunch jet) is attached to each node. Each jet is derived from a different facial image. To form a bunch graph, a collection of facial images (the bunch graph gallery) is marked with node locations at defined positions of the head. These defined positions are called landmarks. When a bunch graph is matched to an image, the jet extracted from the image is compared to all jets in the corresponding bunch attached to the bunch graph, and the best-matching one is selected. This matching process is called elastic bunch graph matching. When constructed using a properly selected gallery, a bunch graph covers a great variety of faces that may have significantly different local properties, e.g., samples of female and male faces, and of persons of different ages or races.
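Under stated assumptions, the sketch below illustrates how a 40-component Gabor jet of the kind just described might be computed with NumPy. The kernel form follows equation (1); the values σ = 2π, the 33×33 kernel size, and the level ordering are illustrative choices of this sketch, not values fixed by the text.

```python
import numpy as np

def gabor_kernel(kx, ky, sigma=2.0 * np.pi, size=33):
    """2-D complex Gabor wavelet: a plane wave with wave vector (kx, ky),
    restricted by a Gaussian envelope, with the DC term subtracted (eq. (1))."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    ksq = kx * kx + ky * ky
    envelope = (ksq / sigma ** 2) * np.exp(-ksq * (x * x + y * y) / (2.0 * sigma ** 2))
    wave = np.exp(1j * (kx * x + ky * y)) - np.exp(-sigma ** 2 / 2.0)  # subtract DC
    return envelope * wave

def gabor_family(levels=5, orientations=8):
    """40 kernels: 5 frequency levels (half-octave spacing, eq. (2)) times
    8 orientations.  Returns the kernels and their wave vectors k_j."""
    kernels, wave_vectors = [], []
    for nu in range(levels):
        k = 2.0 ** (-(nu + 2) / 2.0) * np.pi
        for mu in range(orientations):
            phi = mu * np.pi / orientations
            kx, ky = k * np.cos(phi), k * np.sin(phi)
            kernels.append(gabor_kernel(kx, ky))
            wave_vectors.append((kx, ky))
    return kernels, np.array(wave_vectors)

def extract_jet(image, x, y, kernels):
    """Jet at image point (x, y): the vector of 40 complex filter responses.
    Assumes (x, y) lies far enough from the border for a full-size patch."""
    jet = []
    for ker in kernels:
        half = ker.shape[0] // 2
        patch = image[y - half:y + half + 1, x - half:x + half + 1]
        # convolution (3) evaluated at the single point (x, y)
        jet.append(np.sum(patch * ker[::-1, ::-1]))
    return np.array(jet)
```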
Again, in order to find a face in an image frame, the graph is moved, scaled and distorted until a place is found at which the graph matches best (the best-fitting jets within the bunch jets are most similar to the jets extracted from the image at the current node positions). Since the face features differ from one face to another, the graph should be made more general for this task, e.g., each node should be assigned jets of the corresponding landmark taken from 10 to 100 individual faces. Two different jet similarity functions are used for two different, or even complementary, tasks. If the components of a jet J are written with amplitude a_j and phase φ_j, one form for the similarity of two jets J and J' is the normalized scalar product of the amplitude vectors:

$$S(J, J') = \frac{\sum_j a_j a'_j}{\sqrt{\sum_j a_j^2 \,\sum_j a_j'^2}} \tag{4}$$

The other similarity function has the form

$$S(J, J', \mathbf{d}) = \frac{\sum_j a_j a'_j \cos(\phi_j - \phi'_j - \mathbf{d}\cdot\mathbf{k}_j)}{\sqrt{\sum_j a_j^2 \,\sum_j a_j'^2}} \tag{5}$$

This function includes a relative displacement vector d between the image points to which the two jets refer. When two jets are compared during graph matching, the similarity between them is maximized with respect to d, which leads to an accurate determination of jet position.
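A direct transcription of the two similarity functions (4) and (5) as a NumPy sketch; `wave_vectors` is assumed to hold the k_j of the kernels that produced the jets, in the same order as the jet coefficients.

```python
import numpy as np

def similarity_magnitude(jet1, jet2):
    """Phase-insensitive similarity, eq. (4): normalized scalar product
    of the amplitude vectors of the two jets."""
    a1, a2 = np.abs(jet1), np.abs(jet2)
    return float(np.dot(a1, a2) / np.sqrt(np.dot(a1, a1) * np.dot(a2, a2)))

def similarity_phase(jet1, jet2, d, wave_vectors):
    """Phase-sensitive similarity, eq. (5): includes the relative
    displacement vector d between the image points the jets refer to.

    wave_vectors: array of shape (n_coeffs, 2) holding the k_j
    d:            2-vector (dx, dy)
    """
    a1, a2 = np.abs(jet1), np.abs(jet2)
    phase_term = np.cos(np.angle(jet1) - np.angle(jet2)
                        - wave_vectors @ np.asarray(d, dtype=float))
    return float(np.sum(a1 * a2 * phase_term) /
                 np.sqrt(np.sum(a1 * a1) * np.sum(a2 * a2)))
```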
Both similarity functions are used, with preference often given to the phase-insensitive version (which varies smoothly with relative position) when matching a graph, and to the phase-sensitive version when a jet is to be positioned accurately. After the facial features are located, the features can be tracked over consecutive frames, as illustrated in Figure 9. The tracking technique of the invention achieves robust tracking over long frame sequences by using a tracking correction scheme that detects whether tracking of a feature or node has been lost and reinitializes the tracking process for that node. The position X_n of a single node in an image I_n of an image sequence is known either by landmark finding on image I_n using the landmark finding method (block 80) described above, or by tracking the node from image I_(n-1) to I_n using the tracking process. The node is then tracked (block 82) to a corresponding position X_(n+1) in the image I_(n+1) by one of several techniques. The tracking methods described below advantageously accommodate fast motion. A first tracking technique involves linear motion prediction. The search for the corresponding node position X_(n+1) in the new image I_(n+1) is started at a position generated by a motion estimator. A disparity vector D_n = X_n - X_(n-1) is calculated, representing the displacement, assuming constant velocity, of the node between the two preceding frames. The disparity or displacement vector D_n may be added to the position X_n to predict the node position X_(n+1). This linear motion model is particularly advantageous for accommodating constant-velocity motion. The linear motion model also provides good tracking if the frame rate is high compared to the acceleration of the objects being tracked. However, the linear motion model performs poorly if the frame rate is too low compared to the acceleration of the objects in the image sequence. Because it is difficult for any motion model to track objects under such conditions, use of a camera having a higher frame rate is recommended. The linear motion model may generate an estimated motion vector D_n that is too large, which could lead to an accumulation of error in the motion estimation. Accordingly, the linear prediction may be damped using a damping factor f_D. The resulting estimated motion vector is D_n = f_D * (X_n - X_(n-1)). A suitable damping factor is 0.9. If there is no previous frame I_(n-1), e.g., for a frame immediately after landmark finding, the estimated motion vector is set equal to zero (D_n = 0).
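A small sketch of the damped linear motion estimator just described (damping factor 0.9, and D_n = 0 immediately after landmark finding); the function names are illustrative.

```python
import numpy as np

def predict_node_position(x_n, x_prev=None, damping=0.9):
    """Damped linear motion estimate for the search start in frame I_(n+1):
    D_n = damping * (X_n - X_(n-1)); D_n = 0 when no previous frame exists."""
    x_n = np.asarray(x_n, dtype=float)
    if x_prev is None:
        d_n = np.zeros_like(x_n)               # right after landmark finding
    else:
        d_n = damping * (x_n - np.asarray(x_prev, dtype=float))
    return x_n + d_n, d_n                      # predicted X_(n+1) and D_n
```

The predicted position would then serve as the starting point for the pyramid-based search described next.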
Figure 10 illustrates a tracking technique based on a Gaussian image pyramid, applied to one dimension. Instead of using the original image resolution, the image is down-sampled 2-4 times to create a Gaussian pyramid of the image. An image pyramid of 4 levels results in a distance of 24 pixels on the finest, original-resolution level being represented as only 3 pixels on the coarsest level. Jets may be computed and compared at any level of the pyramid. Tracking of a node in the Gaussian image pyramid is generally performed first at the coarsest level and then proceeding to the finest level. A jet is extracted on the coarsest Gauss level of the actual image frame I_(n+1) at the position X_(n+1), using the damped linear motion estimate X_(n+1) = X_n + D_n as described above, and compared with the corresponding jet computed on the coarsest Gauss level of the previous image frame. From these two jets, the disparity is determined, i.e., the 2D vector R pointing from X_(n+1) to the position that best corresponds to the jet from the previous frame. This new position is assigned to X_(n+1). The disparity calculation is described below in more detail. The position on the next finer Gauss level of the actual image (being 2*X_(n+1)), corresponding to the position X_(n+1) on the coarsest Gauss level, is the starting point for the disparity computation on this next finer level. The jet extracted at this point is compared to the corresponding jet computed on the same Gauss level of the previous image frame. This process is repeated for all Gauss levels until the finest resolution level is reached, or until the Gauss level specified for determining the node position corresponding to the previous frame's position is reached. Figure 10 shows two representative levels of the Gaussian image pyramid, a coarser level 94 above and a finer level 96 below. It is assumed that each jet has filter responses for two frequency levels. Starting at position 1 on the coarsest Gauss level, X_(n+1) = X_n + D_n, a first disparity move using only the lowest-frequency jet coefficients leads to position 2. A second disparity move using all jet coefficients of both frequency levels leads to position 3, the final position on this Gauss level. Position 1 on the finer Gauss level corresponds to position 3 on the coarser level, with the coordinates doubled. The disparity move sequence is repeated, and position 3 on the finest Gauss level is the final position of the tracked landmark. For more accurate tracking, the number of Gauss and frequency levels may be increased. After the new position of the tracked node in the actual image frame has been determined, the jets on all Gauss levels are computed at this position. A stored array of jets that was computed for the previous frame, representing the tracked node, is then replaced by a new array of jets computed for the current frame. Use of the Gaussian image pyramid has two main advantages: First, the movements of the nodes are much smaller in terms of pixels on a coarser level than in the original image, which makes tracking possible by performing only a local move instead of an exhaustive search in a large image region. Second, the computation of jet components is much faster for lower frequencies, because the computation is performed with a small kernel window on a down-sampled image, rather than a large kernel window on the original-resolution image.
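An illustrative coarse-to-fine tracking skeleton under stated assumptions: `disparity_move` is a hypothetical per-level callback (it would extract jets at the given position in both frames and apply a phase-based disparity estimate such as the one sketched after the next paragraph), and the Gaussian smoothing parameter is an arbitrary choice.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_pyramid(image, levels=4, sigma=1.0):
    """pyramid[0] is the finest (original) level, pyramid[-1] the coarsest;
    each level is smoothed and down-sampled by a factor of 2."""
    pyr = [np.asarray(image, dtype=float)]
    for _ in range(levels - 1):
        pyr.append(gaussian_filter(pyr[-1], sigma)[::2, ::2])
    return pyr

def track_node_coarse_to_fine(pyr_prev, pyr_cur, pos_estimate, disparity_move):
    """Refine the damped-motion estimate level by level, coarsest level first.

    pos_estimate  -> (x, y) at the finest level, i.e. X_n + D_n
    disparity_move(img_prev, img_cur, pos) -> 2-vector correction on that
                     pyramid level (placeholder for a phase-based estimate)
    """
    levels = len(pyr_cur)
    pos = np.asarray(pos_estimate, float) / (2 ** (levels - 1))  # coarsest coords
    for lvl in range(levels - 1, -1, -1):            # coarse -> fine
        pos = pos + disparity_move(pyr_prev[lvl], pyr_cur[lvl], pos)
        if lvl > 0:
            pos = pos * 2.0                          # map to the next finer level
    return pos                                       # refined X_(n+1), finest level
```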
Note that the correspondence level may be chosen dynamically, e.g., in the case of facial feature tracking, the correspondence level may be chosen depending on the actual size of the face. Also, the size of the Gauss image pyramid may be altered through the tracking process, i.e., the size may be increased when the motion becomes faster and decreased when the motion becomes slower. Typically, the maximum node movement on the coarsest Gauss level is limited to 4 pixels. Also note that the motion estimation is often performed only at the coarsest level. The computation of the displacement vector between two given jets on the same Gauss level (the disparity vector) is now described. To compute the displacement between two consecutive frames, a method is used that was originally developed for disparity estimation in stereo images, based on D. J. Fleet and A. D. Jepson, Computation of component image velocity from local phase information, International Journal of Computer Vision, volume 5, issue 1, pages 77-104, 1990, and W. M. Theimer and H. A. Mallot, Phase-based binocular vergence control and depth reconstruction using active vision, CVGIP: Image Understanding, volume 60, issue 3, pages 343-358, November 1994. The strong variation of the phases of the complex filter responses is used explicitly to compute the displacement with subpixel accuracy (Wiskott, L., "Labeled Graphs and Dynamic Link Matching for Face Recognition and Scene Analysis", Verlag Harri Deutsch, Thun-Frankfurt am Main, Reihe Physik 53 (PhD thesis), 1995). By writing the response J of the jth Gabor filter in terms of amplitude a_j and phase φ_j, a similarity function can be defined as

$$S(J, J', \mathbf{d}) = \frac{\sum_j a_j a'_j \cos(\phi_j - \phi'_j - \mathbf{d}\cdot\mathbf{k}_j)}{\sqrt{\sum_j a_j^2 \,\sum_j a_j'^2}} \tag{6}$$

Let J and J' be two jets at positions X and X' = X + d; the displacement d may be found by maximizing the similarity S with respect to d, the k_j being the wave vectors associated with the filters generating J_j. Because the estimate of d is accurate only for small displacements, i.e., for a large overlap of the Gabor jets, large displacement vectors are treated as a first estimate only, and the process is repeated in the following manner: First, only the filter responses of the lowest frequency level are used, resulting in a first estimate d_1. Next, this estimate is executed and the jet J is recomputed at the position X_1 = X + d_1, which is closer to the position X' of jet J'. Then, the two lowest frequency levels are used for the estimation of the displacement d_2, and the jet J is recomputed at the position X_2 = X_1 + d_2. This is iterated until the highest frequency level used is reached, and the final disparity d between the two start jets J and J' is given as the sum d = d_1 + d_2 + .... Accordingly, displacements of up to half the wavelength of the kernel with the lowest frequency may be computed (see Wiskott 1995, supra). Although the displacements are determined using floating-point numbers, jets may be extracted (i.e., computed by convolution) only at (integer) pixel positions, resulting in a systematic rounding error. To compensate for this subpixel error Δd, the phases of the complex Gabor filter responses should be shifted according to

$$\Delta\phi_j = \Delta\mathbf{d}\cdot\mathbf{k}_j \tag{7}$$

so that the jets appear as if they were extracted at the correct subpixel position. Accordingly, the Gabor jets may be tracked with subpixel accuracy without any further accounting for rounding errors.
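For illustration only, the sketch below is a simplified least-squares approximation of the phase-based disparity estimate: it linearizes the phase differences as (φ_j − φ'_j) ≈ d·k_j rather than maximizing similarity (6) directly, and it only updates the phase residual between iterations, whereas the method described above re-extracts the jet at the shifted position after each level. The `level_of` layout, the weighting, and the sign convention are assumptions of this sketch.

```python
import numpy as np

def estimate_disparity(jet_prev, jet_cur, wave_vectors, level_of):
    """Level-by-level least-squares estimate of the displacement d between
    two jets from the phase differences of their complex coefficients.

    wave_vectors -> (n, 2) array of the k_j of the kernels
    level_of     -> (n,) integer array, frequency level of each coefficient
                    (0 = lowest frequency)
    """
    d = np.zeros(2)
    weights = np.abs(jet_prev) * np.abs(jet_cur)
    dphi = np.angle(jet_prev) - np.angle(jet_cur)
    for used in range(int(level_of.max()) + 1):      # add one level at a time
        sel = level_of <= used
        k, w = wave_vectors[sel], weights[sel]
        # phase residual not yet explained by the current estimate, wrapped
        resid = np.angle(np.exp(1j * (dphi[sel] - k @ d)))
        A = k.T @ (w[:, None] * k)                   # 2x2 weighted normal equations
        b = k.T @ (w * resid)
        d = d + np.linalg.solve(A, b)
    return d
```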
Note that Gabor jets provide a substantial advantage in image processing because the problem of subpixel accuracy is more difficult to address in most other image processing methods. A tracking error may be detected by determining whether a confidence or similarity value is smaller than a predetermined threshold (block 84 of Figure 9). The similarity (or confidence) value S may be computed to indicate how well the two image regions in the two image frames correspond to each other, simultaneously with the computation of the displacement of a node between consecutive image frames. Typically, the confidence value is close to 1, indicating good correspondence. If the confidence value is not close to 1, either the corresponding point in the image has not been found (e.g., because the frame rate was too low compared to the velocity of the moving object), or this image region has changed so drastically from one image frame to the next that the correspondence is no longer well defined (e.g., for the node tracking the pupil of an eye when the eyelid has closed). Nodes having a confidence value below a certain threshold may be deactivated. A tracking error may also be detected when certain geometric constraints are violated (block 86). If many nodes are tracked simultaneously, the geometric configuration of the nodes may be checked for consistency. Such geometric constraints may be fairly loose, e.g., when facial features are tracked, the nose must be between the eyes and the mouth. Alternatively, such geometric constraints may be rather accurate, e.g., a model containing the specific shape information of the tracked face. For an intermediate accuracy, the constraints may be based on a flat plane model. In the flat plane model, the nodes of the face graph are assumed to lie in a single flat plane. For image sequences that start with a frontal view, the tracked node positions may be compared with the corresponding node positions of the frontal graph transformed by an affine transformation to the actual frame. The 6 parameters of the optimal affine transformation are found by minimizing the least-squares error of the node positions. The deviations between the tracked node positions and the transformed node positions are compared with a threshold. Nodes having deviations larger than the threshold are deactivated. The parameters of the affine transformation may be used to determine the pose and relative scale (compared to the initial graph) simultaneously (block 88). Thus, this flat plane model generally ensures that tracking errors do not grow beyond a predetermined threshold. If a node is deactivated repeatedly because of a tracking error, the node may be reactivated at the correct position (block 90), advantageously using bunch graphs that include different poses, and tracking is continued from the corrected position (block 92). After a tracked node has been deactivated, the system may wait until a predefined pose is reached for which a pose-specific bunch graph exists. Otherwise, if only a frontal bunch graph is stored, the system must wait until the frontal pose is reached to correct any tracking errors.
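A sketch of the flat plane (affine) consistency check just described: fit the 6-parameter affine map from the frontal-graph node positions to the tracked positions by least squares, and flag nodes whose residual exceeds a threshold for deactivation and later reinitialization. The 6-pixel threshold is an arbitrary placeholder, not a value from the text.

```python
import numpy as np

def check_flat_plane_constraint(front_nodes, tracked_nodes, threshold=6.0):
    """Return the affine parameters, per-node deviations, and a boolean mask
    of nodes violating the flat plane constraint.

    front_nodes, tracked_nodes: (N, 2) arrays of node positions, N >= 3.
    """
    front = np.asarray(front_nodes, dtype=float)
    tracked = np.asarray(tracked_nodes, dtype=float)
    # design matrix for x' = a*x + b*y + tx,  y' = c*x + d*y + ty
    A = np.hstack([front, np.ones((front.shape[0], 1))])     # (N, 3)
    params, *_ = np.linalg.lstsq(A, tracked, rcond=None)     # (3, 2): 6 parameters
    predicted = A @ params
    deviations = np.linalg.norm(tracked - predicted, axis=1)
    violating = deviations > threshold                       # nodes to deactivate
    return params, deviations, violating
```

The 2×2 linear part of `params` also carries the relative pose and scale information mentioned above (block 88).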
The stored bunch of jets may be compared with the image region surrounding the fit position (e.g., as obtained from the flat plane model), working in the same manner as tracking, except that instead of comparing with the jet of the previous image frame, the comparison is repeated with all jets of the example bunch, and the most similar one is taken. Because the facial features are known, e.g., the actual pose, the scale, and even the rough position, graph matching or an exhaustive search in the image and/or pose space is not needed, and node tracking correction may be performed in real time. For tracking correction, bunch graphs are not needed for many different poses and scales, because rotation in the image plane as well as scale may be taken into account by transforming either the local image region or the jets of the bunch graph accordingly, as shown in Figure 11. In addition to the frontal pose, bunch graphs need to be created only for rotations in depth. The speed of the reinitialization process may be increased by taking advantage of the fact that the identity of the tracked person remains the same during an image sequence. Accordingly, in an initial learning session, a first sequence of the person may be taken, with the person showing the full repertoire of frontal facial expressions. This first sequence may be tracked with high accuracy using the tracking and correction scheme described above, based on a large generalized bunch graph that contains knowledge about many different persons. This process may be performed offline and generates a new personalized bunch graph. The personalized bunch graph may then be used for tracking this person at a fast rate in real time, because the personalized bunch graph is much smaller than the more generalized bunch graph. The speed of the reinitialization process may also be increased by using a partial bunch graph reinitialization. A partial bunch graph contains only a subset of the nodes of a full bunch graph. The subset may be as small as only a single node. Pose estimation using bunch graphs makes use of a family of two-dimensional bunch graphs defined in the image plane. The different graphs within one family account for different poses and/or scales of the head. The landmark finding process attempts to match each bunch graph of the family to the input image in order to determine the pose or size of the head in the image. An example of such a pose estimation procedure is shown in Figure 12. The first step of the pose estimation is equivalent to that of regular landmark finding. The image (block 98) is transformed (blocks 100 and 102) in order to use the graph similarity functions. Then, instead of only one, a family of three bunch graphs is used. The first bunch graph contains only frontal-pose faces (equivalent to the frontal view described above), and the other two bunch graphs contain three-quarter faces (one representing rotations to the left and one to the right). As before, the initial positions of each of the graphs are in the upper-left corner, and the positions of the graphs are scanned across the image; the position and graph returning the highest similarity after the landmark finding process are selected (blocks 104-114). After the initial matching for each graph, the similarities of the final positions are compared (block 116). The graph that best corresponds to the pose given in the image will have the highest similarity. In Figure 12,
the graph rotated to the left provides the best fit to the image, as indicated by its similarity (block 118). Depending on the degree of rotation of the face in the image and the resolution desired, the similarity of the correct graph and the similarities of the graphs for the other poses will vary, becoming very close when the face is approximately half way between the two poses for which the graphs have been defined. By creating bunch graphs for more poses, a finer pose estimation procedure may be implemented that can distinguish more degrees of head rotation and handle rotations in other directions (e.g., up or down).
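A minimal sketch of pose selection over a family of pose-specific bunch graphs; `match_graph` stands for the landmark-finding routine and is a hypothetical placeholder, not a patent-defined API.

```python
def estimate_pose(image, bunch_graphs, match_graph):
    """Run landmark finding once per pose-specific bunch graph and keep the
    graph (and node positions) that returns the highest final similarity.

    bunch_graphs: dict mapping a pose label (e.g. 'frontal', 'left', 'right')
                  to a bunch graph object
    match_graph(image, graph) -> (node_positions, similarity)
    """
    best = None
    for pose, graph in bunch_graphs.items():
        positions, sim = match_graph(image, graph)
        if best is None or sim > best[2]:
            best = (pose, positions, sim)
    return best  # (pose_label, node_positions, similarity)
```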
In order to robustly find a face at an arbitrary distance from the camera, a similar approach may be used in which two or three different bunch graphs, each having a different scale, are employed. The face in the image is assumed to have the same scale as the bunch graph that best matches the facial image. Three-dimensional (3D) landmark finding techniques related to the technique described above may also use multiple bunch graphs adapted to different poses. However, the 3D approach uses only one bunch graph defined in 3D space. The geometry of the 3D graph reflects an average face or head geometry. By extracting jets from images of the faces of several persons at different degrees of rotation, a 3D bunch graph is generated analogously to the 2D approach. Each jet is now parameterized with the three rotation angles. As in the 2D approach, the nodes are located at fiducial points on the head surface. Projections of the 3D graph are then used in the matching process. One important generalization of the 3D approach is that every node has an attached parameterized family of bunch jets adapted to different poses. A second generalization is that the graph may undergo Euclidean transformations in 3D space, and not only transformations in the image plane. The graph matching process may be formulated as a coarse-to-fine approach that first uses graphs with fewer nodes and kernels and then, in subsequent steps, uses denser graphs. The coarse-to-fine approach is particularly suitable if high-precision localization of the feature points in certain areas of the face is desired. Thus, computational effort is saved by adopting a hierarchical approach in which landmark finding is first performed at a coarser resolution, and subsequently the adapted graphs are checked at a higher resolution to analyze certain regions in finer detail. Furthermore, the computational workload may easily be split on a multi-processor machine so that, once the coarse regions are found, a few child processes start working in parallel, each on its own part of the whole image. At the end of the child processes, the processes communicate the feature coordinates that they located to the master process, which appropriately scales and combines them to fit back into the original image, thus considerably reducing the total computation time. As shown in Figure 13, the facial features corresponding to the nodes may be classified so that feature changes such as blinking or mouth opening are not inappropriately indicated as tracking errors. Labels are attached to the different jets in the bunch graph that correspond to the facial features, e.g., eye open/closed, mouth open/closed, etc. The labels may be copied along with the corresponding jet in the bunch graph that is most similar to the current image. The label tracking may be performed continuously, regardless of whether a tracking error is detected. Accordingly, classification nodes may be attached to the tracked nodes for the following:

- eyes open/closed
- mouth open/closed
- tongue visible or not
- hair type classification
- wrinkle detection (e.g., on the forehead)

Thus, tracking allows the use of two information sources. One information source is based on the feature locations, i.e., the node positions, and the other information source is based on the feature classes.
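A sketch of the label (tag) classification step, assuming the `jet_similarity` helper sketched earlier: the label attached to the most similar example jet in the bunch is copied to the tracked node.

```python
import numpy as np

def classify_feature(image_jet, bunch_jets, bunch_labels, jet_similarity):
    """Return the label of the most similar bunch jet (e.g. 'mouth open' /
    'mouth closed') together with its similarity value."""
    sims = [jet_similarity(image_jet, bj) for bj in bunch_jets]
    best = int(np.argmax(sims))
    return bunch_labels[best], sims[best]
```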
The feature class information is more texture based and, by comparing the local image region with a set of stored examples, may function at lower resolution and tracking accuracy than the feature class information that is based solely on the node positions. The facial sensing of the invention may be applied to the creation and animation of static and dynamic avatars, as shown in Figure 14. The avatar may be based on a generic facial model or on a facial model specific to a person. Tracking and facial expression recognition may be used for the incarnation of the avatar with the person's features. The generic facial model may be adapted to a representative number of individuals and may be adapted to perform realistic animation and rendering of a wide range of facial features and/or expressions. The generic model may be obtained by the following techniques:

1. A monocular (single-camera) system (T. Akimoto et al., 1993) may be used to produce a realistic avatar for use in low-end tele-immersion systems. Face profile information of the individuals, as seen from the sagittal and coronal planes, may be merged to obtain the avatar.
2. Stereo camera systems, which are capable of accurate 3-D measurement when the cameras are fully calibrated (the camera parameters are computed through a calibration process). An individual facial model may then be obtained by fitting the generic facial model to the obtained 3-D data. Because stereo algorithms do not provide accurate information in non-textured areas, projection of active textured light may be used.
3. Feature-based stereo techniques, in which markers are used on the individual face to compute precise 3-D positions of the markers. The 3-D information is then used to fit a generic model.
4. Three-dimensional digitizers, in which a sensor or locating device is moved over each surface point to be measured.
5. Active structured light, in which patterns are projected and the resulting video stream is processed to extract 3-D measurements.
6. Laser-based surface scanning devices (such as those developed by Cyberware, Inc.) that provide accurate face measurements.
7. A combination of the previous techniques.

These different techniques are not equally convenient for the user. Some are able to obtain measurements of the individual in a single process (the face being in the desired posture for the duration of the measurement), while others need a collection of samples and are more cumbersome to use. A three-dimensional head model for a specific person may be generated using two facial images showing a front view and a profile view. Facial sensing allows efficient and robust generation of the 3-D head model. Facial contour extraction is performed together with the localization of the person's eyes, nose, mouth and cheek. This feature location information may be obtained by using the elastic bunch graph technique in combination with hierarchical matching to automatically extract the facial features, as shown in Figure 14. The feature location information may then be combined (see T. Akimoto and Y. Suenaga, Automatic Creation of 3D Facial Models, IEEE Computer Graphics & Applications, pages 16-22, September 1993) to produce a 3D model of the person's head. A generic three-dimensional head model is adapted so that its proportions are related to the measurements of the image.
Finally, both the side and front images are combined to obtain a better texture model for the avatar, i.e., the front view is used to texture map the front of the model and the side view is used for the side of the model. Facial sensing improves this technique because the extracted features may be labeled (known points may be defined in the profile), so that the two images need not be taken simultaneously. An avatar image may be animated by the following commonly used techniques (see F. Parke and K. Waters, Computer Facial Animation, A K Peters Ltd., Wellesley, Massachusetts, 1996):

1. Key framing and geometric interpolation, in which a number of key poses and expressions are defined. Geometric interpolation is then used between the key frames to provide animation. Such a system is frequently referred to as performance based (or performance driven).
2. Direct parameterization, which directly maps expressions and poses to a set of parameters that are then used to drive the model.
3. Pseudo-muscle models, which simulate muscle actions using geometric deformations.
4. Muscle-based models, in which muscles and skin are modeled using physical models.
5. 2-D and 3-D morphing, which uses 2D morphing between images of a video stream to produce 2D animation. A set of landmarks is identified and used to warp between two images of a sequence. Such a technique may be extended to 3D (see F. Pighin, J. Hecker, D. Lischinski, R. Szeliski, and D. H. Salesin, Synthesizing Realistic Facial Expressions from Photographs, In SIGGRAPH 98 Conference Proceedings, pages 75-84, July 1998).
6. Other approaches such as control points and finite element models.

For these techniques, facial sensing enhances the animation process by providing automatic extraction and characterization of facial features. The extracted features may be used to interpolate expressions in the case of key framing and interpolation models, or to select parameters for direct parameterized models or pseudo-muscle or muscle models. In the case of 2-D and 3-D morphing, facial sensing may be used to automatically select features on a face that provide the appropriate information to perform the geometric transformation. An example of avatar animation that uses facial feature tracking and classification may be shown with respect to Figure 15. During a training phase, the individual is asked to perform a predetermined series of facial expressions (block 120), and sensing is used to track the features (block 122). At predetermined locations, jets and image patches are extracted for the various expressions (block 124). Image patches surrounding the facial features are collected along with the jets 126 extracted from these features. These jets are used later to classify or tag facial features. This is done by using these jets to generate a personalized bunch graph and by applying the classification method described above.
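An illustrative sketch of the training-phase gallery just described: for each requested expression, the tracked node positions, the surrounding image patches, and the jets extracted there are stored under an expression tag. All helper callables (`track_nodes`, `extract_patch`, `extract_jet_at`) are placeholders for routines sketched earlier, not patent-defined APIs.

```python
def build_expression_gallery(frames, expressions, track_nodes,
                             extract_patch, extract_jet_at):
    """Collect, per expression tag, the patch and jet at every tracked node."""
    gallery = {}
    for tag, frame in zip(expressions, frames):
        nodes = track_nodes(frame)                    # tracked node positions
        gallery[tag] = [
            {"node": i,
             "patch": extract_patch(frame, (x, y)),   # image piece around the feature
             "jet": extract_jet_at(frame, (x, y))}    # jet used later for tagging
            for i, (x, y) in enumerate(nodes)
        ]
    return gallery
```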
As shown in Figure 16, to animate an avatar, the system transmits all of the image patches 128, as well as the image of the whole face 130 (the "face frame") minus the parts shown in the image patches, to a remote site (blocks 132 and 134). The software for the animation program may also need to be transmitted. The sensing system then observes the user's face, and facial sensing is applied to determine which of the image patches is most similar to the current facial expression (blocks 136 and 138). The image tags are transmitted to the remote site (block 140), allowing the face assembly module 142 to use the correct image patches. To blend the image patches smoothly into the face frame, Gaussian blurring may be employed. For a realistic rendering, local image morphing may be needed, because the animation may not be continuous in the sense that a succession of images is presented as dictated by the sensing. The morphing may be realized using a linear interpolation of corresponding points in image space. To create intermediate images, linear interpolation is applied using the following equations:

$$P_i = (2 - i)\,P_1 + (i - 1)\,P_2$$
$$I_i = (2 - i)\,I_1 + (i - 1)\,I_2 \tag{8}$$

where P_1 and P_2 are corresponding points in the images I_1 and I_2, and I_i is the interpolated image, with 1 < i < 2. Note that, for efficient processing, the image interpolation may be implemented using a precomputed hash table for P_i and I_i. The number of points used, their accuracy, and the interpolated facial model generally determine the quality of the resulting image.
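Equation (8) as a short NumPy sketch for blending two expression patches; in a full implementation the corresponding points P_1, P_2 would be interpolated with the same weights to drive a warp, which is omitted here.

```python
import numpy as np

def interpolate_patch(img1, img2, i):
    """Intermediate image I_i between two expression patches, eq. (8):
    I_i = (2 - i) * I_1 + (i - 1) * I_2, with 1 <= i <= 2."""
    assert 1.0 <= i <= 2.0
    return (2.0 - i) * np.asarray(img1, dtype=float) \
         + (i - 1.0) * np.asarray(img2, dtype=float)
```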
Thus, the face reconstructed at the remote display may be composed by assembling pieces of images corresponding to the expressions sensed during the learning step. Accordingly, the avatar exhibits features corresponding to the person commanding the animation. Thus, at initialization, a set of cropped images corresponding to each tracked facial feature and a "face container" (the resulting image of the face after each feature is removed) are provided. The animation is then started, and facial sensing is used to generate specific tags, which are transmitted as described previously. Decoding occurs by selecting the image pieces associated with the transmitted tag, e.g., the image of the mouth labeled with a "smiling mouth" tag 146 (Figure 16). A more advanced avatar animation model may be obtained by integrating the dynamic texture generation described above with more conventional volume morphing techniques, as shown in Figure 17. To obtain the volume morphing, the sensed node positions may be used to drive control points on a mesh 150. Next, the textures 152 dynamically generated using the tags are mapped onto the mesh to generate a realistic head image 154. An alternative to using the sensed node positions as drivers of the control points on the mesh is to use the tags to select local morph targets. A morph target is a local mesh configuration that has been determined for the different facial expressions and gestures for which sample jets have been collected. These local mesh geometries can be determined by stereo vision techniques. The use of morph targets is further developed in the following references (see J. R. Kent, W. E. Carlson, and R. E. Parent, Shape Transformation for Polyhedral Objects, In SIGGRAPH 92 Conference Proceedings, volume 26, pages 47-54, July 1992; and Pighin et al. 1998, supra).
A useful extension of vision-based avatar animation is to integrate the facial sensing with speech analysis in order to synthesize correct lip movement, as shown in Figure 18. The lip synchronization technique is particularly useful for mapping lip movements resulting from speech onto the avatar. It is also useful as a backup in case the vision-based lip tracking fails. While the foregoing describes the preferred embodiments of the present invention, those skilled in the art will understand that various changes may be made to the preferred embodiments without departing from the scope of the invention. The invention is defined solely by the following claims.

Claims (25)

1. A method for detecting features in a sequence of image frames, comprising a step of transforming each image frame using a wavelet transformation to generate a transformed image frame, a step of initializing nodes of a model graph, each node associated with a feature-specific wavelet jet, onto locations of the transformed image frame by moving the model graph across the transformed image frame and placing the model graph at a position in the transformed image frame of maximum jet similarity between the wavelet jets of the nodes and the locations of the transformed image frame determined as the model graph is moved across the transformed image frame, and a step of tracking the location of one or more node locations of the model graph between image frames, characterized in that the method comprises: a step of reinitializing the location of a tracked node if the node's location deviates beyond a predetermined position constraint between image frames.
2. A method for detecting features, according to claim 1, characterized in that the model graph used in the initializing step is based on a predetermined pose.
3. A method for detecting features, according to claim 1, characterized in that the tracking step tracks the node locations using elastic bunch graph matching.
4. A method for detecting features, according to claim 1, characterized in that the tracking step uses linear position prediction to predict node locations in a subsequent image frame, and the reinitializing step reinitializes a node location based on a deviation from the predicted node location that is greater than a predetermined threshold deviation.
5. A method for detecting features, according to claim 1, characterized in that the predetermined position constraint is based on a geometric position constraint associated with relative positions between the node locations.
6. A method for detecting features, according to claim 1, characterized in that the node locations are transmitted to a remote site for animating an avatar image.
7. A method for detecting facial features, according to claim 1, characterized in that the step of tracking the node locations includes lip synchronization based on audio signals associated with the movement of node locations specific to a mouth generating the audio signals.
8. A method for detecting facial features, according to claim 7, characterized in that the reinitializing step is performed using bunch graph matching.
9. A method for detecting facial features, according to claim 8, characterized in that the bunch graph matching is performed using a partial bunch graph.
10. A method for detecting features, according to claim 1, characterized in that the tracking step includes determining a facial characteristic.
11. A method for detecting features, according to claim 10, characterized in that the method further comprises transmitting the node locations and facial characteristics to a remote site for animating an avatar image having facial characteristics that are based on the facial characteristics determined by the tracking step.
12. A method for detecting features, according to claim 10, characterized in that the facial characteristic determined by the tracking step is whether the mouth is open or closed.
13. A method for detecting features, according to claim 10, characterized in that the facial characteristic determined by the tracking step is whether the eyes are open or closed.
14. A method for detecting features, according to claim 10, characterized in that the facial characteristic determined by the tracking step is whether the tongue is visible in the mouth.
15. A method for detecting features, according to claim 10, characterized in that the facial characteristic determined by the tracking step is based on facial wrinkles detected in the image.
16. A method for detecting features, according to claim 10, characterized in that the facial characteristic determined by the tracking step is based on hair type.
17. A method for detecting features, according to claim 10, characterized in that each facial characteristic is associated by training with an image tag that identifies an image segment of the image frame that is associated with the facial characteristic.
18. A method for detecting features, according to claim 17, characterized in that the image segments identified by the associated image tags are morphed into an avatar image.
19. A method for detecting features, according to claim 17, characterized in that the node locations and the feature tags are used for volume morphing of the corresponding image segments into a three-dimensional image.
20. A method for detecting features, according to claim 10, wherein the model graph comprises 18 location nodes associated with distinguishing features on a human face.
21. A method for detecting features according to claim 20, characterized in that the 18 node locations of the face are associated, respectively, with: the pupil of the right eye; the pupil of the left eye; the top of the nose; the right corner of the right eyebrow; the left corner of the right eyebrow; the right corner of the left eyebrow; the left corner of the left eyebrow; the right nostril; the tip of the nose; the left nostril; the right corner of the mouth; the center of the upper lip; the left corner of the mouth; the center of the lower lip; the bottom of the right ear; the top of the right ear; the top of the left ear; and the bottom of the left ear.
22. A method for individualizing a head model based on facial feature finding, characterized in that the facial feature finding is based on elastic bunch graph matching.
23. A method for individualizing a head model, according to claim 22, characterized in that the matching is performed using a coarse-to-fine approach.
24. An apparatus for detecting features in a sequence of image frames, characterized in that it comprises means for transforming each image frame using a wavelet transformation to generate a transformed image frame, means for initializing nodes of a model graph, each node associated with a feature-specific wavelet jet, onto locations of the transformed image frame by moving the model graph across the transformed image frame and placing the model graph at a position in the transformed image frame of maximum jet similarity between the wavelet jets and the locations of the transformed image frame determined as the model graph is moved across the transformed image frame, and means for tracking the location of one or more nodes of the model graph between image frames, characterized in that the apparatus further comprises: means for reinitializing a tracked node if the node's location deviates beyond a predetermined position constraint between image frames.
25. An apparatus for detecting features, according to claim 24, characterized in that it further comprises: means for determining a facial characteristic; and means for animating an avatar image having facial characteristics that are based on the facial characteristics determined by the determining means.
MXPA/A/2000/010044A 1998-04-13 2000-10-13 Wavelet-based facial motion capture for avatar animation MXPA00010044A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US60/081,615 1998-04-13
US09188079 1998-11-06

Publications (1)

Publication Number Publication Date
MXPA00010044A true MXPA00010044A (en) 2002-03-26


Similar Documents

Publication Publication Date Title
EP1072018B1 (en) Wavelet-based facial motion capture for avatar animation
US6272231B1 (en) Wavelet-based facial motion capture for avatar animation
Wechsler Reliable Face Recognition Methods: System Design, Implementation and Evaluation
Noh et al. A survey of facial modeling and animation techniques
Koch Dynamic 3-D scene analysis through synthesis feedback control
Gavrila et al. 3-D model-based tracking of humans in action: a multi-view approach
KR100653288B1 (en) Face recognition from video images
Pighin et al. Modeling and animating realistic faces from images
Gavrila et al. Tracking of humans in action: A 3-D model-based approach
Lin et al. Extracting 3D facial animation parameters from multiview video clips
Gavrila Vision-based 3-D tracking of humans in action
US6931145B1 (en) Method and apparatus for measuring motion of an object surface by multi-resolution analysis using a mesh model
KR20190069750A (en) Enhancement of augmented reality using posit algorithm and 2d to 3d transform technique
Jian et al. Realistic face animation generation from videos
MXPA00010044A (en) Wavelet-based facial motion capture for avatar animation
Abeysundera et al. Nearest neighbor weighted average customization for modeling faces
Kang A structure from motion approach using constrained deformable models and appearance prediction
US20230237753A1 (en) Dynamic facial hair capture of a subject
Sun et al. A Local Correspondence-aware Hybrid CNN-GCN Model for Single-image Human Body Reconstruction
Chang et al. Facial model adaptation from a monocular image sequence using a textured polygonal model
Agarwal et al. Reconstruction of 2D Images Into 3D
Tjahyaningtijas et al. Method Comparison of 3D Facial Reconstruction Coresponding to 2D Image
Tang Human face modeling, analysis and synthesis
Yu et al. 2D/3D model-based facial video coding/decoding at ultra-low bit-rate
Lin et al. High resolution calibration of motion capture data for realistic facial animation