MXPA00007221A - Method and system for combining video sequences with spacio-temporal alignment - Google Patents

Method and system for combining video sequences with spacio-temporal alignment

Info

Publication number
MXPA00007221A
MXPA00007221A MXPA/A/2000/007221A
Authority
MX
Mexico
Prior art keywords
sequences
given
sequence
video
composite
Prior art date
Application number
MXPA/A/2000/007221A
Other languages
Spanish (es)
Inventor
Serge Ayer
Martin Vetterli
Original Assignee
Rale
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rale filed Critical Rale
Publication of MXPA00007221A publication Critical patent/MXPA00007221A/en

Abstract

Given two video sequences, a composite video sequence can be generated which includes visual elements from each of the given sequences, suitably synchronized and represented in a chosen focal plane. For example, given two video sequences each showing a different contestant individually racing the same downhill course, the composite sequence can include elements from each of the given sequences to show the contestants as if racing simultaneously. A composite video sequence can also be made by similarly combining a video sequence with an audio sequence.

Description

METHOD AND SYSTEM FOR COMBINING VIDEO SEQUENCES WITH SPATIO-TEMPORAL ALIGNMENT

TECHNICAL FIELD

The present invention relates to visual displays and, more specifically, to time-dependent visual displays.

BACKGROUND OF THE INVENTION

In video displays, for example in television coverage of sports events, special visual effects can be used to enhance a viewer's appreciation of the action. For example, in the case of a team sport such as football, instant replay affords the viewer a second chance at viewing critical moments of the game. Such moments can be replayed in slow motion, and superimposed features such as hand-drawn circles, arrows and letters can be included for emphasis and guidance. These techniques can be used also with other types of sports, such as racing competitions.

With team sports, techniques such as instant replay are most appropriate, as scenes typically include many participants. Similarly, in the 100-meter dash, for example, the scene includes the contestants side by side, and a slow-motion presentation at the finish line brings out the essence of the race. On the other hand, where starting times are staggered, e.g. as necessitated for practical and safety reasons in certain races such as downhill skiing or ski jumping, the actual scene typically includes a single contestant only.

SUMMARY OF THE INVENTION

For enhanced viewing, by sports fans as well as by a contestant and his coach, displays are desired in which the element of competition between contestants is manifested. This is of particular interest where contestants perform individually, as in downhill skiing, and can apply also to group races in which qualification schemes are used to decide who advances, from quarterfinal to semifinal to final.
We have recognized that, given two or more video sequences, a composite video sequence can be generated which includes visual elements from each of the given sequences, suitably synchronized and represented in a chosen focal plane. For example, given two video sequences each showing a different competitor participating individually in the same downhill ski race, the composite sequence can include elements of each of the given sequences to show the competitors as if they were participating simultaneously.
A composite video sequence can be produced also by similarly combining one or more video sequences with one or more sequences of a different kind, such as audio sequences, for example.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a block diagram of a preferred embodiment of the invention. Figures 2A and 2B are schematic representations of different skiers passing in front of a video camera. Figures 3A and 3B are schematic representations of images recorded by the video camera, corresponding to Figures 2A and 2B. Figure 4 is a schematic representation of Figures 2A and 2B combined. Figure 5 is a schematic representation of the desired video image, with the scenes of Figures 3A and 3B projected in a chosen focal plane. Figure 6 is a frame from a composite video sequence generated with a prototype implementation of the invention.

DETAILED DESCRIPTION

Conceptually, the invention can be appreciated in analogy with two-dimensional (2D) "morphing", i.e. the smooth transformation, deformation or mapping of one image, I1, into another image, I2, in computer graphics. Such morphing leads to a video sequence which shows the transformation of I1 into I2, e.g. of an image of an apple into an image of an orange, or of one human face into another. The video sequence is three-dimensional, having two spatial and one temporal dimension. Parts of the sequence may be of special interest, such as intermediate images, e.g. the average of two faces, or composites, e.g. a face with the eyes of I1 and the smile of I2. Thus, morphing between images can be appreciated as a form of merging of features of the images.

The invention is concerned with a more complicated task, namely the morphing of two video sequences. Morphing or mapping from one sequence to another leads to four-dimensional data which cannot be easily visualized; however, any intermediate combination, or composite sequence, leads to a new video sequence.
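The intermediate-image idea above can be illustrated as a pixel-wise weighted average (cross-dissolve). This is a minimal sketch, not the patented method; plain Python lists stand in for image arrays, and all names are illustrative:

```python
def blend_images(img1, img2, alpha):
    """Cross-dissolve: pixel-wise weighted average of two equal-size
    grayscale images. alpha=0 yields img1, alpha=1 yields img2, and
    intermediate alphas yield intermediate images as in 2D morphing."""
    return [
        [(1 - alpha) * p1 + alpha * p2 for p1, p2 in zip(row1, row2)]
        for row1, row2 in zip(img1, img2)
    ]

# The image "halfway" between a dark and a bright 2x2 image:
I1 = [[0, 0], [0, 0]]
I2 = [[100, 100], [100, 100]]
mid = blend_images(I1, I2, 0.5)  # every pixel becomes 50.0
```

A full morph would also warp geometry before blending; the blend alone conveys the feature-merging notion the text describes.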
Of particular interest is the generation of a new video sequence combining elements of two or more given sequences, with suitable spatio-temporal alignment or synchronization, and projection into a chosen focal plane. For example, in the case of a sports race such as downhill skiing, video sequences obtained from two contestants having traversed the course separately can be time-synchronized by selecting the frames corresponding to the start of the race. Alternatively, the sequences may be synchronized for coincident passage of the contestants at a critical point such as a slalom gate, for example. The chosen focal plane may be the same as the focal plane of one or the other of the given sequences, or it may be suitably constructed yet different from both.

Of interest also is synchronization based on a distinctive event, e.g., in track and field, a long-jump contestant lifting off from the ground or touching down again. In this respect it is of further interest to synchronize two sequences so that both takeoff and touchdown coincide, which requires time scaling. The resulting composite sequence affords a comparison of trajectories.

With the video sequences synchronized, they can be further aligned spatially, e.g. to generate a composite sequence giving the impression of the contestants competing simultaneously. In a simple approach, spatial alignment can be performed on a frame-by-frame basis. Alternatively, by taking several frames from a camera into account, the view in an output image can be extended to include background elements from several sequential images.

Forming a composite image involves representing the component scenes in a chosen focal plane, typically requiring a considerable amount of computerized processing, e.g. as illustrated by Figure 1 for the special case of two video input sequences.

Figure 1 shows two image sequences IS1 and IS2 being fed to a module 11 for synchronization into synchronized sequences IS1' and IS2'. For example, the sequences IS1 and IS2 may have been obtained for two contestants in a downhill race, and they may be synchronized by the module 11 so that the first frame of each sequence corresponds to the contestant leaving the starting gate.

The synchronized sequences are fed to a module 12 for background/foreground extraction, as well as to a module 13 for camera coordinate transformation estimation. For each of the image sequences, the module 12 yields a weighted-mask sequence (WMS), with each weighted mask being an array having an entry for each pixel position and differentiating between the scene of interest and the background/foreground. Generation of the weighted-mask sequence involves a computerized search of the images for elements which, from frame to frame, move relative to the background. The module 13 yields sequence parameters SP1 and SP2 including camera angles of azimuth and elevation, and camera focal length and aperture, among others. These parameters can be determined from each video sequence by computerized processing including interpolation and matching of images. Alternatively, a suitably equipped camera can furnish the sequence parameters directly, thus obviating their estimation by computerized processing.

The weighted-mask sequences WMS1 and WMS2 are fed to a module 14 for computing "alpha-layer" sequences. The alpha layer is an array which specifies how much weight each pixel in each of the images should receive in the composite image. The sequence parameters SP1 and SP2, as well as the alpha layers, are fed to a module 15 for projecting the aligned image sequences into a chosen focal plane, resulting in the desired composite image sequence. This is exemplified further by Figures 2A, 2B, 3A, 3B, 4 and 5.
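The alpha-weighted superposition performed in the final projection stage can be sketched as follows. This is a simplified, hypothetical illustration (not the patent's implementation) assuming already-aligned grayscale frames and per-pixel alpha weights that sum to 1 at every position:

```python
def composite_frame(frames, alphas):
    """Combine spatially aligned frames into one composite frame:
    each output pixel is the alpha-weighted sum of the corresponding
    input pixels. `alphas[k][y][x]` is the weight of sequence k's
    pixel at (y, x); weights are assumed to sum to 1 per position."""
    h, w = len(frames[0]), len(frames[0][0])
    out = [[0.0] * w for _ in range(h)]
    for frame, alpha in zip(frames, alphas):
        for y in range(h):
            for x in range(w):
                out[y][x] += alpha[y][x] * frame[y][x]
    return out

# Two 1x2 frames; the first pixel is an even mix, the second favors frame 2:
result = composite_frame(
    [[[10, 20]], [[30, 40]]],
    [[[0.5, 0.25]], [[0.5, 0.75]]],
)  # [[20.0, 35.0]]
```

In practice the foreground of each sequence would receive weight near 1 against the other sequence's background, so each skier remains fully visible in the composite.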
Figure 2A shows a skier A in the course of passing a marker 21, with the scene being recorded from a camera position 22 with a viewing angle f(A). The position reached by A may be after an elapse of t(A) seconds from the skier's leaving the starting gate of the race.

Figure 2B shows another skier, B, in a similar position relative to the marker 21, with the scene being recorded from a different camera position 23 and with a different, more narrow viewing angle f(B). For comparison with skier A, the shown position of skier B corresponds to an elapse of t(A) seconds from B's leaving the starting gate. As illustrated, within t(A) seconds skier B has traveled farther along the course than skier A.

Figures 3A and 3B show the respective resulting images.

Figure 4 shows a combination, with Figures 2A and 2B superimposed at a common camera location.

Figure 5 shows the resulting desired image projected into a chosen focal plane, affording immediate viewing of skiers A and B as if they had raced together for t(A) seconds from a common start.

Figure 6 shows a frame from a composite image sequence generated by a prototype implementation of the technique, with the frame corresponding to an intermediate timing point. The value of 57.84 is the time, in seconds, which the slower skier took to reach the intermediate timing point, and the value of +0.04 (seconds) indicates by how much he is trailing the faster skier.

The prototype implementation of the technique was written in the "C" programming language, for execution on a SUN workstation or a PC, for example. Dedicated firmware or hardware can be used for enhanced processing efficiency, and especially for signal processing involving matching and interpolation.

Individual aspects and variations of the technique are described below in further detail.

A.
Background/Foreground Extraction

In each sequence, background and foreground can be extracted using a suitable motion estimation method. The method should be robust for background/foreground extraction where image sequences are acquired by a moving camera and where the acquired scene contains moving agents or objects. Required also is temporal consistency, so that the background/foreground extraction remains stable over time. Where both the camera and the agents are moving predictably, e.g. at constant speed or constant acceleration, temporal filtering can be used to enhance temporal consistency.

Based on determinations of the speed with which the background moves due to camera motion, and of the speed of the skier relative to the camera, the background/foreground extraction yields a weighted mask which differentiates between pixels which follow the camera and pixels which do not. The weighted mask is then used in generating the alpha layer for the final composite sequence.

B. Alignment of Sequences in Space and Time

Temporal alignment involves the selection of corresponding frames in the sequences, according to a chosen criterion. Typically, in sports racing, this is the time code of each sequence as delivered by the timing system, e.g. to select the frames corresponding to the start of the race. Other possible time criteria are the times corresponding to a designated spatial location such as a gate or a jump entry, for example.

Spatial alignment is performed by choosing a reference coordinate system for each frame and by estimating the camera coordinate transformation between the reference system and the corresponding frame of each sequence. Such estimation may be unnecessary when camera data such as camera position, viewing direction and focal length are recorded along with the video sequence.
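The patent does not prescribe a particular motion-estimation algorithm; as an assumed stand-in for the background/foreground extraction and temporal filtering described above, a crude frame-differencing mask with a majority vote over consecutive masks can be sketched as follows (purely illustrative; real systems compensate camera motion first):

```python
def foreground_mask(prev_frame, frame, threshold=10):
    """Crude motion-based segmentation: mark a pixel as foreground (1)
    when its intensity changed by more than `threshold` between two
    frames, assuming camera motion has already been compensated."""
    return [
        [1 if abs(a - b) > threshold else 0 for a, b in zip(r1, r2)]
        for r1, r2 in zip(prev_frame, frame)
    ]

def temporally_filter(masks):
    """Majority vote across a window of consecutive masks, improving
    the temporal consistency of the background/foreground decision."""
    n = len(masks)
    h, w = len(masks[0]), len(masks[0][0])
    return [
        [1 if sum(m[y][x] for m in masks) * 2 > n else 0 for x in range(w)]
        for y in range(h)
    ]
```

The filtered binary mask corresponds to the weighted mask of the text, with the 0/1 decision softened to fractional weights in an actual implementation.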
Typically, the reference coordinate system is chosen as corresponding to one of the given sequences, namely the one to be used for the composite sequence. As described below, spatial alignment can be performed on a single-frame or multiple-frame basis.

B1. Spatial Alignment on a Single-Frame Basis

At each step of this technique, alignment uses one frame from each of the sequences. As each of the sequences includes moving agents/objects, the method for estimating the camera coordinate transformation needs to be robust. To this end, the masks generated in background/foreground extraction can be used. Also, as for background/foreground extraction, temporal filtering can be used to enhance the temporal consistency of the estimation process.

B2. Spatial Alignment on a Multiple-Frame Basis

In this technique, spatial alignment is applied to reconstructed images of the scene visualized in each sequence. Each video sequence is first analyzed over multiple frames for reconstruction of the scene, using a technique similar to background/foreground extraction, for example. Once each scene has been separately reconstructed, e.g. to cover as much of the background as possible, the scenes can be spatially aligned as described above.

This technique allows free choice of the field of view of every frame in the scene, in contrast to the single-frame technique where the field of view has to be chosen as that of the reference frame. Thus, in the multiple-frame technique, in case not all contestants are visible in all frames, the field and/or angle of view of the composite image can be chosen such that all contestants are visible.

C.
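For a roughly planar scene (or a purely rotating camera), the camera coordinate transformation between a frame and the reference system can be modeled as a 3x3 homography. This parameterization is an assumption for illustration, not one the patent prescribes; a minimal sketch of applying it to a pixel coordinate:

```python
def project_point(H, x, y):
    """Map a pixel coordinate (x, y) into the reference coordinate
    system (the chosen focal plane) using a 3x3 homography H, via
    homogeneous coordinates and perspective division."""
    xs = H[0][0] * x + H[0][1] * y + H[0][2]
    ys = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return xs / w, ys / w

# The identity homography leaves coordinates unchanged:
I = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# A scale-by-2 with a 1-pixel horizontal shift:
S = [[2, 0, 1], [0, 2, 0], [0, 0, 1]]
```

In a full system, H would be estimated robustly from image matches (using the foreground masks to exclude moving contestants), and every pixel of a frame would be warped this way into the chosen focal plane.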
Superimposition of Video Sequences

After background/foreground extraction in each sequence, and after estimation of the camera coordinate transformation between each sequence and a reference system, the sequences can be projected into a chosen focal plane for simultaneous visualization on a single display. Alpha layers for each frame of each sequence are generated from the respective weighted background/foreground masks. Thus, the composite sequence is formed by transforming each sequence into the chosen focal plane and superimposing the different transformed images with the corresponding alpha weights.

D. Applications

Apart from ski racing as used in the example, the techniques of the invention can be applied to other speed/distance sports such as car racing and track and field, for example.

Further to visualization, one application of a composite video sequence made in accordance with the invention is apparent from Figure 6, namely the determination of the time differential between two runners at any desired location of a race. This involves simply counting the number of frames in the sequence between the two runners passing the location, and multiplying by the time interval between frames.

A composite sequence can be broadcast over existing facilities such as network, cable and satellite television, and as video on the Internet, for example. Such sequences can be offered as services, e.g. on a channel separate from a strictly real-time main channel. Or, instead of broadcast on a separate channel, a composite video sequence can be included as a portion of a regular channel, e.g. displayed in a corner of the screen.

In addition to broadcast uses, generated composite video sequences can be used for training purposes, and, aside from applications in sports, there are potential industrial applications such as car crash analysis, for example.
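The frame-counting computation of the time differential can be sketched directly. The frame rate of 25 frames/s below is an assumption for illustration, chosen because one frame at 25 fps corresponds to the 0.04 s gap shown in Figure 6:

```python
def differential_time(frame_a, frame_b, fps=25.0):
    """Time gap between two contestants passing the same location,
    given the frame indices at which each passes it and the frame
    rate of the composite sequence (fps is assumed, not from the
    patent)."""
    return abs(frame_a - frame_b) / fps

# If skier A passes a gate at frame 1446 and skier B at frame 1447,
# at 25 frames per second the gap is one frame, i.e. 0.04 seconds.
gap = differential_time(1446, 1447)
```

Sub-frame accuracy would require interpolating the crossing instant between frames, but the frame count alone already gives the resolution shown in the Figure 6 overlay.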
It is understood that composite sequences may be of higher dimensionality, such as composite stereo-video sequences, for example.

In another application, one of the given sequences is an audio sequence to be synchronized with a video sequence. Specifically, given a video sequence of an actor or singer, A, speaking a sentence or singing a song, and an audio sequence of another actor, B, doing the same, the technique can be used to generate a dubbed or "lip-synched" sequence of actor A speaking or singing with the voice of B. In this case, as more than a mere time shift may be required, dynamic programming techniques can be used for synchronization.

The spatio-temporal alignment method can be applied also in the biomedical field. For example, after orthopedic surgery it is important to monitor the progress of a patient's recovery. This can be achieved by comparing specified movements of the patient over a period of time. In accordance with an aspect of the invention, such a comparison can be made very accurately by synchronizing the start and end of the movement, and by aligning the limbs to be monitored in two or more video sequences.

Another application is in car crash analysis. The technique can be used to accurately compare the deformation of different cars crashed in similar situations, to ascertain the extent of the difference. Also in crash analysis, it is important to compare the effects on crash-test dummies placed in the simulated crashes. Again, in two crashes with the same type of car, one can accurately compare how the dummies are affected, depending on configurations such as the seatbelt arrangement, for example.
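The dynamic-programming synchronization mentioned above is commonly realized as dynamic time warping (DTW), which allows the non-uniform time scaling that a mere shift cannot provide. A minimal sketch over one-dimensional feature sequences (an illustration of the general technique, not the patent's specified algorithm):

```python
def dtw_cost(a, b):
    """Dynamic time warping: minimal cumulative alignment cost between
    two feature sequences a and b, permitting local stretching and
    compression of the time axis (as needed for lip synchronization)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # D[i][j]: best cost of aligning a[:i] with b[:j].
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
            D[i][j] = abs(a[i - 1] - b[j - 1]) + step
    return D[n][m]
```

For lip-sync, a and b might be per-frame mouth-opening measures and per-window audio energy, respectively; the warping path (recoverable by backtracking through D) then tells which video frame to pair with which audio segment.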

Claims (22)

CLAIMS

1. A method for generating a composite video sequence from a plurality of given video sequences, comprising: (a) synchronizing the given sequences; and (b) forming the composite sequence from the synchronized sequences as projected into a chosen focal plane.

2. The method according to claim 1, wherein synchronizing is with respect to a timed event in the given sequences.

3. The method according to claim 1, wherein synchronizing is with respect to a common spatial event in the given sequences.

4. The method according to claim 1, wherein synchronizing is with respect to two events in each of the given sequences, with time scaling to equalize the time between the events.

5. The method according to claim 1, wherein the given video sequences have camera parameters including camera location and focal length, wherein the chosen focal plane corresponds to the focal plane of one of the given sequences, and wherein the composite sequence is as viewed from the camera location of the one of the given sequences.

6. The method according to claim 1, wherein forming the composite sequence is on a frame-by-frame basis.

7. The method according to claim 1, wherein forming the composite sequence is based on several frames of at least one of the sequences, for an expanded field of view in the composite sequence as compared with the field of view of the one of the sequences.

8. The method according to claim 1, wherein the given video sequences are of a sports event.

9. The method according to claim 8, wherein the sports event is a ski race.

10. The method according to claim 8, wherein the sports event is a car race.

11.
A system for generating a composite video sequence from a plurality of given video sequences, comprising: (a) a device for synchronizing the given sequences; and (b) a device for forming the composite sequence from the synchronized sequences as projected into a chosen focal plane.

12. The system according to claim 11, wherein the device for synchronizing comprises a device for aligning the given sequences with respect to a timed event in the given sequences.

13. The system according to claim 11, wherein the device for synchronizing comprises a device for aligning the given sequences with respect to a common spatial event in the given sequences.

14. The system according to claim 11, wherein the device for synchronizing comprises a device for aligning the given sequences with respect to two events in each of the given sequences, and a device for time scaling to equalize the time between the events.

15. The system according to claim 11, wherein the given video sequences have camera parameters including camera location and focal length, and wherein the device for forming the composite sequence comprises a device for selecting the focal plane corresponding to the focal plane of one of the given sequences, and a device for forming the composite sequence as viewed from the camera location of the one of the given sequences.

16. The system according to claim 11, wherein the device for forming the composite sequence comprises a device for processing the given sequences on a frame-by-frame basis.

17. The system according to claim 11, wherein the device for forming the composite sequence comprises a device for using several frames of at least one of the sequences, for an expanded field of view in the composite sequence as compared with one of the sequences.

18. The system according to claim 11, wherein the given video sequences are of a sports event.

19. The system according to claim 18, wherein the sports event is a ski race.

20.
The system according to claim 18, wherein the sports event is a car race.

21. A system for generating a composite video sequence from a plurality of given video sequences, comprising: (a) a device for synchronizing the given sequences; and (b) a processor instructed to form the composite sequence from the synchronized sequences as projected into a chosen focal plane.

22. A method for generating a composite image, comprising: (a) synchronizing a plurality of video sequences; and (b) forming the composite image from corresponding frames of the synchronized sequences as projected into a chosen focal plane.

23. A broadcast service comprising: (a) synchronizing a plurality of video sequences; (b) forming a composite sequence from the synchronized sequences as projected into a chosen focal plane; and (c) broadcasting the composite sequence.

24. A method for generating a composite video sequence from a given video sequence and a given audio sequence, comprising: (a) synchronizing the given video sequence and the given audio sequence for synchronism between visual features of the video sequence and audio features of the audio sequence; and (b) forming the composite sequence from the synchronized sequences, having a video portion corresponding to the given video sequence and an audio portion corresponding to the given audio sequence.

25. A system for generating a composite video sequence from a given video sequence and a given audio sequence, comprising: (a) a device for synchronizing the given video sequence and the given audio sequence for synchronism between visual features of the video sequence and audio features of the audio sequence; and (b) a device for forming the composite sequence from the synchronized sequences, having a video portion corresponding to the given video sequence and an audio portion corresponding to the given audio sequence.
26. A system for generating a composite video sequence from a given video sequence and a given audio sequence, comprising: (a) a device for synchronizing the given video sequence and the given audio sequence for synchronism between visual features of the video sequence and audio features of the audio sequence; and (b) a processor instructed to form the composite sequence from the synchronized sequences, having a video portion corresponding to the given video sequence and an audio portion corresponding to the given audio sequence.

27. A method for determining a time differential between two contestants at a specified location in a race, comprising: forming a composite video sequence from synchronized given video sequences of the contestants, projected into a chosen focal plane; and counting the number of frames between the two contestants passing the location.

28. The method according to claim 1, wherein the given video sequences have biomedical relevance.

29. The method according to claim 28, wherein the biomedical relevance comprises relevance to the movement of a limb of a patient.

30. The method according to claim 1, wherein the given video sequences comprise car crash test sequences.

31. The method according to claim 30, wherein the car crash test sequences comprise images of cars under test.

32. The method according to claim 30, wherein the car crash test sequences comprise images of dummies in cars under test.
MXPA/A/2000/007221A 1998-01-16 2000-07-24 Method and system for combining video sequences with spacio-temporal alignment MXPA00007221A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US09013772 1998-01-16

Publications (1)

Publication Number Publication Date
MXPA00007221A true MXPA00007221A (en) 2002-06-05


Similar Documents

Publication Publication Date Title
EP1055324B1 (en) Method and system for combining video sequences with spacio-temporal alignment
US7843510B1 (en) Method and system for combining video sequences with spatio-temporal alignment
US8675021B2 (en) Coordination and combination of video sequences with spatial and temporal normalization
EP1287518B1 (en) Automated stroboscoping of video sequences
US6072504A (en) Method and apparatus for tracking, storing, and synthesizing an animated version of object motion
CN105939481A (en) Interactive three-dimensional virtual reality video program recorded broadcast and live broadcast method
US7173672B2 (en) System and method for transitioning between real images and virtual images
EP1128668A2 (en) Methods and apparatus for enhancement of live events broadcasts by superimposing animation, based on real events
JP4189900B2 (en) Event related information insertion method and apparatus
EP1449357A1 (en) Method and system for combining video with spatio-temporal alignment
CN114302234B (en) Quick packaging method for air skills
MXPA00007221A (en) Method and system for combining video sequences with spacio-temporal alignment
KR20180021623A (en) System and method for providing virtual reality content
KR102378738B1 (en) System for broadcasting real-time streaming sportainment baseball league in dome baseball stadium
AU2003268578B2 (en) Method and System for Combining Video Sequences With Spatio-temporal Alignment
WO2005006773A1 (en) Method and system for combining video sequences with spatio-temporal alignment
Zhu et al. “Can You Move It?”: The Design and Evaluation of Moving VR Shots in Sport Broadcast
Shook et al. Visual Grammar
JP2003522444A (en) 3D image presentation method
CN108926829A (en) The outdoor scene sport video play system and control method that a kind of and movement velocity matches
Case Skiing Ghosts