EP1834475A2 - Video-telephony terminal with intuitive adjustments

Info

Publication number: EP1834475A2
Application number: EP05850555A
Authority: EP
Inventors: Alexis Martin; Jean-Jacques Damlamian; Roland Airiau
Original assignee: France Telecom SA
Current assignee: Orange SA
Priority date: 2005-01-07
Filing date: 2005-12-20
Publication date: 2007-09-19
Also published as: US8264522B2; FR2880762A1; US20080246830A1; WO2006075063A8; WO2006075063A1

Abstract

The invention relates to an intuitive adjustment of the framing of a terminal (100) by a remote correspondent using a mobile terminal (200) during video-telephonic communication. The terminal (100) comprises a camera (103), a framing means (105), a video encoding means (106) and a communication and multiplexing means (113). The terminal (200) comprises movement sensors (208-210), shaping means (211-212) and a communication and multiplexing means (213). In response to a movement of the terminal (200), it produces framing data DC from movement information Δx, Δy and Δz from the movement sensors (208-210) and sends said framing data DC to the terminal (100). The framing means (105) extracts image portions from images captured by the camera (103) in response to framing information IC corresponding to the framing data DC.

Description

VIDEOPHONE TERMINAL WITH INTUITIVE ADJUSTMENTS

The present invention relates to high data rate mobile phones capable of capturing and viewing images for video telephony communications. These phones are also called videophone terminals.

More particularly, the invention relates to the adjustment of the image capture means.

The arrival of broadband in telecommunications makes it possible to provide the general public with videophone services. More particularly, the third generation of radio networks, such as UMTS (Universal Mobile Telecommunications System), allows videophone applications on mobile phones or terminals.

Videotelephony allows two people to communicate remotely while seeing each other. For this purpose, the terminal of each person has a display screen and a camera. Portable terminals have a small, low-resolution screen, so a well-framed close-up is important if a user wants to see the features of his interlocutor.

A disadvantage comes from the fact that an interlocutor B communicating with an interlocutor A sees only what the camera of A transmits. Since interlocutor B has no control over the framing, it is interlocutor A who must take care to control the shooting of his own camera. This control can be achieved using a control thumbnail displayed in a corner of his screen. Each interlocutor must then ensure that his image remains in the center of the control thumbnail. Such a framing system is impractical for several reasons. It reduces the useful area of the terminal's display screen, which is already small. Each interlocutor must pay close attention to his own framing. The framing movements are not natural, because the thumbnail shows the filmed image with right and left reversed.

For fixed videoconferencing systems, it is known to use remote control of the camera. Thus, interlocutor A can adjust the camera of B and vice versa. Each user has a remote control that allows him to send various zoom and movement commands. Such a system cannot be implemented on a portable terminal because the cameras of portable terminals are generally not movable and, in addition, it would require using the keys of the terminal's keypad during the communication. The keys of a portable terminal are small, and it is impractical to use them while holding the terminal in a given framing direction.

EP-A-1 304 853 discloses a portable apparatus such as a mobile phone provided with a camera and motion sensors. The camera is used to take multiple images of an object, and these images are then combined using synchronized motion information provided by the motion sensors to realign the images to be combined. This concept does not involve two remote videophone terminals.

The invention proposes to remedy the framing problems mentioned above. Each portable terminal is equipped with a camera that can have a resolution greater than the resolution of the transmitted image. The framing of the transmitted image is done using framing information from a remote interlocutor. Motion sensors are placed in each portable terminal to retrieve motion information from said terminal. The motion information is then transformed into framing commands to be sent to the other terminal.

According to a first aspect, the invention proposes a portable videophone terminal comprising communication means, motion sensors and shaping means. The communication means make it possible to communicate with another videophone terminal via a radio communication network. The motion sensors produce information representative of movements of the terminal. The shaping means make it possible to transform the movement information into outgoing framing data for the other terminal. The communication means are arranged to insert the outgoing framing data into data to be transmitted over the radio network to the other terminal.

The shaping means include filtering and control generating means for comparing the motion information with a minimum motion threshold and a maximum motion threshold. The outgoing framing data is produced in response to the detection of a movement between the minimum motion threshold and the maximum motion threshold.

According to a second aspect, the invention proposes a portable videophone terminal comprising communication means, a camera, a framing means and a video encoding means. The communication means make it possible to communicate with another videophone terminal via a radio communication network. The camera captures images of a first size. The framing means extracts an image portion from an image captured by the camera. The framing means selects the image portion based on remote framing information from the other terminal, said image portion having a second size smaller than the first size. The video encoding means transforms a stream of image portions from the framing means into outgoing video data. The communication means are arranged to extract the remote framing information from data received over the radio network from the other terminal.

Thus, an intuitive movement of the terminal according to the first aspect allows a user A to reframe the image filmed by the terminal, according to the second aspect, of his correspondent B. Preferably, the two aspects are implemented on the same terminal.

According to a third aspect, the invention relates to a method of reframing an image taken by a camera of a first portable videophone terminal with the aid of a second portable videophone terminal equipped with a screen and motion sensors, during a videophone call. The method comprises a step of generating framing data, in response to a movement of the second portable terminal, from motion information from the motion sensors, and a step of sending said framing data to the first portable terminal.

In response to framing information received by the first terminal and corresponding to the framing data sent by the second terminal, the first portable terminal extracts image portions corresponding to said framing information from images captured by its camera and produces a video sequence representative of a succession of image portions.

According to a last aspect, the invention relates to a signal carrying a stream of videophone frames between a first portable terminal and a second portable terminal. At least one frame sent by the first terminal includes audio data, video data, and framing data. The framing data indicates the position and/or the displacement of an image portion taken by a camera of the second terminal. Said image portion corresponds to an image to be sent from the second terminal to the first terminal.

The invention will be better understood, and other features and advantages will become apparent, on reading the description which follows, the description referring to the appended figures, among which: FIG. 1 is a block diagram showing two portable terminals in communication according to the invention, FIG. 2 illustrates framing changes as a function of a displacement of the terminal, FIG. 3 shows a transfer characteristic between movement information and a framing command, and FIG. 4 shows an operating flow diagram for the terminal performing image capture according to the invention.

FIG. 1 represents two portable terminals 100 and 200 communicating via a radiotelephone or radiocommunication network 300. The radiotelephone network 300 is a third-generation radiocommunication network, for example in accordance with the UMTS standard. A third-generation radiotelephone network is understood here to mean a radiotelephone network comprising a high-speed radio network for exchanging audio, video or other data between a mobile terminal and the network.

The present description is concerned with the management of the framing during a videophone communication between two terminals. Only the means implemented for the framing are detailed. The other constituent elements of the terminals and the network are well known to those skilled in the art.

To simplify the description, the two terminals 100 and 200 are identical. Similar references 1xx and 2xx are used to describe similar elements, the hundreds digit distinguishing the terminal. Thus, what is described with reference to the elements of the terminal 100 is applicable to the terminal 200 and vice versa.

The first terminal 100, of the mobile phone type, comprises:

- a microphone 101 for capturing sound,

- a loudspeaker 102 for reproducing sound,

- a camera 103 for capturing images,

- a display screen 104 for reproducing images,

- a framing means 105 connected to the camera 103 for extracting an image portion from an image captured by the camera 103, the framing means 105 selecting the image portion in accordance with framing information Ic from another remote terminal,

- an audio and video encoding means 106 connected to the microphone 101 and to the framing means 105 for transforming the sound captured by the microphone 101 into outgoing audio data and a stream of image portions coming from the framing means 105 into outgoing video data, the outgoing video data being for example a video sequence compressed according to an image compression algorithm,

- an audio and video decoding means 107 connected to the loudspeaker 102 and to the display screen 104 for transforming incoming audio data into a driving signal for the loudspeaker 102, and incoming video data into an image signal to be reproduced on the display screen 104,

- motion sensors 108 to 110 for producing movement information of the terminal,

- a filtering means 111 connected to the motion sensors 108 to 110 for filtering the motion information,

- a control generating means 112 connected to the filtering means 111, which transforms the filtered motion information into outgoing framing data Dc intended for another terminal,

- a multiplexing and communication means 113 connected to the encoding means 106, the control generating means 112, the decoding means 107 and the framing means 105 for, on the one hand, grouping audio, video and framing data Dc to be sent in data packets to the network and, on the other hand, receiving data packets and separating them into audio data, video data and framing information Ic, and

- an antenna 114 connected to the multiplexing and communication means 113 to exchange with the network 300 radio signals representative of the data packets transmitted and received by the terminal 100.

Conventionally, to maximize the integration of the components of a portable terminal, it mainly comprises a central processor, a signal processing processor and possibly an image processing processor. These three processors are used in a microprogrammed way to process all data and signals in digital form. Thus, the means 105-107 and 111-112 functionally described in FIG. 1 can be realized in practice by programming these processors. Analog / digital and digital / analog converters provide the link between the processors and the various elements 101-104 and 108-110 to which they are connected. The multiplexing and communication means 113 is also realized using the processors of the terminal but it also includes a radio interface connected to the antenna 114.

During a videophone communication, the audio and video decoding means 107 receives the audio and video data from the terminal 200 and transforms them into control signals for the loudspeaker 102 and the screen 104 in order to reproduce, for a user A of the terminal 100, the sound captured by the microphone 202 and the image captured by the camera 203 of the terminal 200 of his correspondent B. The screen 104 is for example of the LCD (Liquid Crystal Display) type or the OLED (Organic Light-Emitting Diode) type and is small, for example with a diagonal of less than two inches. The resolution of this screen 104 is for example less than 200 pixels per side. The user A of the terminal 100 can thus have a satisfactory image enabling him to make out the expressions of a face in close-up. However, such a resolution does not make it possible to distinguish these same details in a wider shot.

The camera 103 of the terminal 100 comprises a wide-angle lens and an image sensor, for example of the CCD (Charge-Coupled Device) type. The camera 103 is used for video telephony but also for other applications, including digital photography. According to a known technique, the photos taken by the portable terminal 100 can be sent through the network 300 to a computer. Thus, the camera 103 generally has a resolution greater than the resolution of the screen 104. By way of example, the resolution of the camera 103 is at least 640 × 480 pixels in order to give a minimum viewing quality on a computer screen.

During a videophone communication, the framing means 105 serves to extract, from each image taken by the camera 103, an image portion whose shape and resolution correspond to the screen 204 of the terminal 200. This shape and resolution are transmitted when the communication is initialized. The framing means also comprises means for oversampling and/or subsampling the image, making it possible to perform an electronic zoom function according to a known technique. The electronic zoom makes it possible to transform a chosen image portion of any size in order to adapt it to the size of the screen 204 of the terminal 200 of the correspondent B.
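The extraction and electronic zoom described above can be sketched as a crop followed by a resampling step. This is only an illustrative sketch: the function name `extract_portion`, the representation of an image as a list of pixel rows, and the choice of nearest-neighbour resampling are assumptions, since the text only refers to "a known technique".

```python
def extract_portion(image, x, y, w, h, out_w, out_h):
    """Crop the w x h window whose top-left corner is (x, y), then resample
    it to out_w x out_h pixels (the size of the remote screen).

    Nearest-neighbour sampling is an assumption made for brevity; it covers
    both subsampling (zoom out) and oversampling (zoom in).
    """
    return [
        [image[y + (r * h) // out_h][x + (c * w) // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


# A tiny 4 x 4 "captured image" whose pixel value encodes its position.
captured = [[row * 4 + col for col in range(4)] for row in range(4)]

# Whole image subsampled to 2 x 2 (zoom out).
wide = extract_portion(captured, 0, 0, 4, 4, 2, 2)

# Central 2 x 2 window sent at native resolution.
close = extract_portion(captured, 1, 1, 2, 2, 2, 2)
```

In a real terminal the resampling would normally use a filtered interpolation rather than nearest-neighbour sampling; the sketch only shows how a window of any size maps onto a fixed output size.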

The framing means 105 includes an image stabilizer capable of compensating for low-amplitude displacements of the image which correspond to possible tremors of a user. The image stabilizer, of a known type, is for example capable of detecting any global and uniform movement between two images coming from the camera and of carrying out a corresponding translation, in number of pixels, in the direction opposite to the displacement. Stabilization of the image is preferably done before the extraction of the image portion.

In the invention, the framing means 105 further moves the image portion as a function of framing information Ic from the multiplexing means 113. The framing information Ic received by one of the terminals 100, respectively 200, corresponds to the framing data Dc transmitted by the other of the terminals 200, respectively 100.

According to the invention, the framing is performed remotely by the correspondent, who is best placed to adjust the displayed image. The invention provides an intuitive control of remote framing. When the user A of the terminal 100 sees the user B of the terminal 200 leave his screen 104, the most intuitive gesture is to move his hand-held screen 104 to follow the movement of the user of the terminal 200. Likewise, when a detail is particularly eye-catching and a user wants to see it more closely, he naturally brings his screen closer. The proposed intuitive control consists in moving the terminal in a natural direction to move the image portion seen on the screen.

Figure 2 shows the various framing changes as a function of a displacement of the terminal. FIG. 2a shows an image 400 captured by the camera 203 of the terminal 200 and an image portion 401 transmitted and seen on the screen 104 of the terminal 100. The user of the terminal 100 wishes to shift the framing of the image portion 401 to obtain a differently framed image portion 402. The user moves his terminal 100 by a movement Δm in the desired direction. This movement Δm can be decomposed into a movement Δx along a first axis parallel to a first side of the screen 104 and a movement Δy along a second axis parallel to a second side of the screen 104. The movement is applied proportionally to the position of the frame of the image portion.

FIG. 2b shows an image 400 captured by the camera 203 of the terminal 200 and an image portion 403 transmitted and seen on the screen 104 of the terminal 100. The user of the terminal 100 wishes to have a wider view corresponding to the image portion 404. The user then moves his terminal 100 so as to move the screen 104 away by a distancing movement Δz, which causes a widening of the frame. Subsampling is then performed to adapt the image portion to the size of the screen 104, which corresponds to a zoom out.

The framing data Dc are produced by the control generating means 112 as a function of motion information from the motion sensors 108 to 110 after filtering in the filtering means 111.

The motion sensors 108 to 110 consist, for example, of accelerometers, gyroscopes and/or electronic magnetometers capable of providing information relating to displacement and acceleration in translation and in rotation along three perpendicular axes.

Preferably, two of the three axes are respectively parallel to the sides of the screen 104 and the third axis is perpendicular to the screen 104. The translation and rotation movements are combined with each other to obtain motion information Δx, Δy and Δz representative of a relative displacement of the screen 104 for a predetermined duration along one of the three axes.

The predetermined duration corresponds, for example, to a sampling period of the movement information Δx, Δy and Δz. The filtering means 111 then filters the motion information Δx, Δy and Δz. The motion information Δx, Δy and Δz is representative of an amplitude and a speed of displacement. To avoid taking into account movements related to tremors (low-amplitude movements) or to fatigue (slow movements) of the user, only movement information greater, in absolute value, than a minimum threshold Sm should be taken into account.

Moreover, the user may move during the videophone conversation without wanting to change the framing. For this reason, movement information of large amplitude, that is to say of amplitude greater, in absolute value, than a maximum threshold SM, should not be taken into account. The filter has the following transfer function for the motion information Δx, Δy and Δz:

Δuf = Δu if Sm < |Δu| < SM, and

Δuf = 0 if |Δu| < Sm or if |Δu| > SM,

with u replacing x, y or z, and Δxf, Δyf and Δzf corresponding to the filtered motion information.

The control generating means 112 transforms the filtered movement information Δxf, Δyf and Δzf into the framing data Dc. The framing data Dc can take different forms. For example, two forms of data are described below.
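The transfer function of the filtering means 111 can be illustrated with a few lines of code. This is a minimal sketch: the names `filter_motion`, `filter_motion_xyz`, `s_min` and `s_max` are hypothetical, and treating a value exactly equal to a threshold as rejected is an assumption, since the description does not specify that boundary case.

```python
def filter_motion(delta, s_min, s_max):
    """Pass delta through unchanged when s_min < |delta| < s_max, else return 0.

    Values below s_min are treated as tremor or slow drift, values above
    s_max as the user moving around; both are cancelled. Values exactly
    equal to a threshold are also cancelled (an assumption of this sketch).
    """
    return delta if s_min < abs(delta) < s_max else 0.0


def filter_motion_xyz(dx, dy, dz, s_min, s_max):
    """Apply the same filter independently to the three axes."""
    return tuple(filter_motion(d, s_min, s_max) for d in (dx, dy, dz))


# Example with arbitrary units: x is a tremor, y is a deliberate gesture,
# z is a large displacement of the user.
filtered = filter_motion_xyz(0.2, -4.0, 15.0, s_min=1.0, s_max=10.0)
```

Only the y component survives in this example, so only a vertical reframing would be requested.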

A first form of the framing data corresponds to framing commands. The commands consist of three data items representative of the modification of the framing.

The filtered information Δxf and Δyf is quantized to transform the motion into a number (positive or negative) of pixels by which the frame defining the image portion is displaced. The filtered information Δzf is quantized to indicate the number of pixels by which the frame defining the image portion is enlarged or reduced. The position and the dimension of the frame of the image portion are then managed by the framing means 205 of the other terminal 200.
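How a framing means might apply such a command to its frame can be sketched as follows. Everything named here (`Frame`, `apply_command`, the parameters) is hypothetical. The sketch also assumes that the size change is applied to the width, that the aspect ratio is preserved, that the frame keeps its centre when resized, and that the frame is clamped to the captured image; the description does not fix these details.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    x: int  # left edge, in pixels of the captured image
    y: int  # top edge
    w: int  # width
    h: int  # height


def apply_command(frame, dx, dy, dz, img_w, img_h):
    """Return the frame moved by (dx, dy) pixels and widened by dz pixels."""
    aspect = frame.h / frame.w
    new_w = min(max(frame.w + dz, 1), img_w)
    new_h = round(new_w * aspect)
    if new_h > img_h:  # keep the frame inside the captured image
        new_h = img_h
        new_w = round(new_h / aspect)
    # Resize around the current centre, then apply the displacement.
    cx = frame.x + frame.w / 2 + dx
    cy = frame.y + frame.h / 2 + dy
    x = min(max(round(cx - new_w / 2), 0), img_w - new_w)
    y = min(max(round(cy - new_h / 2), 0), img_h - new_h)
    return Frame(x, y, new_w, new_h)


start = Frame(240, 180, 160, 120)  # centred in a 640 x 480 capture
moved = apply_command(start, 10, -5, 0, 640, 480)
widened = apply_command(start, 0, 0, 40, 640, 480)
```

The clamping step reflects the situation in which the frame reaches an edge of the captured image and can no longer move in that direction.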

An example of a transfer characteristic between an item of movement information and the corresponding framing data is shown in FIG. 3. The abscissa axis corresponds to the possible values of an item of motion information, for example Δx. Three inoperative zones 410 to 412 correspond to the cancellation of the movement information produced by the filtering means 111. Two quantization zones 413 and 414 correspond to the movement command for the frame defining the image portion. For example, a value of 1 displacement pixel can be assigned when the motion information corresponds to the minimum threshold Sm and a value of 20 displacement pixels when the motion information corresponds to the maximum threshold SM. In a simple embodiment, a linear interpolation is performed between the two thresholds Sm and SM, but a different curve could be used.
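The linear characteristic between the two thresholds can be sketched as below. The function name `quantize`, its parameters and the rounding to the nearest integer are assumptions; the values of 1 and 20 pixels are the example values given in the description.

```python
def quantize(delta_f, s_min, s_max, px_min=1, px_max=20):
    """Map filtered movement information to a signed number of pixels.

    A filtered value of 0 (inoperative zone) gives 0 pixels. Otherwise the
    magnitude is mapped linearly from [s_min, s_max] onto [px_min, px_max]
    and the sign of the movement is preserved.
    """
    if delta_f == 0:
        return 0
    t = (abs(delta_f) - s_min) / (s_max - s_min)
    t = min(max(t, 0.0), 1.0)  # guard against values outside the thresholds
    pixels = round(px_min + t * (px_max - px_min))
    return pixels if delta_f > 0 else -pixels


# With thresholds of 1 and 10 (arbitrary units):
small = quantize(1.0, 1.0, 10.0)    # movement at the minimum threshold
large = quantize(-10.0, 1.0, 10.0)  # movement at the maximum threshold, negative direction
```

A non-linear curve, as the description allows, would only change the expression that computes `pixels`.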

A second form of the framing data may consist of a position and a frame size of a selected image portion. In this case, the modifications of the frame are carried out by the control generating means 112. The framing means 205 only performs the selection of the image portion, optionally accompanied by a zoom effect.

Whatever the form of the framing data Dc, these data are provided to the multiplexing and communication means 113. The multiplexing and communication means 113 builds frames of data to be sent, each combining a set of data intended for the terminal 200. The frame comprises audio data and video data relating to the videophone communication, but also the framing data Dc. The frame is then packaged with service data identifying the frame and its destination. The data packet thus formed is transformed into a burst which is inserted in a radio signal consisting of a stream of frames sent to the network 300. The network 300 then routes the different frames to send them to the terminal 200 in the form of a radio signal.
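The grouping of audio, video and framing data into one frame, and the separation performed on reception, can be illustrated with a toy byte layout. The layout below is purely hypothetical (two length fields followed by three signed 16-bit framing values); the description does not define a frame format, and a real terminal would rely on the multiplexing protocol of its videophone stack. The service data, burst formation and radio transmission are not modelled.

```python
import struct

# Hypothetical header: audio length, video length, then dx, dy, dz as
# signed 16-bit integers, all in network byte order.
_HEADER = struct.Struct("!HHhhh")


def pack_frame(audio, video, dc):
    """Group audio bytes, video bytes and framing data Dc into one frame."""
    dx, dy, dz = dc
    return _HEADER.pack(len(audio), len(video), dx, dy, dz) + audio + video


def unpack_frame(frame):
    """Separate a received frame into audio, video and framing information Ic."""
    a_len, v_len, dx, dy, dz = _HEADER.unpack_from(frame)
    start = _HEADER.size
    audio = frame[start:start + a_len]
    video = frame[start + a_len:start + a_len + v_len]
    return audio, video, (dx, dy, dz)


sent = pack_frame(b"audio-bytes", b"video-bytes", (10, -5, 0))
audio, video, ic = unpack_frame(sent)
```

Here the framing information `ic` recovered by the receiving terminal equals the framing data passed to `pack_frame` by the sending terminal.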

The framing means 105 has an image stabilizer which compensates for the movements of the camera when they are of low amplitude. This compensation prevents the movements made to reframe the image of the interlocutor from in turn causing a need for mutual reframing. If the maximum threshold SM corresponds to a displacement that can be compensated by the image stabilizer, the movement made to modify the framing of the other terminal 200 is automatically compensated. If the image stabilizer does not have the capacity to compensate for the movement related to a reframing displacement, it is possible to connect the framing means 105 to the filtering means 111 so that the actual movement of the terminal 100 is taken into account to move its framing window. The reframing performed in the terminal 100 is of the same nature as the reframing of the other terminal 200, but of different amplitude and sign.

With the means described above, it is possible to reframe the images filmed by the camera 103 of the terminal 100 from the terminal 200 and vice versa. The reframing is done intuitively in response to a movement of the terminal 200, which leads to the generation of framing data by the motion sensors 208 to 210, the filtering means 211 and the control generating means 212. Thus the video sequence generated by the terminal 100 is controlled by the user B of the terminal 200 during a videophone communication.

However, such reframing is useful for a dialogue in which the two users are face to face and/or quasi-immobile. When one of the users moves his terminal considerably, for example to show something to his correspondent, the reframing becomes superfluous. In addition, if the reframing is applied continuously during a long videophone conversation in which significant movements have been made, the frame delimiting the image portion may end up on an edge of the image filmed by the camera, and reframing can become impossible in one direction.

According to a first improvement, the filmed image is reframed using the framing means 105 of the terminal 100 if the terminal 100 is quasi-immobile. Detection of the quasi-immobility of the terminal 100 can be done using the image stabilizer, which detects a homogeneous low-amplitude motion of the image. It is also possible to detect the movement of the terminal 100 using the motion sensors 108 to 110. Thus, if the movements are, for example, below the maximum threshold SM, the terminal can be considered quasi-immobile.

According to a second improvement, the framing means 105 comprises shape recognition means able to recognize the shape of a face. The shape recognition is done, for example, using a known technique for identifying that a face is present in the image. Thus, the terminal 100 can establish that the user is in dialogue if a face is detected. The framing means then takes the framing information into account if a face is detected. This allows remote framing to be controlled even if the background is moving or the terminal is moving.

Preferably, the two improvements are combined. The flow chart of FIG. 4 illustrates the implementation of these two improvements in the framing means 105. During a videophone communication initialization step 420, the framing means 105 receives the characteristics of the screen 204 of the correspondent terminal 200 in order to determine the format of the image portion to be sent.

Once the communication has been initialized, the framing means 105 operates in free-image mode (step 421). During this step 421, the framing means extracts an image portion, for example a centered one, independently of any received framing information Ic.

On a regular basis, a test 422 is performed to determine whether conditions make it possible to switch to the remote control mode of the image. The test 422 consists, for example, in verifying whether the image can be considered immobile or quasi-immobile, or whether a face is present in the filmed image. If one of the two conditions is met, the process proceeds to step 423. If neither condition is met, a test 424 is performed.

The test 424 checks whether the videophone communication is finished. If the communication is not finished, the process returns to step 421 and the framing means operates in free-image mode.

Step 423 corresponds to the operation of the framing means 105 taking into account the framing information Ic. The image portion is moved, enlarged or narrowed according to the framing data received. The framing means performs, if necessary, an oversampling or a subsampling of the framed image portion to match the resolution of the screen.

On a regular basis, a test 425 is performed to determine whether conditions still allow the remote control mode of the image. The test 425 consists, for example, in checking whether the image can be considered immobile or quasi-immobile, or whether a face is present in the filmed image. If one of the two conditions is met, the process returns to step 423. If neither condition is met, a test 426 is performed.

The test 426 checks whether the videophone communication is finished. If the communication is not finished, the image portion is reinitialized, the process returns to step 421 and the framing means operates in free-image mode.
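The flow chart of FIG. 4, as described in steps 421 to 426, can be summarized as a small state machine. The sketch below is an interpretation under stated assumptions: the names `FREE`, `REMOTE`, `ENDED` and `next_mode` are hypothetical, and the three boolean inputs stand for the results of the quasi-immobility check, the face detection and the end-of-communication check.

```python
FREE = "free"      # step 421: centred portion, framing information ignored
REMOTE = "remote"  # step 423: framing information Ic taken into account
ENDED = "ended"    # communication finished


def next_mode(mode, quasi_immobile, face_detected, call_ended):
    """Return the mode that follows one round of periodic tests.

    Tests 422 and 425 are the same condition, and tests 424 and 426 both
    check the end of the communication, so the transition out of FREE and
    out of REMOTE is identical. As in the description, the end test is
    reached only when neither condition holds. On a REMOTE -> FREE
    transition the caller would also reinitialize the image portion.
    """
    if mode == ENDED:
        return ENDED
    if quasi_immobile or face_detected:  # tests 422 / 425
        return REMOTE
    if call_ended:                       # tests 424 / 426
        return ENDED
    return FREE


# A short scenario: the user settles, then walks around, then hangs up.
mode = FREE
history = []
for tests in [(True, False, False), (False, True, False),
              (False, False, False), (False, False, True)]:
    mode = next_mode(mode, *tests)
    history.append(mode)
```

The scenario moves from free-image mode to remote control, stays there while a face is detected, falls back to free-image mode and finally ends.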

The described invention can be broken down into different variants. In the example described, the terminals 100 and 200 are identical and both capable of transmitting framing information and of receiving and taking into account framing information. However, those skilled in the art will understand that a terminal could develop and output framing data without receiving framing information. Reciprocally, a terminal can take into account framing information without itself elaborating framing data. The terminals implementing the invention can therefore be limited to the means necessary for the implementation of the invention without reproducing all the means included in the examples described.

Claims

1. A portable video telephony terminal (100, 200) comprising:

- communication means (113, 213) for communicating with another video telephony terminal (200, 100) via a radio communication network (300),

- motion sensors (108-110, 208-210) for producing information (Δx, Δy, Δz) representative of movements of the terminal, and

- shaping means (111, 112, 211, 212) for transforming the movement information into outgoing framing data (Dc) for the other terminal, the communication means (113, 213) being arranged to insert the outgoing framing data (Dc) into data to be transmitted over the radio network to the other terminal.

2. Terminal according to claim 1, further comprising a display for displaying an image taken by a camera of the other videophone terminal, the framing data being adapted to control the framing of the image taken by said camera.

3. Terminal according to claim 1 or 2, wherein the motion sensors (108-110, 208-210) comprise gyroscopes and/or accelerometers and/or magnetometers.

4. Terminal according to any one of claims 1 to 3, wherein the shaping means (111, 112, 211, 212) are arranged to compare the motion information with a minimum movement threshold (Sm) and a maximum movement threshold (SM), and wherein the outgoing framing data (Dc) is produced in response to the detection of a movement between the minimum movement threshold (Sm) and the maximum movement threshold (SM).

5. Terminal according to any one of claims 1 to 4, further comprising:

- a camera (103, 203) for capturing images having a first size,

- a framing means (105, 205) for extracting an image portion (401-404) from an image (400) captured by the camera (103, 203), the framing means (105, 205) selecting the image portion based on remote framing information (Ic) from the other terminal, said image portion having a second size smaller than the first size,

- video encoding means (106, 206) for transforming a stream of image portions from the framing means (105, 205) into outgoing video data, and the communication means (113, 213) being arranged to extract the remote framing information (Ic) from data received over the radio network (300) from the other terminal (200, 100).

6. Portable videophone terminal (100, 200) comprising:

a camera (103, 203) for capturing images having a first size,

a framing means (105, 205) for extracting an image portion (401-404) from an image (400) captured by the camera (103, 203), the framing means (105, 205) selecting the image portion according to remote framing information (Ic) from the other terminal, said image portion having a second size smaller than the first size,

video encoding means (106, 206) for transforming a stream of image portions from the framing means (105, 205) into outgoing video data, and the communication means (113, 213) being arranged to extract the remote framing information (Ic) from data received over the radio network (300) from the other terminal (200, 100).

7. Terminal according to one of claims 5 or 6, wherein the framing means (105, 205) is arranged to take into account remote framing information (Ic) if the captured image is considered immobile or quasi-immobile.

8. Portable terminal according to one of claims 5 to 7, wherein the framing means (105, 205) comprises shape recognition means adapted to recognize a face, and wherein the framing means (105, 205) is arranged to take into account remote framing information (Ic) if a face is detected.

9. A method of reframing an image taken by a camera (103) of a first portable videophone terminal (100) with a second portable videophone terminal (200) having a screen (204) and motion sensors (208-210) during a videophone communication, the method comprising a step of generating framing data (Dc), in response to a movement of the second portable terminal (200), from motion information (Δx, Δy, Δz) from the motion sensors (208-210), and a step of sending said framing data (Dc) to the first portable terminal (100).

10. The method of claim 9, wherein the second portable terminal (200) generates the framing data (Dc) if the motion information (Δx, Δy, Δz) is greater than a minimum motion threshold (Sm) and if this motion information is less than a maximum motion threshold (SM).

11. Method according to one of claims 9 or 10, wherein, in response to framing information (Ic) received by the first terminal (100) and corresponding to the framing data (Dc) sent by the second terminal (200), the first portable terminal (100) extracts image portions corresponding to said framing information (Ic) from images captured by its camera (103) and produces a video sequence representative of a succession of image portions.

12. The method of claim 11, wherein the framing information (Ic) is taken into account if the image taken by the camera of the first portable terminal (100) moves slightly or if a face is detected in the filmed image.

13. Signal conveying a stream of videophone frames between a first portable terminal (100) and a second portable terminal (200), characterized in that at least one frame sent by the first terminal (100) comprises audio data, video data and framing data (Dc), the framing data indicating the position and/or displacement of an image portion taken by a camera (203) of the second terminal (200), said image portion corresponding to an image to be sent from the second terminal (200) to the first terminal (100).