EP1446956A2 - Method and device for producing acoustic signals from a video data flow - Google Patents

Method and device for producing acoustic signals from a video data flow

Info

Publication number
EP1446956A2
EP1446956A2 (application EP02790332A)
Authority
EP
European Patent Office
Prior art keywords
vector field
motion vector
vfd
chs
img
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP02790332A
Other languages
German (de)
French (fr)
Inventor
Markus Simon
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Siemens AG
Original Assignee
Siemens AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siemens AG filed Critical Siemens AG
Priority to EP02790332A priority Critical patent/EP1446956A2/en
Publication of EP1446956A2 publication Critical patent/EP1446956A2/en
Withdrawn legal-status Critical Current

Links

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440236Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by media transcoding, e.g. video is transformed into a slideshow of still pictures, audio is converted into text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51Motion estimation or motion compensation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/236Assembling of a multiplex stream, e.g. transport stream, by combining a video stream with other content or additional data, e.g. inserting a URL [Uniform Resource Locator] into a video stream, multiplexing software data into a video stream; Remultiplexing of multiplex streams; Insertion of stuffing bits into the multiplex stream, e.g. to obtain a constant bit-rate; Assembling of a packetised elementary stream
    • H04N21/2368Multiplexing of audio and video streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41Structure of client; Structure of client peripherals
    • H04N21/414Specialised client platforms, e.g. receiver in car or embedded in a mobile appliance
    • H04N21/41407Specialised client platforms, e.g. receiver in car or embedded in a mobile appliance embedded in a portable device, e.g. video client on a mobile phone, PDA, laptop
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/434Disassembling of a multiplex stream, e.g. demultiplexing audio and video streams, extraction of additional data from a video stream; Remultiplexing of multiplex streams; Extraction or processing of SI; Disassembling of packetised elementary stream
    • H04N21/4341Demultiplexing of audio and video streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/141Systems for two-way working between two video terminals, e.g. videophone
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/141Systems for two-way working between two video terminals, e.g. videophone
    • H04N7/142Constructional details of the terminal equipment, e.g. arrangements of the camera and the display
    • H04N2007/145Handheld terminals
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/14Picture signal circuitry for video frequency region
    • H04N5/144Movement detection
    • H04N5/145Movement estimation

Definitions

  • the invention relates to the generation of signals - in particular acoustic signals - from a video data stream which contains a chronological sequence of individual images.
  • the conversion of movement information into acoustic signals is known, e.g., in the form of an acoustic motion detector. Such a detector records an image using a video camera and triggers a signal, such as an acoustic alarm, as soon as the image changes.
  • however, such a motion detector delivers an unvarying signal;
  • at best, the acoustic signal can be varied by dividing the examined image area into fields, each of which is assigned a different signal. In cases where, for example, an object moving relative to a background is to be monitored, motion detectors of this type naturally fail.
  • the generation of sounds depending on movements is also used in the fields of the performing arts.
  • the body movements performed by a person are interpreted and used to control sound generation.
  • the person performing can shape the music rhythm and theme through their movements.
  • the interaction between movements and sounds is a creative process that encourages the person to create more and more sound sequences through new movements.
  • the generation of a sound is usually based on the person triggering sensors during their movements, for example in the manner of a motion detector or a light barrier; the signals from the sensors are processed by a data processing system, and sounds associated with the sensors are generated in this way.
  • however, such sound generation relies on quite complex installations that receive their visual input via several recording devices and are, moreover, installed in a fixed position during operation.
  • it is the object of the invention to show a way in which movement information recorded via a camera can be converted into sound patterns, the sounds produced varying with the type of movement, in particular its direction. Such sound generation should, in particular, be able to be integrated as a function of a mobile phone with a built-in camera.
  • This object is achieved according to the invention by a method of the type mentioned at the outset with the following steps: a) determining a motion vector field from the image data of a single image with the aid of the image data of preceding and/or subsequent individual images, b) deriving at least one characteristic variable from the motion vector field, and c) generating an acoustic signal as a function of the characteristic variable(s).
  • a device suitable for the method according to the invention comprises a control device with means for determining a motion vector field from the image data of a single image with the aid of the image data of preceding and/or subsequent individual images, the control device additionally being set up to derive at least one characteristic variable from the motion vector field and to generate an acoustic signal as a function of the at least one characteristic variable.
  • the solution according to the invention provides a generation of sounds or acoustic signals that is based not only on the presence of movements but also on the magnitude and/or direction of the recorded movements. This allows a differentiated shaping of the generated sound. In the case of monitoring, this enables, for example, differentiated monitoring of an image area via acoustic signals, which among other things makes it possible to distinguish different movement sequences.
  • in a preferred embodiment, the solution according to the invention is implemented in a telecommunication terminal, in particular a mobile telephone. Since functions for image processing are often already provided in mobile telephones, the invention can be implemented there in a particularly inexpensive and compact manner.
  • the video data stream can advantageously be generated by a camera device of the terminal.
  • the acoustic signals can be output via a listening device of the terminal or via a telecommunications connection existing with the terminal.
  • in addition, it is advantageous, in particular in the case of a mobile phone, if the motion vector field in step a) is determined by means of an MPEG encoder method known per se.
  • an easy-to-implement evaluation of the motion vector field in step b) consists in deriving a distribution from it, e.g. using statistical methods, and determining statistical parameters for this distribution, from which the at least one characteristic variable is determined.
  • the invention and further advantages are explained below with reference to a non-restrictive exemplary embodiment shown in the accompanying drawings, which show in schematic form: FIG. 1 a front view of a mobile telephone according to the exemplary embodiment;
  • Fig. 2 a rear view of the mobile phone of Fig. 1;
  • Fig. 3 a block diagram of the mobile phone of Fig. 1;
  • Fig. 4 a block diagram of an MPEG encoder;
  • Fig. 5 an example of a vector field obtained from a motion estimation; and
  • Fig. 6 a movement histogram derived from the vector field of Fig. 5.
  • FIGS. 1 and 2 show a mobile phone MOG which, according to the invention, converts movement recorded from the surroundings into acoustic signals ("acoustic kaleidoscope").
  • the features of the telephone MOG that are visible on the housing are, in a known manner, a microphone MIC, a loudspeaker LSP and an input field EIN (e.g. a keyboard) for entering operator commands and telephone numbers, as well as an output DIS in the form of a screen, e.g. an LCD display, on which a video image can be displayed.
  • the video image img comes in particular from a camera module CAM located on the back (FIG. 2), which is used to record images from the environment and feed them to the image data processing according to the invention.
  • FIG. 3 shows the components of the MOG telephone.
  • in a known manner, in addition to the input/output elements LSP, MIC, EIN and the display DIS, there are an antenna ANN and a transmitting/receiving device SEE for performing the telecommunication functions;
  • in addition, a processor PRC is provided as a control device for interpreting the commands entered by the user via the input field EIN and for correspondingly controlling the device SEE.
  • the processor PRC is also set up according to the invention for processing the image data img of the camera module in order to generate sounds snd from them, as described below, with the aid of video coding and motion field analysis.
  • the function for generating sounds snd from the motion information of images img essentially comprises the following processing steps: a) encoding of the image information img by an encoder module ENC, for example by means of an MPEG algorithm, to determine an associated motion vector field vfd, b) analysis of the vector field vfd for predetermined characteristic quantities chs, for example the prevailing motion vector, in an analysis module AAN of the processor system PRC, and c) sound generation based on the quantities chs, for example generation of a sound as a function of the orientation and magnitude of the prevailing motion vector, by a synthesis module SYN, which in the exemplary embodiment shown here is likewise realized in the processor system PRC.
  • in step a), the encoding of the image information takes place, for example, using the known MPEG encoding.
  • this is a standardized method for the compression and transmission of digitized image sequences and is often already implemented in mobile phones with cameras, e.g. for the purpose of video telephony.
  • essential components of the method are the motion estimation BS of successive images (see below), the motion compensation BK and the transformation of the motion-compensated image into frequency space by means of a discrete cosine transformation (DCT), in conjunction with a data reduction DK.
  • the principle of the method can be seen in Fig. 4; for a more detailed description, reference is made to "Digital Signal Processing for Multimedia Systems", Keshab K. Parhi and Takao Nishitani (eds.), Marcel Dekker, Inc., New York, pages 31-37.
  • in Fig. 4, iDCT denotes the transformation inverse to the DCT, and DG the digitization of the image stream ims received as input.
  • for the derivation of acoustic signals according to the invention, it is not so much an image compressed according to the MPEG method that is needed, but rather the result of the motion estimation BS, which is part of the MPEG method. The inputs to the motion estimation BS are the image imn to be evaluated at time tn and the image imp preceding it at time tn-1.
  • the image imp is obtained from the DCT-transformed and motion-compensated signal of the previous image by means of an inverse DCT (iDCT), or temporarily buffered by means of an image memory IS for the duration of an image change.
  • the image imp serves as a reference image and is subdivided into a number of blocks bbl, for example, as shown in Fig. 5, into 36 pixel blocks of 16x16 pixels each. For each of these pixel blocks bbl, the best possible match is sought in a local neighborhood of the image imn to be evaluated, in accordance with the MSE ('Mean Squared Error') method. In this way, the displacement vector is obtained for each block bbl.
  • FIG. 5 shows the image img of FIG. 1 as an example, in which the determined motion vectors v are additionally entered for each block bbl.
  • the picture shows an automobile driving in front of a background; the camera was panned along with the vehicle during the recorded image sequence.
  • for this reason, the motion vectors determined on the vehicle have a magnitude of almost zero, while the vectors in the surroundings show motion.
  • the motion vector field vfd resulting from this processing is the input for the next processing level.
  • the motion estimation described here, which is part of the known MPEG method, provides a simple yet effective motion analysis.
  • other methods of motion analysis that provide a motion vector field can also be used within the scope of the invention; these generally use one or more images temporally preceding or following the image to be examined.
  • in the next step b), one or more characteristic quantities chs, here e.g. the dominant movement orientation, are determined from the motion vector field vfd. This takes place in the analysis module AAN.
  • the orientation of all vectors from the motion field is entered in a histogram his, cf. Fig. 6.
  • the maximum hmx within the distribution represented by the histogram his indicates the main direction of motion within the picture.
  • for this direction, a magnitude of the movement speed is determined, calculated by simple averaging over all vectors that belong to this main direction of movement within a predeterminable tolerance (e.g. two classes adjacent to the histogram class of the maximum).
  • the result of this processing is a vector whose orientation and magnitude describe the main movement in the image. In the example considered here, this vector is the (two-dimensional) characteristic quantity chs from which sounds are derived in the subsequent processing stage.
  • the evaluation can of course also take place in a different way.
  • variables that can be used as the basis of the evaluation are, in particular, statistical characteristics of a distribution, such as the most frequent value (maximum), secondary maxima, associated variances, higher-order weights, etc.
  • in step c), the sound is then generated in the synthesis module SYN on the basis of the characteristic variable(s) chs.
  • as a function of the orientation and magnitude of the previously determined main motion vector, a sound snd is generated and output via the loudspeaker LSP.
  • alternatively, the sound sequence, which is present as an acoustic signal in a known electrical representation, can be transmitted to another subscriber via a telecommunication connection of the mobile phone MOG.
  • for example, during sound generation the magnitude of the movement speed controls the volume, while different types of sounds are generated depending on the movement orientation.
  • each orientation class represented in the histogram his is assigned a pre-stored sound which differs from the other sounds in its pitch and/or sound characteristics (overtone spectrum).
  • the sounds can - but do not have to - be arranged according to their pitch.
  • the type of sound generation can be varied.
  • in one variant of the invention, several maxima hmx, bmx of the distribution his could lead to a superposition of sounds.
  • in another variant, the histogram his could be fed directly to the synthesis device SYN, which uses it as the overtone spectrum of a fundamental tone; the pitch of the fundamental can remain the same (e.g. an A, 110 Hz) or can be determined using the procedure described above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Studio Devices (AREA)
  • Image Analysis (AREA)

Abstract

According to the invention, acoustic signals (snd) are produced from a video data flow containing a temporal sequence of individual images (img). A displacement vector field (vfd) is determined from the image data of the individual images (img), for example by means of an MPEG algorithm; at least one characteristic variable (chs) is then derived from the displacement vector field (vfd), e.g. a dominant displacement vector, in an analysis module (AAN); and an acoustic signal (snd) is produced on the basis of said variables (chs) by means of a synthesis module (SYN).

Description

Method and device for generating acoustic signals from a video data stream
The invention relates to the generation of signals - in particular acoustic signals - from a video data stream which contains a chronological sequence of individual images.
The conversion of movement information into acoustic signals is known, e.g., in the form of an acoustic motion detector. Such a detector records an image using a video camera and triggers a signal, such as an acoustic alarm, as soon as the image changes. However, such a motion detector delivers an unvarying signal; at best, the acoustic signal can be varied by dividing the examined image area into fields, each of which is assigned a different signal. In cases where, for example, an object moving relative to a background is to be monitored, motion detectors of this type naturally fail.
The generation of sounds depending on movements is also used in the field of the performing arts. For example, the body movements performed by a person are interpreted and used to control sound generation. In this way, the performing person can shape the musical rhythm and theme through their movements. The interaction between movements and sounds is a creative process that encourages the person to create ever new sound sequences through new movements. The generation of a sound is usually based on the person triggering sensors during their movements, for example in the manner of a motion detector or a light barrier; the signals from the sensors are processed by a data processing system, and sounds associated with the sensors are generated in this way. However, such sound generation relies on quite complex installations that receive their visual input via several recording devices and are, moreover, installed in a fixed position during operation.
It is the object of the invention to show a way in which movement information recorded via a camera can be converted into sound patterns, the sounds produced varying with the type of movement, in particular its direction. Such sound generation should, in particular, be able to be integrated as a function of a mobile phone with a built-in camera.
This object is achieved according to the invention by a method of the type mentioned at the outset with the following steps: a) determining a motion vector field from the image data of a single image with the aid of the image data of preceding and/or subsequent individual images, b) deriving at least one characteristic variable from the motion vector field, and c) generating an acoustic signal as a function of the characteristic variable(s).
A device suitable for the method according to the invention comprises a control device with means for determining a motion vector field from the image data of a single image with the aid of the image data of preceding and/or subsequent individual images, the control device additionally being set up to derive at least one characteristic variable from the motion vector field and to generate an acoustic signal as a function of the at least one characteristic variable.
The solution according to the invention provides a generation of sounds or acoustic signals that is based not only on the presence of movements but also on the magnitude and/or direction of the recorded movements. This allows a differentiated shaping of the generated sound. In the case of monitoring, this enables, for example, differentiated monitoring of an image area via acoustic signals, which among other things makes it possible to distinguish different movement sequences.
In a preferred embodiment, the solution according to the invention is implemented in a telecommunication terminal, in particular a mobile telephone. Since functions for image processing are often already provided in mobile telephones, the invention can be implemented there in a particularly inexpensive and compact manner. In this case, the video data stream can advantageously be generated by a camera device of the terminal. Furthermore, the acoustic signals can be output via a listening device of the terminal or via a telecommunications connection existing with the terminal.
In addition, it is advantageous, in particular in the case of a mobile phone, if the motion vector field in step a) is determined by means of an MPEG encoder method known per se.
An easy-to-implement evaluation of the motion vector field in step b) consists in deriving a distribution from it, e.g. using statistical methods, and determining statistical parameters for this distribution, from which the at least one characteristic variable is determined.
The invention and further advantages are explained in more detail below with reference to a non-restrictive exemplary embodiment shown in the accompanying drawings. The drawings show in schematic form:
Fig. 1 a front view of a mobile telephone according to the exemplary embodiment;
Fig. 2 a rear view of the mobile phone of Fig. 1;
Fig. 3 a block diagram of the mobile phone of Fig. 1;
Fig. 4 a block diagram of an MPEG encoder;
Fig. 5 an example of a vector field obtained from a motion estimation; and
Fig. 6 a movement histogram derived from the vector field of Fig. 5.
It should be noted that the exemplary embodiment described below is intended only as an example, and the invention presented above is not to be understood as being limited to it.
Figs. 1 and 2 show a mobile phone MOG which, according to the invention, converts movement recorded from the surroundings into acoustic signals ("acoustic kaleidoscope"). The features of the telephone MOG that are visible on the housing are, in a known manner, a microphone MIC, a loudspeaker LSP and an input field EIN (e.g. a keyboard) for entering operator commands and telephone numbers, as well as an output DIS in the form of a screen, e.g. an LCD display, on which a video image can be displayed. The video image img comes in particular from a camera module CAM located on the back (Fig. 2), which serves to record images from the surroundings and feed them to the image data processing according to the invention. Also on the back of the mobile phone MOG there is, in a known manner, a compartment CAC for the battery and SIM card.
The block diagram of Fig. 3 shows the components of the telephone MOG. In a known manner, in addition to the input/output elements LSP, MIC, EIN and the display DIS, there are an antenna ANN and a transmitting/receiving device SEE for performing the telecommunication functions; in addition, a processor PRC is provided as a control device for interpreting the commands entered by the user via the input field EIN and for correspondingly controlling the device SEE.
According to the invention, the processor PRC is furthermore set up to process the image data img of the camera module in order to generate sounds snd from them, as described below, with the aid of video coding and motion field analysis. The function for generating sounds snd from the motion information of images img essentially comprises the following processing steps: a) encoding of the image information img by an encoder module ENC, for example by means of an MPEG algorithm, to determine an associated motion vector field vfd, b) analysis of the vector field vfd for predetermined characteristic quantities chs, for example the prevailing motion vector, in an analysis module AAN of the processor system PRC, and c) sound generation based on the quantities chs, for example generation of a sound as a function of the orientation and magnitude of the prevailing motion vector, by a synthesis module SYN, which in the exemplary embodiment shown here is likewise realized in the processor system PRC.
In step a), the encoding of the image information takes place, for example, using the known MPEG encoding. This is a standardized method for the compression and transmission of digitized image sequences and is often already implemented in mobile phones with cameras, e.g. for the purpose of video telephony. Essential components of the method are the motion estimation BS of successive images (see below), the motion compensation BK and the transformation of the motion-compensated image into frequency space by means of a discrete cosine transformation (DCT), in conjunction with a data reduction DK. The principle of the method can be seen in Fig. 4; for a more detailed description, reference is made to "Digital Signal Processing for Multimedia Systems", Keshab K. Parhi and Takao Nishitani (eds.), Marcel Dekker, Inc., New York, pages 31-37. In Fig. 4, iDCT denotes the transformation inverse to the DCT, and DG the digitization of the image stream ims received as input.
For the derivation of acoustic signals according to the invention, it is not so much an image compressed according to the MPEG method that is needed, but rather the result of the motion estimation BS, which is part of the MPEG method. The inputs to the motion estimation BS are the image imn to be evaluated at time tn and the image imp preceding it at time tn-1. The image imp is obtained from the DCT-transformed and motion-compensated signal of the previous image by means of an inverse DCT (iDCT) and is temporarily buffered in an image memory IS for the duration of one image change.
The image imp serves as a reference image and is subdivided into a number of blocks bbl, for example, as shown in Fig. 5, into 36 pixel blocks of 16x16 pixels each. For each of these pixel blocks bbl, the best possible match is sought in a local neighborhood of the image imn to be evaluated, in accordance with the MSE ('Mean Squared Error') method. In this way, the displacement vector is obtained for each block bbl. Fig. 5 shows the image img of Fig. 1 as an example, in which the determined motion vectors v are additionally drawn in for each block bbl. The picture shows an automobile driving in front of a background; the camera was panned along with the vehicle during the recorded image sequence. For this reason, the motion vectors determined on the vehicle have a magnitude of almost zero, while the vectors in the surroundings show motion. The motion vector field vfd resulting from this processing is the input for the next processing stage. It should be noted that the motion estimation described here, which is part of the known MPEG method, provides a simple yet effective motion analysis. Other methods of motion analysis that provide a motion vector field can also be used within the scope of the invention; these generally use one or more images temporally preceding or following the image to be examined.
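By way of illustration only - the following sketch is not part of the patent text - a block-matching motion estimation of the kind described above can be written down in a few lines; the function name estimate_motion_field, the search radius and the use of NumPy are assumptions made for this example.

```python
import numpy as np

def estimate_motion_field(img_prev, img_curr, block=16, search_radius=8):
    """Block-matching motion estimation sketch using the MSE criterion.

    img_prev, img_curr: 2-D grayscale arrays of the same shape.
    Returns an array of shape (rows, cols, 2) with one (dy, dx)
    displacement vector per block, i.e. the motion vector field vfd.
    """
    h, w = img_prev.shape
    rows, cols = h // block, w // block
    vfd = np.zeros((rows, cols, 2), dtype=np.float32)

    for r in range(rows):
        for c in range(cols):
            y0, x0 = r * block, c * block
            ref = img_prev[y0:y0 + block, x0:x0 + block].astype(np.float32)
            best_err, best_vec = np.inf, (0.0, 0.0)
            # exhaustive search in a local neighborhood of +/- search_radius pixels
            for dy in range(-search_radius, search_radius + 1):
                for dx in range(-search_radius, search_radius + 1):
                    y, x = y0 + dy, x0 + dx
                    if y < 0 or x < 0 or y + block > h or x + block > w:
                        continue
                    cand = img_curr[y:y + block, x:x + block].astype(np.float32)
                    err = np.mean((cand - ref) ** 2)  # mean squared error of the candidate block
                    if err < best_err:
                        best_err, best_vec = err, (dy, dx)
            vfd[r, c] = best_vec
    return vfd
```

The returned array then plays the role of the motion vector field vfd that feeds the next processing stage.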
In the next step b), one or more characteristic quantities chs, here e.g. the dominant movement orientation, are determined from the motion vector field vfd. This takes place in the analysis module AAN. In the example considered here, the orientation of all vectors from the motion field is entered in a histogram his, cf. Fig. 6. (For the sake of clarity, Fig. 6 uses a division into 16 direction classes; of course, the number of classes of the basic set can be significantly higher and is limited only by the number of blocks bbl and the resolution of the motion estimation.) The maximum hmx within the distribution represented by the histogram his indicates the main direction of motion within the picture. For this direction, a magnitude of the movement speed is determined, calculated by simple averaging over all vectors that belong to this main direction of movement within a predeterminable tolerance (e.g. two classes adjacent to the histogram class of the maximum). The result of this processing is a vector whose orientation and magnitude describe the main movement in the image. In the example considered here, this vector is the (two-dimensional) characteristic quantity chs from which sounds are derived in the subsequent processing stage.
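The analysis step can be sketched in a similarly compact way; again this is an illustration, with the name analyse_vector_field, the default of 16 direction classes and the treatment of zero-length vectors chosen for the example rather than prescribed by the patent.

```python
import numpy as np

def analyse_vector_field(vfd, n_classes=16, tolerance=2):
    """Derive the dominant motion (orientation, magnitude) from a vector field.

    vfd: array of shape (rows, cols, 2) holding (dy, dx) per block.
    Returns (main_angle_rad, mean_speed) - the two-dimensional characteristic chs.
    """
    vecs = vfd.reshape(-1, 2)
    dy, dx = vecs[:, 0], vecs[:, 1]
    angles = np.arctan2(dy, dx) % (2 * np.pi)   # orientation of each vector
    speeds = np.hypot(dy, dx)                   # magnitude of each vector

    moving = speeds > 0                         # zero vectors carry no orientation
    if not np.any(moving):
        return 0.0, 0.0

    # orientation histogram "his" over n_classes direction classes
    his, edges = np.histogram(angles[moving], bins=n_classes, range=(0.0, 2 * np.pi))
    hmx = int(np.argmax(his))                          # class of the maximum
    main_angle = 0.5 * (edges[hmx] + edges[hmx + 1])   # centre of the dominant class

    # average speed of all vectors within +/- `tolerance` classes of the maximum
    cls = np.minimum((angles[moving] / (2 * np.pi) * n_classes).astype(int), n_classes - 1)
    dist = np.minimum((cls - hmx) % n_classes, (hmx - cls) % n_classes)  # circular class distance
    near = dist <= tolerance
    mean_speed = float(speeds[moving][near].mean()) if np.any(near) else 0.0

    return float(main_angle), mean_speed
```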
In other embodiments of the invention, the evaluation can of course also take place in a different way. For example, a histogram could be evaluated that takes into account both the orientation and the magnitude of the vectors (i.e. a frequency over a two-dimensional basic set). Variables that can be used as the basis of the evaluation are, in particular, statistical characteristics of a distribution, such as the most frequent value (maximum), secondary maxima, associated variances, higher-order weights, etc.
In step c), the sound is then generated in the synthesis module SYN on the basis of the characteristic variable(s) chs. As a function of the orientation and magnitude of the previously determined main motion vector, a sound snd is generated and output via the loudspeaker LSP. Alternatively, the sound sequence, which is present as an acoustic signal in a known electrical representation, can be transmitted to another subscriber via a telecommunication connection of the mobile phone MOG.
For example, during sound generation the magnitude of the movement speed controls the volume, while different types of sounds are generated depending on the movement orientation. Here, e.g., each orientation class represented in the histogram his is assigned a pre-stored sound that differs from the other sounds in its pitch and/or sound characteristics (overtone spectrum). The sounds can - but do not have to - be arranged according to their pitch.
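One conceivable form of such a mapping is sketched below; the 8 kHz sample rate, the semitone spacing of the class pitches and the plain sine tone standing in for a pre-stored sound are all illustrative assumptions.

```python
import numpy as np

def synthesise_sound(chs, n_classes=16, sample_rate=8000, duration=0.25):
    """Generate audio samples snd from the characteristic quantity chs.

    chs: (orientation in radians, speed), e.g. as returned by analyse_vector_field().
    The orientation selects one of n_classes pre-defined pitches,
    the speed controls the volume.
    """
    orientation, speed = chs
    # one pitch per orientation class, here simply spaced in semitones above A (110 Hz)
    cls = int(orientation / (2 * np.pi) * n_classes) % n_classes
    pitch = 110.0 * 2.0 ** (cls / 12.0)

    volume = min(speed / 16.0, 1.0)                # map speed (pixels per frame) to [0, 1]
    t = np.arange(int(sample_rate * duration)) / sample_rate
    snd = volume * np.sin(2 * np.pi * pitch * t)   # plain sine as a stand-in for a stored sound
    return snd.astype(np.float32)
```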
Of course, the type of sound generation can be varied. Thus, in one variant of the invention, several maxima hmx, bmx of the distribution his could lead to a superposition of sounds. In another variant, the histogram his could be fed directly to the synthesis device SYN, which uses it as the overtone spectrum of a fundamental tone; the pitch of the fundamental can remain the same (e.g. an A, 110 Hz) or can be determined using the procedure described above.
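The overtone-spectrum variant amounts to simple additive synthesis, as the following hedged sketch shows: the k-th histogram bin weights the k-th harmonic of a fixed 110 Hz fundamental (names and parameters are again assumptions, not the patent's implementation).

```python
import numpy as np

def synthesise_from_histogram(his, fundamental=110.0, sample_rate=8000, duration=0.5):
    """Additive synthesis: use the orientation histogram as an overtone spectrum.

    his: 1-D array of bin counts; bin k weights the (k+1)-th harmonic
    of the fixed fundamental (an A at 110 Hz by default).
    """
    weights = np.asarray(his, dtype=np.float64)
    if weights.sum() > 0:
        weights = weights / weights.sum()          # normalise so the level stays bounded
    t = np.arange(int(sample_rate * duration)) / sample_rate
    snd = np.zeros_like(t)
    for k, w in enumerate(weights, start=1):       # k-th harmonic of the fundamental
        snd += w * np.sin(2 * np.pi * k * fundamental * t)
    return snd
```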

Claims

1. A method for generating signals from a video data stream which contains a chronological sequence of individual images (img), characterized by the following steps: a) determining a motion vector field (vfd) from the image data of a single image (img) with the aid of the image data of preceding and/or subsequent individual images, b) deriving at least one characteristic variable (chs) from the motion vector field (vfd), and c) generating an acoustic signal (snd) as a function of the characteristic variable(s) (chs).
2. The method according to claim 1, characterized in that it is carried out by a telecommunications terminal (MOG), in particular a mobile phone.
3. The method according to claim 2, characterized in that the video data stream is generated by a camera device (CAM) of the terminal (MOG).
4. The method according to claim 2 or 3, characterized in that the acoustic signals are output via a listening device (LSP) of the terminal (MOG).
5. The method according to claim 2 or 3, characterized in that the acoustic signals are output via a telecommunications connection existing with the terminal (MOG).
6. The method according to any one of claims 1 to 5, characterized in that in step a) the motion vector field is determined by means of an MPEG encoder method known per se.
7. The method according to any one of claims 1 to 6, characterized in that in step b) a distribution (his) is derived from the motion vector field (vfd) and statistical parameters are determined for this distribution, from which the at least one characteristic variable (chs) is determined.
8. A device for generating signals from a video data stream which contains a chronological sequence of individual images (img), characterized by a control device with means (ENC) for determining a motion vector field (vfd) from the image data of a single image (img) with the aid of the image data of preceding and/or subsequent individual images, the control device additionally being set up to derive at least one characteristic variable (chs) from the motion vector field (vfd) and to generate an acoustic signal (snd) as a function of the at least one characteristic variable (chs).
9. The device according to claim 8, characterized in that it is provided in a telecommunications terminal (MOG), in particular a mobile phone.
10. The device according to claim 9, characterized by a camera device (CAM) for generating a video data stream (img) which can be fed to the means (ENC) for determining a motion vector field (vfd).
11. The device according to any one of claims 8 to 10, characterized in that the means (ENC) for determining a motion vector field (vfd) is an MPEG encoder.
EP02790332A 2001-11-22 2002-11-04 Method and device for producing acoustic signals from a video data flow Withdrawn EP1446956A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP02790332A EP1446956A2 (en) 2001-11-22 2002-11-04 Method and device for producing acoustic signals from a video data flow

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
EP01127842 2001-11-22
EP01127842 2001-11-22
EP02790332A EP1446956A2 (en) 2001-11-22 2002-11-04 Method and device for producing acoustic signals from a video data flow
PCT/EP2002/012295 WO2003045066A2 (en) 2001-11-22 2002-11-04 Producing acoustic signals from a video data flow

Publications (1)

Publication Number Publication Date
EP1446956A2 true EP1446956A2 (en) 2004-08-18

Family

ID=8179319

Family Applications (1)

Application Number Title Priority Date Filing Date
EP02790332A Withdrawn EP1446956A2 (en) 2001-11-22 2002-11-04 Method and device for producing acoustic signals from a video data flow

Country Status (3)

Country Link
EP (1) EP1446956A2 (en)
JP (1) JP2005510907A (en)
WO (1) WO2003045066A2 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1507389A1 (en) * 2003-08-13 2005-02-16 Sony Ericsson Mobile Communications AB Mobile phone with means for switching off the alarm remotely
DE102022105681A1 (en) 2022-03-10 2023-09-14 Ebm-Papst Mulfingen Gmbh & Co. Kg Method for determining vibration of a ventilation system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE3851655T2 (en) * 1987-06-09 1995-02-09 Sony Corp Processing of the motion vector in television pictures.
AUPP990199A0 (en) * 1999-04-22 1999-05-13 Griffith University Wireless video surveillance system
GB2350512A (en) * 1999-05-24 2000-11-29 Motorola Ltd Video encoder

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO03045066A3 *

Also Published As

Publication number Publication date
WO2003045066A2 (en) 2003-05-30
JP2005510907A (en) 2005-04-21
WO2003045066A3 (en) 2003-11-27

Similar Documents

Publication Publication Date Title
DE10084867B4 (en) A method and apparatus for allowing a videoconference participant to appear focused on the associated users in the camera
DE19642558B4 (en) Device for electronic program guide
DE69523503T2 (en) Audiovisual communication method and device with integrated, perception-dependent speech and video coding
DE69630121T2 (en) IMAGE COMPRESSION SYSTEM
DE69326751T2 (en) MOTION IMAGE ENCODER
DE10212915A1 (en) Electroendoscope system with electroendoscopes with different numbers of pixels
WO1994024634A1 (en) Process for detecting changes in moving images
WO2006056531A1 (en) Transcoding method and device
DE102008001076A1 (en) Method, device and computer program for reducing the resolution of an input image
DE102016121755A1 (en) Method for determining a composite image of a surrounding area of a motor vehicle with adaptation of brightness and / or color, camera system and power vehicle
DE69321011T2 (en) Method and device for noise measurement
EP0525900B1 (en) Filter circuit for preprocessing a video signal
DE102023134534A1 (en) ISSUE METHOD AND ELECTRONIC DEVICE
EP1205067A1 (en) Mobile videophone
EP0897247A2 (en) Method for computing motion vectors
DE69801165T2 (en) SIGNAL PROCESSING
EP1489842B1 (en) Motion vector based method and device for the interpolation of picture elements
WO2003045066A2 (en) Producing acoustic signals from a video data flow
DE69911964T2 (en) PERFORMANCE MEASUREMENT OF TELECOMMUNICATION SYSTEMS
DE19951341B4 (en) Method for the motion-compensating prediction of moving pictures and device therefor
EP2536127A2 (en) Method for image-based automatic exposure adjustment
DE102022121955A1 (en) AUDIO PROCESSING METHOD, AUDIO PROCESSING DEVICE, ELECTRONIC AUDIO PROCESSING DEVICE AND STORAGE MEDIA
DE19749655B4 (en) Method and device for coding a motion vector
EP0363677A2 (en) Circuit for the estimation of movement in a detected picture
DE10210926A1 (en) Device for tracking at least one object in a scene

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20040301

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR IE IT LI LU MC NL PT SE SK TR

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20090603