WO2002042242A2

WO2002042242A2 - Candidate level multi-modal integration system

Info

Publication number: WO2002042242A2
Application number: PCT/EP2001/013414
Authority: WO
Inventors: Antonio Colmenarez; Srinivas Gutta
Original assignee: Koninklijke Philips Electronics N.V.
Priority date: 2000-11-22
Filing date: 2001-11-16
Publication date: 2002-05-30
Also published as: WO2002042242A3; KR20020070491A; JP2004514970A; EP1340187A2

Abstract

A multi-modal integration system includes sensing devices and a multi-modal integration unit. The sensing devices provide lists of candidate pairs, each pair including a candidate characterization and a probability expressing a confidence with respect to that candidate. The multi-modal integration unit receives the lists from the sensing devices, and provides multi-modal contextual information to each sensing device responsive to the lists. The sensing devices then provide new lists of candidate pairs iteratively, using the new information from the multi-modal integration unit to alter sensing or other performance. A super-system including a hierarchy of these multi-modal integration units may be constructed.

Description

Candidate level multi-modal integration system

BACKGROUND OF THE INVENTION

1. Field of the Invention

The invention relates to the field integrating signals representing sensed data from multiple sensing modalities, also known as multi-modal integration; and in particular to integration of data from multiple sensing modalities that preprocess data and make at least a tentative characterization or labeling of that data.

2. Background of the Invention The field of ulti -modal integration has a number of applications. These will be discussed at greater length in the Detailed Description.

Prior art attempts at multi-modal integration have adopted several approaches. One example can be found in A. Jain et al., "A multim dal biometric system using fingerprint, face and speech", 2^nd Int. Conf. on Audio and Nideo-Based Biometric Person Authentication (Washington, USA, 3/99) . The general approach of this work will be referred to herein as "decision level integration". Fig. 1 gives a conceptual view of decision level integration as applied to taking data from a "scene" 101. The scene is sensed and processed by at least two separate modules at 102 and 103. Each module includes a sensing operation 104, a feature extraction operation 105, and a recognition operation 106. Each module yields a uni-modal ("UM") "decision" 107, which characterizes or labels data gathered from the scene. Herein, the term "characterization" is intended to be a generic term, which includes both the concepts of "decision" and "label."

Feature extraction 105 normally involves applying a mathematical transformation or predetermined algorithm to the data acquired in the sensing step. Recognition 106 normally involves a type of processing that requires some training, for instance through use of a neural network. Afterwards, a multi-modal integration unit ("MMI") applies multi-modal heuristics and or rules to decide how to yield a final multi- modal decision, which characterizes or labels some aspect of the scene based on the disparate data gathered and processed in the processes 102 and 103. Decision level integration has the advantage of simplicity of implementation. It can incorporate uni-modal systems that are independently studied, developed, and updated. These systems thus can operate as pre-processors. Also the communication channels between the uni-modal systems and the MMI are one-way and have little bandwidth. Decision-level integration, however, is limited in the level of cooperation that can be implemented between different modalities. In general, correlation between modalities is not fully exploited; therefore, information from one modality cannot be used to improve decisions made on the others. For instance, when the decisions from two redundant modalities do not agree, the most confident one is to be taken and the other is to be discarded, resulting in no overall improvement, if not degrading the results obtained with one modality, because of the competition with the others.

SUMMARY OF THE INVENTION

It is an object of the invention to enhance multi-modal integration by improving cooperative use of data between uni-modal contributors to the multi-modal system, while retaining the advantages of preprocessing from independent uni-modal systems.

This object is achieved in that the independent uni-modal systems create sets of characterization pairs, each pair including a respective candidate characterization and confidence level. The MMI receives and processes the sets of characterization pairs and supplies at least one final characterization of the signals. The final characterization is chosen from at least one of the characterization pairs.

Alternatively, the object is achieved in that the MMI receives candidate characterizing signals from the uni-modal contributors and provides at least one control signal thereto. The control signal controls processing and/or sensing. The control signal is derived from the candidate characterizing signals.

In a second alternative, the object is achieved in a training method. The method includes a training phase and a normal operation phase.

In the training phase, candidate characterization signals and ground truths are received. The candidate characterization signals are from a plurality of previously trained sensing devices, which devices include trained processors, and the candidate characterization signals result from an initial physical reality setting. Then training parameters are tuned to achieve ground truths about the physical reality, by evaluating optimization criteria and the candidate characterization signals. In the normal operation phase, further candidate characterization signals are received from the plurality of previously trained sensing devices. A tentative final characterization signal is created. Then at least one control signal is fed back to the at least one of the sensing devices. The control signal is adapted to cause a change in training and/or performance of a sensing device. The steps of the normal operation phase are repeated until a characterization criterion is met.

Additionally, the object of the invention is achieved in a uni-modal sensing device which provides characterization information upwards to a multi-modal integration unit, and receives multi-modal contextual information down from the multi-modal integration unit.

In a related field, more cooperation between uni-modal devices is achieved without pre-processing. This field will be called "feature level integration" herein. An example of this field is to be found in U.S. Pat. No. 5,586,215. Fig. 2 shows the general concept of feature level processing. Again, at 101, a scene is presented. At 202, at least two different types of sensing occur. Then at 203, all sensed data, is subjected to some type of feature extraction to yield a feature vector. The feature vector is then processed to yield some kind of multi-modal recognition at 204, with a multi-modal decision being output at 205. Again, feature extraction typicallyresults from applying some sort of mathematical transformation or predefined algorithm to the sensed data; while recognition is usually an operation requiring some kind of training, such as use of a neural network.

BRIEF DESCRIPTION OF THE DRAWING

The invention will now be described by way of non-limiting example, with reference to the following figures: Fig. 1 shows a prior art multi-modal integration architecture.

Fig. 2 shows a prior art multi-modal integration architecture.

Fig. 3 shows a system in accordance with the invention.

Fig. 4 shows a system in accordance with the invention.

Fig. 5 illustrates a list of symbols used to explain the invention. Fig. 6 is a flowchart describing development of a candidate list.

Fig. 7 is a schematic diagram of a super-system including several MMI devices.

Fig. 8 is a flow chart describing operation of an MMI. DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS Application areas

There are a number of fields of application for multi-modal decision making. One is is lip-reading, where audio data for phoneme recognition is used in combination with video data in an effort to understand a speaker. Similarly, identification of a person might include a combination of audio and video data.

Yet another area where multi-modal information might be integrated could be in image processing, where different image aspects such as local or global 2-D shape, color characterizations, gray level appearance, and textural properties might all be considered in characterizing an image. In general, sensing video data may include gathering and characterizing information about any number of things such as fingerprints and facial images including: feature positions, feature appearance, and profile shapes.

One camera may be used to gather more than one type of data about a scene, with different processing modules within a connected processor using the data in different ways. The modules that gather different types of information from an image then effectively become different sensing devices, even though they may physically be housed within a single processor.

In other applications, signals from other types of sensors may need to be combined. Other types of sensors that might be useful in multi-modal integration applications include infrared and range sensors. In addition, user entry devices including keyboards and pointer devices such as mice, stylus type sensors, track balls, and so forth, can be used as uni-modal sensing devices. Other areas where multi-modal integration may be useful include acoustic localization via microphone arrays and the use of echo cancellation by direct input of a known source of audio/noise. Even text data might be used in some applications.

In general those of ordinary skill in the art may devise any number of applications for multi-modal integration.

Architecture of a System in Accordance with the Invention Figures 3 and 4 are illustrated in the context of a system that has both video and audio uni-modal systems, for instance a video conferencing system. However, the concepts of candidate level integration are equally applicable to many other applications, such as those listed in the preceding section. Fig. 3 shows an architecture of a system in accordance with the invention. Again, there is a scene 101, which is sensed by sensors 301, 301', and 302. These sensors are shown to be a microphone and a video camera, but they might be any sensors appropriate to a desired application area, including user entry devices such as keyboards, mice, touch screens, or any other user entry device. At 303, 304, and 305, features are extracted from signals derived from the sensors. At 306, 307, and 308 the extracted features are processed and recognized. At 310, 312, and 314, candidate decisions are presented to the MMI 317. At 309, 311, and 313, control signals, in the form of multi-modal contextual information, are provided back down to boxes 306, 307, and 308. In this system, two sets of features are shown as being extracted from the video data, at 304 and 305. For instance, facial feature data might be extracted at 305, while gesture feature data might be extracted at 304. Boxes 305 and 308 function together as a separate sensing device from boxes 304 and 307. Thus the video camera 302 is actually connected to two sensing devices. In other words, a single sensing element can be connected with any number of sensing devices.

In contradistinction with the video part of the system, the plurality of microphone in the array 301 and 301 ' function together with a single pair of boxes 303 and 306. Boxes 303 and 306 thus function together as a third sensing device, for instance to collect position data. Thus more than one sensing element can feed a single sensing device. Additional sensing devices might be added ~ whether coupled to the existing sensing elements or to additional sensing elements. There can be any number of sensing elements and sensing devices.

The control data fed back at 309, 311, and 313 will affect the performance and/or training of the respective sensing devices. For instance, control signals to a video sensing device might bias what part of the picture the sensing device looks at .

In Fig. 3, the sensing devices are shown in the same processor 316 with the MMI 317. In Fig. 4, an alternative embodiment is shown, where the sensing devices 416, 417, and 418 are housed separately from the MMI 417. The connections 409-414 that supply the candidate decisions are now external leads. Boxes 303-305 do feature extraction on the data received from the scene. The output of boxes 303-305 will be in the form of feature vectors per formula (3) from Fig. 5. Boxes 306-308 produce candidate lists in accordance with the invention. In the prior art, normally only a single decision was made based on the values of a discriminating function or functions. The field of discriminating functions is well- developed, for instance as described in K. Fukunaga, Introduction to Statistical Pattern Recognition (2d Ed., Academic Press, 10/99). If a single discriminating function was applied, it would typically have a number of local maxima. Then the decision would be the highest of those local maxima. If multiple discriminating functions were applied, or if a single discriminating function were applied repeatedly to various parts of the data, then the decision would be the highest value received from all the functions or applications of the function.

In the preferred embodiment, the discriminating functions will normally be probability distributions, denominated "P" herein. However, those of ordinary skill in the art will be able to devise other discriminating functions in accordance with the needs of whatever application area is chosen.

In accordance with the invention, it is desired to supply a candidate list from the sensing devices, on lines 310, 312, or 314. Each sensing device should produce a candidate list per formula (1) of Fig 5, where _* - is a variable representing a candidate from a uni-modal sensing device

• k is an index variable numbering the uni-modal sensing devices

• i is an index variable

• M_k is the number of candidates to be produced for uni-modal sensing device number K. Fig. 6 is a flow-chart showing more of the operation of the individual recognition units, 306-308 within the sensing devices. The labels of the flowchart make reference to the formula numbers from Fig. 5.

At 601, the list of multi-modal contextual information of Fig. 5 in the form of an initialized list of default values for formula (2) is received from the MMI on lines 313, 311, and 309. At 602, formula (5) is applied to get the candidates (1). Formula (5) expresses multiplication of the results of formula (2), received from the MMI, with a probability based discriminating function, per formula (4). At 603, some criterion is evaluated. The criterion could be that some fixed number of iterations have been completed, or that no change in the candidate list (6) has been achieved since the last iteration, or any other suitable criterion devised by the skilled artisan. If the criterion is not met, then, at 604, the current list of candidate pairs per formula (6) is sent to the MMI 317, 417. The candidate pair list includes the candidates from formula (1) together with the confidence level from formula (4). The candidate pair list is an example of the term "characterization pairs" used elsewhere herein, and is provided to the MMI on lines 310, 312, and 314. At 606, new multi-modal contextual information is received from the MMI at 606 in the form of formula (2), based on the new proposed candidate list and control is returned to 602.

If the criterion is met, then at 605 a final set of candidates in the form of formula (6) is sent to the MMI. The MMI 317, 417 in turn performs an evaluation of all the combinations of candidates from the uni-modal sensing devices. Fig. 8 shows a flowchart of the operation of the MMI.

At 801, the candidate pair lists, per formula (6) are received from the uni- modal sensing devices. Each uni-modal sensing device, k, produces a list of candidate pairs, per equation (6). At 802, a list of combinations of uni-modal candidates is formed as expressed in formula (7). The total number of combinations is L and the index numbering the combinations is c.

Each combination of candidates normally includes one uni-modal candidate from each of the uni-modal sensing devices. Each combinations of uni-modal candidates is used to create a multi-modal characterization c of the scene. The multi-modal characterization may be the same as one of the characterizations (1) coming from the uni- modal sensing devices. Alternatively, the multi-modal characterization may characterize some combination pattern derived from the patterns recognized by the uni-modal devices. The multi-modal characterizations are analyzed according to a multi-modal discriminating function (8). This function evaluates a product of a) super-multi-modal contextual information P(c); and b) a product of a probability function applied to each combination with a product of all of the probabilities of all the of the uni-modal decisions, per formula (4). Analogously with the uni-modal systems, the super-multi-modal contextual information P(c) will first be initialized to some default value. Advantageously, the value of P(c) can then be modified based on information received at a higher level from the MMI. This modified value will then be supplied as new super-multi-modal contextual information from the higher level.

Based on the analysis set forth in formula (8), super-candidates are chosen to be supplied from the MMI. These are a subset {c } of the possible combinations (7). The super-candidates will be provided as another list of characterization pairs. This time the characterization pairs will have the format of formula (9).

Then, at 803, a criterion is tested. This criterion may be a number of iterations, lack of change of the output (2) since the last iteration, lack of change of the multi- modal candidate pairs (9) since the last iteration, or any other suitable criterion devised by the skilled artisan. If the criterion is not met, then the multi-modal contextual information, per formula (2) is sent to the individual uni-modal devices at 804. The values sent to the uni- modal device will typically vary according to what type of data that device is gathering. Fig. 7 shows a system with a super-MMI 701. In this case, there are three MMI' S 702-704, each of which corresponds to the MMI 317, 417 discussed before. Each MMI is coupled with a plurality of uni-modal sensing devices 705. The MMFs 702-704 send super-candidate lists, i.e. characterization pairs, per formula (9) via 707 to the super-MMI 701 and receive super-multi-modal contextual information P(c) via 706 from the super-MMI 701. The super-MMI may produce further characterization pairs at 708, and can therefore be part of a super-super-MMI system, with another level of hierarchy. The super-MMI 70 operates analogously to the MMI, treating the MMFs the way the MMI's treat uni-modal sensing devices.

In Fig. 7 there are three MMFs (702) each with three uni-modal sensing devices (705). However, those of ordinary skill in the art will appreciate that there could be other numbers of components. For instance, the super-MMI might be coupled with at least one MMI and at least one free-standing uni-modal sensing device. Alternatively, there might be two MMI's, each being fed by two uni-modal sensing devices - and so forth.

From reading the present disclosure, other modifications will be apparent to persons skilled in the art. Such modifications may involve other features that are already known in the design, manufacture and use of recognition of sensed data and which may be used instead of or in addition to features already described herein. Although claims have been formulated in this application to particular combinations of features, it should be understood that the scope of the disclosure of the present application also includes any novel feature or novel combination of features disclosed herein either explicitly or implicitly or any generalization thereof, whether or not it mitigates any or all of the same technical problems as does the present invention. The applicants hereby give notice that new claims may be formulated to such features during the prosecution of the present application or any further application derived therefrom.

The word "comprising", "comprise", or "comprises" as used herein should not be viewed as excluding additional elements. The singular article "a" or "an" as used herein should not be viewed as excluding a plurality of elements.

Claims

CLAIMS:

1. A multi-modal integration unit (317, 417, 702, 703, 704) comprising:

- means for receiving (310, 312, 314, 410, 412, 414), from each of a plurality of sensing devices (416 -418, 303-308), a respective set of characterization pairs ((6)), each characterization pair comprising a respective candidate characterization ((1)) and a respective confidence ((4)) indication relating to the respective candidate characterization, which characterization pairs result from pre-processing in the sensing devices;

- means for processing the sets of characterization pairs and to supply (315) at least one final characterization ((9)) of signals received at the sensing devices, which final characterization is chosen from at least one of the characterization pairs.

2. A data processing system, comprising:

- a plurality of sensing devices that together include:

- at least one sensing element (301 , 301 ' , 302) adapted to receive input signals representing physical reality; and

- a plurality of processors or processes (303-308, 416-418) adapted to supply respective characterizing signals for each sensing device characterizing said input signals, the respective characterizing signals from each sensing device comprising a set of characterization pairs, each characterization pair comprising a respective candidate characterization and a respective confidence indication relating to the respective candidate characterization;

- a multi-modal integration unit as claimed in claim 1.

3. The system of claim 2, wherein the multi-modal integration unit employs a discriminating function to process the sets of characterization pairs.

4. The system of claim 3, wherein the discriminating function is a probability distribution.

5. A super-system comprising:

- at least one system as claimed in claim 2, wherein the at least one final characterization for each system comprises a respective further set of characterization pairs ((9));

- if there is only one such system, then at least one further uni-modal sensing device; and - at least one super-multi-modal integration unit (701) adapted

- to receive and process the further sets of characterization pairs along with signals from the at least one further sensing device and

- to supply at least one super-final characterization (708) of the signals, which super- final characterization is chosen from at least one characterization pair from the further set of characterization pairs .

6. A multi-modal integration unit (702- 704, 317, 417) comprising

- means for

- receiving (314, 312, 310, 410, 412, 414) respective candidate characterizing signals ((1), (6)) from each of a plurality of sensing devices (303-308, 416-418), which sensing devices comprise pre-processing capability, which candidate characterizing signals characterize a physical reality; and

- supplying (409, 411, 413, 309, 311, 313) at least one control signal to the sensing devices for controlling processing and/or sensing therein; and - means for processing the candidate characterizing signals to derive therefrom at least one final characterizing signal and the at least one control signal.

7. A data processing system comprising:

- a plurality of sensing devices each comprising: - at least one sensing element (301 , 301 ' , 302) adapted to receive input signals representing physical reality; and

- at least one processor or process (303-308,416-418) adapted

- to provide (414, 412, 410) at least one respective candidate characterizing signal characterizing said physical reality based on the input signals; and - to receive (413 , 411 , 409) control signals for controlling processing and/or sensing; and

- the multi-modal integration unit of claim 6.

8. The system of claim 7, wherein the control signals relate to biasing a selection of physical reality.

9. The system of claim 7 wherein the control signals are in the form of feedback to the sensing devices from the multi-modal integration unit.

10. The system of claim 7, wherein the respective candidate characterization signal includes a respective candidate list ((1)) from each sensing device.

11. The system of claim 10, wherein each respective candidate list comprises a set of characterization pairs ((6)), each characterization pair comprising a respective candidate characterization ((1)) and a respective confidence indication ((4)) relating to the respective candidate characterization.

12. The system of claim 7, wherein the multi-modal integration unit employs a discriminating function to process the respective candidate characterization signals.

13. The system of claim 12, wherein the discriminating function is a probability distribution.

14. A super-system comprising:

- at least one systems as claimed in claim 12,

- if there is only one system, then at least one further uni-modal sensing device; and

- at least one super-multi-modal integration unit (701) adapted - to receive and process the at least one final characterizing signal ((9), 707) from the at least one system and any signals from any further uni-modal sensing device, and

- to derive therefrom at least one super-final characterizing signal (701).

15. A sensing device suitable for use in a multi-modal integration system as claimed in claim 7, the sensing device comprising:

- coupling means

- for receiving signals representing physical reality from at least one sensing element (301, 301 ', 302) ; and - for communicating bi-directionally (409-414, 309-314) with a multi-modal integration unit; and

- at least one processor or process (303-308, 416-418) adapted

- to receive control signals (409, 411, 413, 309, 311, 313) from a multi-modal integration unit (317, 417, 702-704) for controlling processing and/or sensing; and

- responsive to the control signals, to provide signals (410, 412, 414, 310, 312, 314) representing a list of candidate characterizations of said physical reality to the multi- modal integration unit.

16. The device of claim 15, wherein the data representing physical reality comprises video data.

17. The device of claim 16, wherein the control signals bias the video data to a portion of a field of view.

18. A method for training a data processing system, comprising executing the following operations in at least one data processing device:

- entering a training phase, comprising:

- receiving candidate characterization signals from a plurality of previously trained sensing devices, which devices include trained processors, which candidate characterization signals are derived from an initial physical reality setting;

- retrieving signals representing ground truths about the physical reality; and

- tuning training parameters to achieve the ground truths by evaluating optimization criteria and the candidate characterization signals; and - after completion of the training phase, entering a normal operation phase including:

- receiving further candidate characterization signals from the plurality of previously trained sensing devices;

- creating a tentative final characterization signal;

- feeding back at least one control signal to at least one of the sensing devices, which control signal is adapted to cause a change in training and/or performance of the at least one of the sensing devices; and

- repeating the receiving of further candidate characterization signal, creating, and feeding back until a characterization criterion is met.