WO2002042242A2 - Candidate level multi-modal integration system - Google Patents

Info

Publication number
WO2002042242A2
Authority
WO
WIPO (PCT)
Prior art keywords
characterization
modal
signals
candidate
sensing
Prior art date
Application number
PCT/EP2001/013414
Other languages
French (fr)
Other versions
WO2002042242A3 (en)
Inventor
Antonio Colmenarez
Srinivas Gutta
Original Assignee
Koninklijke Philips Electronics N.V.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Koninklijke Philips Electronics N.V.
Priority to KR1020027009315A (published as KR20020070491A)
Priority to EP01989488A (published as EP1340187A2)
Priority to JP2002544381A (published as JP2004514970A)
Publication of WO2002042242A2
Publication of WO2002042242A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • G06F18/256 Fusion techniques of classification results, e.g. of results related to same input data of results relating to different input data, e.g. multimodal recognition


Abstract

A multi-modal integration system includes sensing devices and a multi-modal integration unit. The sensing devices provide lists of candidate pairs, each pair including a candidate characterization and a probability expressing a confidence with respect to that candidate. The multi-modal integration unit receives the lists from the sensing devices, and provides multi-modal contextual information to each sensing device responsive to the lists. The sensing devices then provide new lists of candidate pairs iteratively, using the new information from the multi-modal integration unit to alter sensing or other performance. A super-system including a hierarchy of these multi-modal integration units may be constructed.

Description

Candidate level multi-modal integration system
BACKGROUND OF THE INVENTION
1. Field of the Invention
The invention relates to the field of integrating signals representing sensed data from multiple sensing modalities, also known as multi-modal integration; and in particular to integration of data from multiple sensing modalities that preprocess data and make at least a tentative characterization or labeling of that data.
2. Background of the Invention
The field of multi-modal integration has a number of applications. These will be discussed at greater length in the Detailed Description.
Prior art attempts at multi-modal integration have adopted several approaches. One example can be found in A. Jain et al., "A multimodal biometric system using fingerprint, face and speech", 2nd Int. Conf. on Audio- and Video-Based Biometric Person Authentication (Washington, USA, 3/99). The general approach of this work will be referred to herein as "decision level integration". Fig. 1 gives a conceptual view of decision level integration as applied to taking data from a "scene" 101. The scene is sensed and processed by at least two separate modules at 102 and 103. Each module includes a sensing operation 104, a feature extraction operation 105, and a recognition operation 106. Each module yields a uni-modal ("UM") "decision" 107, which characterizes or labels data gathered from the scene. Herein, the term "characterization" is intended to be a generic term, which includes both the concepts of "decision" and "label."
Feature extraction 105 normally involves applying a mathematical transformation or predetermined algorithm to the data acquired in the sensing step. Recognition 106 normally involves a type of processing that requires some training, for instance through use of a neural network. Afterwards, a multi-modal integration unit ("MMI") applies multi-modal heuristics and/or rules to decide how to yield a final multi-modal decision, which characterizes or labels some aspect of the scene based on the disparate data gathered and processed in the processes 102 and 103. Decision level integration has the advantage of simplicity of implementation. It can incorporate uni-modal systems that are independently studied, developed, and updated; these systems thus can operate as pre-processors. Also, the communication channels between the uni-modal systems and the MMI are one-way and require little bandwidth. Decision-level integration, however, is limited in the level of cooperation that can be implemented between different modalities. In general, correlation between modalities is not fully exploited; therefore, information from one modality cannot be used to improve decisions made on the others. For instance, when the decisions from two redundant modalities disagree, the more confident one is taken and the other is discarded, which yields no overall improvement and may even degrade the results obtained with a single modality, because the modalities compete rather than cooperate.
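The limitation can be illustrated with a short sketch. The following Python fragment is only a minimal illustration of decision-level fusion as characterized above, not an implementation taken from the patent; the class, function names, and numbers are hypothetical.
```python
from dataclasses import dataclass

@dataclass
class UniModalDecision:
    label: str          # the characterization produced by one modality
    confidence: float   # that modality's own confidence in the label

def decision_level_fusion(decisions):
    """Prior-art style fusion: when modalities disagree, keep only the most
    confident uni-modal decision and discard the others."""
    return max(decisions, key=lambda d: d.confidence)

# Example: the face and voice modalities disagree about who is in the scene.
face = UniModalDecision(label="person A", confidence=0.70)
voice = UniModalDecision(label="person B", confidence=0.65)
print(decision_level_fusion([face, voice]).label)  # "person A"; the audio evidence is simply dropped
```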
SUMMARY OF THE INVENTION
It is an object of the invention to enhance multi-modal integration by improving cooperative use of data between uni-modal contributors to the multi-modal system, while retaining the advantages of preprocessing from independent uni-modal systems.
This object is achieved in that the independent uni-modal systems create sets of characterization pairs, each pair including a respective candidate characterization and confidence level. The MMI receives and processes the sets of characterization pairs and supplies at least one final characterization of the signals. The final characterization is chosen from at least one of the characterization pairs.
Alternatively, the object is achieved in that the MMI receives candidate characterizing signals from the uni-modal contributors and provides at least one control signal thereto. The control signal controls processing and/or sensing. The control signal is derived from the candidate characterizing signals.
In a second alternative, the object is achieved in a training method. The method includes a training phase and a normal operation phase.
In the training phase, candidate characterization signals and ground truths are received. The candidate characterization signals are from a plurality of previously trained sensing devices, which devices include trained processors, and the candidate characterization signals result from an initial physical reality setting. Then training parameters are tuned to achieve the ground truths about the physical reality, by evaluating optimization criteria and the candidate characterization signals. In the normal operation phase, further candidate characterization signals are received from the plurality of previously trained sensing devices. A tentative final characterization signal is created. Then at least one control signal is fed back to at least one of the sensing devices. The control signal is adapted to cause a change in training and/or performance of a sensing device. The steps of the normal operation phase are repeated until a characterization criterion is met.
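As a rough orientation only, the two phases can be pictured as the skeleton below. It is a sketch under the assumption that the devices and the MMI expose methods such as characterize, tune, integrate, and apply_control; those names, and the fixed iteration cap, are inventions for illustration and are not specified by the patent.
```python
def training_phase(mmi, sensing_devices, training_scenes, ground_truths):
    """Tune the MMI so that its final characterizations match known ground truths."""
    for scene, truth in zip(training_scenes, ground_truths):
        candidates = [dev.characterize(scene) for dev in sensing_devices]
        mmi.tune(candidates, truth)   # evaluate optimization criteria, adjust parameters

def normal_operation_phase(mmi, sensing_devices, scene, max_iterations=10):
    """Iterate candidate lists and feedback control signals until a criterion is met."""
    final = None
    for _ in range(max_iterations):
        candidates = [dev.characterize(scene) for dev in sensing_devices]
        final = mmi.integrate(candidates)                 # tentative final characterization
        if mmi.criterion_met(final):
            break
        for dev, ctrl in zip(sensing_devices, mmi.control_signals(final)):
            dev.apply_control(ctrl)                       # feedback alters training/performance
    return final
```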
Additionally, the object of the invention is achieved in a uni-modal sensing device which provides characterization information upwards to a multi-modal integration unit, and receives multi-modal contextual information down from the multi-modal integration unit.
In a related field, more cooperation between uni-modal devices is achieved without pre-processing. This field will be called "feature level integration" herein. An example of this field is to be found in U.S. Pat. No. 5,586,215. Fig. 2 shows the general concept of feature level processing. Again, at 101, a scene is presented. At 202, at least two different types of sensing occur. Then at 203, all sensed data is subjected to some type of feature extraction to yield a feature vector. The feature vector is then processed to yield some kind of multi-modal recognition at 204, with a multi-modal decision being output at 205. Again, feature extraction typically results from applying some sort of mathematical transformation or predefined algorithm to the sensed data, while recognition is usually an operation requiring some kind of training, such as use of a neural network.
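For contrast with the decision-level scheme of Fig. 1, feature-level integration can be pictured as concatenating the per-modality feature vectors and handing the joint vector to a single multi-modal recognizer. The sketch below is illustrative only; the feature transformations and dimensions are invented stand-ins, not the method of U.S. Pat. No. 5,586,215.
```python
import numpy as np

def audio_features(audio_frame):
    # stand-in transformation: a few coarse statistics of the waveform
    return np.array([audio_frame.mean(), audio_frame.std(), np.abs(audio_frame).max()])

def video_features(image):
    # stand-in transformation: coarse intensity statistics of the frame
    return np.array([image.mean(), image.std()])

def joint_feature_vector(audio_frame, image):
    """Feature-level integration: one concatenated vector feeds a single
    multi-modal recognizer, instead of fusing per-modality decisions."""
    return np.concatenate([audio_features(audio_frame), video_features(image)])

vector = joint_feature_vector(np.random.randn(16000), np.random.rand(64, 64))
print(vector.shape)  # (5,) -- this joint vector would go to one trained recognizer
```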
BRIEF DESCRIPTION OF THE DRAWING
The invention will now be described by way of non-limiting example, with reference to the following figures: Fig. 1 shows a prior art multi-modal integration architecture.
Fig. 2 shows a prior art multi-modal integration architecture.
Fig. 3 shows a system in accordance with the invention.
Fig. 4 shows a system in accordance with the invention.
Fig. 5 illustrates a list of symbols used to explain the invention.
Fig. 6 is a flowchart describing development of a candidate list.
Fig. 7 is a schematic diagram of a super-system including several MMI devices.
Fig. 8 is a flow chart describing operation of an MMI.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
Application areas
There are a number of fields of application for multi-modal decision making. One is lip-reading, where audio data for phoneme recognition is used in combination with video data in an effort to understand a speaker. Similarly, identification of a person might include a combination of audio and video data.
Yet another area where multi-modal information might be integrated could be in image processing, where different image aspects such as local or global 2-D shape, color characterizations, gray level appearance, and textural properties might all be considered in characterizing an image. In general, sensing video data may include gathering and characterizing information about any number of things, such as fingerprints and facial images, including feature positions, feature appearance, and profile shapes.
One camera may be used to gather more than one type of data about a scene, with different processing modules within a connected processor using the data in different ways. The modules that gather different types of information from an image then effectively become different sensing devices, even though they may physically be housed within a single processor.
In other applications, signals from other types of sensors may need to be combined. Other types of sensors that might be useful in multi-modal integration applications include infrared and range sensors. In addition, user entry devices including keyboards and pointer devices such as mice, stylus type sensors, track balls, and so forth, can be used as uni-modal sensing devices. Other areas where multi-modal integration may be useful include acoustic localization via microphone arrays and the use of echo cancellation by direct input of a known source of audio/noise. Even text data might be used in some applications.
In general those of ordinary skill in the art may devise any number of applications for multi-modal integration.
Architecture of a System in Accordance with the Invention
Figures 3 and 4 are illustrated in the context of a system that has both video and audio uni-modal systems, for instance a video conferencing system. However, the concepts of candidate level integration are equally applicable to many other applications, such as those listed in the preceding section. Fig. 3 shows an architecture of a system in accordance with the invention. Again, there is a scene 101, which is sensed by sensors 301, 301', and 302. These sensors are shown as microphones and a video camera, but they might be any sensors appropriate to a desired application area, including user entry devices such as keyboards, mice, touch screens, or any other user entry device. At 303, 304, and 305, features are extracted from signals derived from the sensors. At 306, 307, and 308 the extracted features are processed and recognized. At 310, 312, and 314, candidate decisions are presented to the MMI 317. At 309, 311, and 313, control signals, in the form of multi-modal contextual information, are provided back down to boxes 306, 307, and 308. In this system, two sets of features are shown as being extracted from the video data, at 304 and 305. For instance, facial feature data might be extracted at 305, while gesture feature data might be extracted at 304. Boxes 305 and 308 function together as a separate sensing device from boxes 304 and 307. Thus the video camera 302 is actually connected to two sensing devices. In other words, a single sensing element can be connected with any number of sensing devices.
In contradistinction with the video part of the system, the plurality of microphones in the array 301 and 301' function together with a single pair of boxes 303 and 306. Boxes 303 and 306 thus function together as a third sensing device, for instance to collect position data. Thus more than one sensing element can feed a single sensing device. Additional sensing devices might be added, whether coupled to the existing sensing elements or to additional sensing elements. There can be any number of sensing elements and sensing devices.
The control data fed back at 309, 311, and 313 will affect the performance and/or training of the respective sensing devices. For instance, control signals to a video sensing device might bias what part of the picture the sensing device looks at.
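The following Python sketch suggests one way such a control signal could bias a video sensing device toward part of the picture. The region-of-interest mechanism and every name in it are assumptions made for illustration; the patent does not specify the form of the control signals.
```python
import numpy as np

class VideoSensingDevice:
    """Hypothetical uni-modal sensing device whose attention can be biased
    by multi-modal contextual control signals from the MMI."""

    def __init__(self, height, width):
        self.roi = (0, height, 0, width)   # default: use the whole frame

    def apply_control(self, control_signal):
        # A control signal from the MMI biases which part of the picture is used.
        self.roi = control_signal.get("roi", self.roi)

    def extract_features(self, frame):
        top, bottom, left, right = self.roi
        window = frame[top:bottom, left:right]
        # Stand-in feature vector computed only on the biased region.
        return np.array([window.mean(), window.std()])

device = VideoSensingDevice(height=64, width=64)
device.apply_control({"roi": (10, 50, 10, 50)})            # MMI steers attention
features = device.extract_features(np.random.rand(64, 64))
print(features)
```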
In Fig. 3, the sensing devices are shown in the same processor 316 with the MMI 317. In Fig. 4, an alternative embodiment is shown, where the sensing devices 416, 417, and 418 are housed separately from the MMI 417. The connections 409-414 that supply the candidate decisions are now external leads. Boxes 303-305 do feature extraction on the data received from the scene. The output of boxes 303-305 will be in the form of feature vectors per formula (3) from Fig. 5. Boxes 306-308 produce candidate lists in accordance with the invention. In the prior art, normally only a single decision was made based on the values of a discriminating function or functions. The field of discriminating functions is well- developed, for instance as described in K. Fukunaga, Introduction to Statistical Pattern Recognition (2d Ed., Academic Press, 10/99). If a single discriminating function was applied, it would typically have a number of local maxima. Then the decision would be the highest of those local maxima. If multiple discriminating functions were applied, or if a single discriminating function were applied repeatedly to various parts of the data, then the decision would be the highest value received from all the functions or applications of the function.
In the preferred embodiment, the discriminating functions will normally be probability distributions, denominated "P" herein. However, those of ordinary skill in the art will be able to devise other discriminating functions in accordance with the needs of whatever application area is chosen.
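The practical difference between the prior-art single decision and the candidate list described next can be sketched as follows. Treating the discriminating function as a probability distribution over labels follows the preferred embodiment, but the label names, numbers, and function names are invented for illustration.
```python
def single_decision(scores):
    """Prior art: keep only the label with the highest discriminating-function value."""
    return max(scores, key=scores.get)

def candidate_list(scores, m):
    """Candidate-level approach: keep the top-m labels, each paired with its
    confidence, so the MMI can still make use of the runners-up."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:m]

# Hypothetical values of a probability-based discriminating function P.
scores = {"person A": 0.48, "person B": 0.45, "person C": 0.07}
print(single_decision(scores))       # 'person A' -- the near-tie with 'person B' is lost
print(candidate_list(scores, m=2))   # [('person A', 0.48), ('person B', 0.45)]
```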
In accordance with the invention, it is desired to supply a candidate list from the sensing devices, on lines 310, 312, or 314. Each sensing device should produce a candidate list per formula (1) of Fig. 5, where:
  • * is a variable representing a candidate from a uni-modal sensing device
• k is an index variable numbering the uni-modal sensing devices
• i is an index variable
  • Mk is the number of candidates to be produced for uni-modal sensing device number k.
Fig. 6 is a flow-chart showing more of the operation of the individual recognition units 306-308 within the sensing devices. The labels of the flowchart make reference to the formula numbers from Fig. 5.
At 601, the list of multi-modal contextual information of Fig. 5, in the form of an initialized list of default values for formula (2), is received from the MMI on lines 313, 311, and 309. At 602, formula (5) is applied to get the candidates (1). Formula (5) expresses multiplication of the results of formula (2), received from the MMI, with a probability-based discriminating function, per formula (4). At 603, some criterion is evaluated. The criterion could be that some fixed number of iterations have been completed, or that no change in the candidate list (6) has been achieved since the last iteration, or any other suitable criterion devised by the skilled artisan. If the criterion is not met, then, at 604, the current list of candidate pairs per formula (6) is sent to the MMI 317, 417. The candidate pair list includes the candidates from formula (1) together with the confidence level from formula (4). The candidate pair list is an example of the term "characterization pairs" used elsewhere herein, and is provided to the MMI on lines 310, 312, and 314. At 606, new multi-modal contextual information is received from the MMI in the form of formula (2), based on the newly proposed candidate list, and control is returned to 602.
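Read as pseudocode, the loop of Fig. 6 multiplies the contextual weights received from the MMI by the device's own probability-based discriminating function and re-ranks its candidates. The sketch below follows that prose description only; the exact forms of formulas (2), (4), and (5) are given in Fig. 5 and are not reproduced here, so the renormalization step and all names are assumptions.
```python
def rerank_candidates(likelihood, context, m):
    """One pass of the Fig. 6 loop for a single uni-modal sensing device.

    likelihood: the device's own probabilities per label, in the spirit of formula (4)
    context:    multi-modal contextual weights received from the MMI, formula (2)
    Returns the top-m candidate pairs (label, confidence), i.e. a list like formula (6).
    """
    # Formula (5): multiply the MMI context by the uni-modal probability.
    combined = {label: context.get(label, 1.0) * p for label, p in likelihood.items()}
    total = sum(combined.values()) or 1.0
    combined = {label: v / total for label, v in combined.items()}  # assumed renormalization
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)[:m]

likelihood = {"person A": 0.48, "person B": 0.45, "person C": 0.07}
uniform = {"person A": 1.0, "person B": 1.0, "person C": 1.0}       # default values, step 601
print(rerank_candidates(likelihood, uniform, m=2))
# After the MMI starts favoring 'person B' (say, because the audio modality supports B):
print(rerank_candidates(likelihood, {"person A": 0.8, "person B": 1.2, "person C": 1.0}, m=2))
```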
If the criterion is met, then at 605 a final set of candidates in the form of formula (6) is sent to the MMI. The MMI 317, 417 in turn performs an evaluation of all the combinations of candidates from the uni-modal sensing devices. Fig. 8 shows a flowchart of the operation of the MMI.
At 801, the candidate pair lists, per formula (6), are received from the uni-modal sensing devices. Each uni-modal sensing device, k, produces a list of candidate pairs, per equation (6). At 802, a list of combinations of uni-modal candidates is formed as expressed in formula (7). The total number of combinations is L and the index numbering the combinations is c.
Each combination of candidates normally includes one uni-modal candidate from each of the uni-modal sensing devices. Each combination of uni-modal candidates is used to create a multi-modal characterization c of the scene. The multi-modal characterization may be the same as one of the characterizations (1) coming from the uni-modal sensing devices. Alternatively, the multi-modal characterization may characterize some combination pattern derived from the patterns recognized by the uni-modal devices. The multi-modal characterizations are analyzed according to a multi-modal discriminating function (8). This function evaluates the product of a) super-multi-modal contextual information P(c); and b) the product of all of the probabilities of the uni-modal decisions in the combination, per formula (4). Analogously with the uni-modal systems, the super-multi-modal contextual information P(c) will first be initialized to some default value. Advantageously, the value of P(c) can then be modified based on information received by the MMI from a higher level. This modified value will then be supplied as new super-multi-modal contextual information from the higher level.
Based on the analysis set forth in formula (8), super-candidates are chosen to be supplied from the MMI. These are a subset {c} of the possible combinations (7). The super-candidates will be provided as another list of characterization pairs. This time the characterization pairs will have the format of formula (9).
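Following the prose description of formulas (7) to (9), the MMI's scoring step can be sketched as enumerating one candidate per device, weighting each combination by the super-multi-modal contextual prior P(c) and the product of the uni-modal confidences, and keeping the best-scoring combinations as super-candidates. The reduction of a combination to a tuple of labels, and all names and numbers, are simplifying assumptions for illustration.
```python
from itertools import product

def score_combinations(candidate_lists, prior, top_n=2):
    """Form all combinations of uni-modal candidates (formula (7)) and rank them
    with a multi-modal discriminating function in the spirit of formula (8)."""
    supers = []
    for combo in product(*candidate_lists):            # one (label, confidence) per device
        labels = tuple(label for label, _ in combo)
        score = prior(labels)                          # super-multi-modal context P(c)
        for _, confidence in combo:
            score *= confidence                        # product of uni-modal probabilities (4)
        supers.append((labels, score))
    supers.sort(key=lambda pair: pair[1], reverse=True)
    return supers[:top_n]                              # super-candidate pairs, like formula (9)

face_candidates  = [("person A", 0.48), ("person B", 0.45)]
voice_candidates = [("person B", 0.55), ("person A", 0.40)]
uniform_prior = lambda labels: 1.0                     # default P(c) before higher-level feedback
print(score_combinations([face_candidates, voice_candidates], uniform_prior))
```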
Then, at 803, a criterion is tested. This criterion may be a number of iterations, lack of change of the output (2) since the last iteration, lack of change of the multi-modal candidate pairs (9) since the last iteration, or any other suitable criterion devised by the skilled artisan. If the criterion is not met, then the multi-modal contextual information, per formula (2), is sent to the individual uni-modal devices at 804. The values sent to the uni-modal devices will typically vary according to what type of data each device is gathering. Fig. 7 shows a system with a super-MMI 701. In this case, there are three MMI's 702-704, each of which corresponds to the MMI 317, 417 discussed before. Each MMI is coupled with a plurality of uni-modal sensing devices 705. The MMI's 702-704 send super-candidate lists, i.e. characterization pairs, per formula (9) via 707 to the super-MMI 701 and receive super-multi-modal contextual information P(c) via 706 from the super-MMI 701. The super-MMI may produce further characterization pairs at 708, and can therefore be part of a super-super-MMI system, with another level of hierarchy. The super-MMI 701 operates analogously to the MMI, treating the MMI's the way an MMI treats uni-modal sensing devices.
In Fig. 7 there are three MMI's (702-704), each with three uni-modal sensing devices (705). However, those of ordinary skill in the art will appreciate that there could be other numbers of components. For instance, the super-MMI might be coupled with at least one MMI and at least one free-standing uni-modal sensing device. Alternatively, there might be two MMI's, each being fed by two uni-modal sensing devices, and so forth.
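As a usage example of the score_combinations sketch above, the same scoring can be applied one level up: a super-MMI consumes each MMI's super-candidate list exactly as an MMI consumes uni-modal candidate lists. The lists and values below are invented for illustration.
```python
# Each MMI's output is itself a list of (characterization, confidence) pairs, per formula (9).
mmi_one = [(("person A", "person B"), 0.26), (("person B", "person B"), 0.25)]
mmi_two = [(("person B",), 0.60), (("person A",), 0.30)]
super_candidates = score_combinations([mmi_one, mmi_two], lambda labels: 1.0)
print(super_candidates)
# The super-MMI could, in turn, feed new super-multi-modal contextual information P(c)
# back down to each MMI, just as the MMI feeds formula (2) to the uni-modal devices.
```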
From reading the present disclosure, other modifications will be apparent to persons skilled in the art. Such modifications may involve other features that are already known in the design, manufacture and use of recognition of sensed data and which may be used instead of or in addition to features already described herein. Although claims have been formulated in this application to particular combinations of features, it should be understood that the scope of the disclosure of the present application also includes any novel feature or novel combination of features disclosed herein either explicitly or implicitly or any generalization thereof, whether or not it mitigates any or all of the same technical problems as does the present invention. The applicants hereby give notice that new claims may be formulated to such features during the prosecution of the present application or any further application derived therefrom.
The word "comprising", "comprise", or "comprises" as used herein should not be viewed as excluding additional elements. The singular article "a" or "an" as used herein should not be viewed as excluding a plurality of elements.

Claims

CLAIMS:
1. A multi-modal integration unit (317, 417, 702, 703, 704) comprising:
- means for receiving (310, 312, 314, 410, 412, 414), from each of a plurality of sensing devices (416-418, 303-308), a respective set of characterization pairs ((6)), each characterization pair comprising a respective candidate characterization ((1)) and a respective confidence ((4)) indication relating to the respective candidate characterization, which characterization pairs result from pre-processing in the sensing devices;
- means for processing the sets of characterization pairs and to supply (315) at least one final characterization ((9)) of signals received at the sensing devices, which final characterization is chosen from at least one of the characterization pairs.
2. A data processing system, comprising:
- a plurality of sensing devices that together include:
- at least one sensing element (301, 301', 302) adapted to receive input signals representing physical reality; and
- a plurality of processors or processes (303-308, 416-418) adapted to supply respective characterizing signals for each sensing device characterizing said input signals, the respective characterizing signals from each sensing device comprising a set of characterization pairs, each characterization pair comprising a respective candidate characterization and a respective confidence indication relating to the respective candidate characterization;
- a multi-modal integration unit as claimed in claim 1.
3. The system of claim 2, wherein the multi-modal integration unit employs a discriminating function to process the sets of characterization pairs.
4. The system of claim 3, wherein the discriminating function is a probability distribution.
5. A super-system comprising:
- at least one system as claimed in claim 2, wherein the at least one final characterization for each system comprises a respective further set of characterization pairs ((9));
- if there is only one such system, then at least one further uni-modal sensing device; and
- at least one super-multi-modal integration unit (701) adapted
- to receive and process the further sets of characterization pairs along with signals from the at least one further sensing device and
- to supply at least one super-final characterization (708) of the signals, which super-final characterization is chosen from at least one characterization pair from the further set of characterization pairs.
6. A multi-modal integration unit (702-704, 317, 417) comprising
- means for
- receiving (314, 312, 310, 410, 412, 414) respective candidate characterizing signals ((1), (6)) from each of a plurality of sensing devices (303-308, 416-418), which sensing devices comprise pre-processing capability, which candidate characterizing signals characterize a physical reality; and
- supplying (409, 411, 413, 309, 311, 313) at least one control signal to the sensing devices for controlling processing and/or sensing therein; and
- means for processing the candidate characterizing signals to derive therefrom at least one final characterizing signal and the at least one control signal.
7. A data processing system comprising:
- a plurality of sensing devices each comprising:
- at least one sensing element (301, 301', 302) adapted to receive input signals representing physical reality; and
- at least one processor or process (303-308,416-418) adapted
- to provide (414, 412, 410) at least one respective candidate characterizing signal characterizing said physical reality based on the input signals; and
- to receive (413, 411, 409) control signals for controlling processing and/or sensing; and
- the multi-modal integration unit of claim 6.
8. The system of claim 7, wherein the control signals relate to biasing a selection of physical reality.
9. The system of claim 7 wherein the control signals are in the form of feedback to the sensing devices from the multi-modal integration unit.
10. The system of claim 7, wherein the respective candidate characterization signal includes a respective candidate list ((1)) from each sensing device.
11. The system of claim 10, wherein each respective candidate list comprises a set of characterization pairs ((6)), each characterization pair comprising a respective candidate characterization ((1)) and a respective confidence indication ((4)) relating to the respective candidate characterization.
12. The system of claim 7, wherein the multi-modal integration unit employs a discriminating function to process the respective candidate characterization signals.
13. The system of claim 12, wherein the discriminating function is a probability distribution.
14. A super-system comprising:
- at least one system as claimed in claim 12,
- if there is only one system, then at least one further uni-modal sensing device; and
- at least one super-multi-modal integration unit (701) adapted
- to receive and process the at least one final characterizing signal ((9), 707) from the at least one system and any signals from any further uni-modal sensing device, and
- to derive therefrom at least one super-final characterizing signal (701).
15. A sensing device suitable for use in a multi-modal integration system as claimed in claim 7, the sensing device comprising:
- coupling means
- for receiving signals representing physical reality from at least one sensing element (301, 301', 302); and
- for communicating bi-directionally (409-414, 309-314) with a multi-modal integration unit; and
- at least one processor or process (303-308, 416-418) adapted
- to receive control signals (409, 411, 413, 309, 311, 313) from a multi-modal integration unit (317, 417, 702-704) for controlling processing and/or sensing; and
- responsive to the control signals, to provide signals (410, 412, 414, 310, 312, 314) representing a list of candidate characterizations of said physical reality to the multi-modal integration unit.
16. The device of claim 15, wherein the data representing physical reality comprises video data.
17. The device of claim 16, wherein the control signals bias the video data to a portion of a field of view.
18. A method for training a data processing system, comprising executing the following operations in at least one data processing device:
- entering a training phase, comprising:
- receiving candidate characterization signals from a plurality of previously trained sensing devices, which devices include trained processors, which candidate characterization signals are derived from an initial physical reality setting;
- retrieving signals representing ground truths about the physical reality; and
- tuning training parameters to achieve the ground truths by evaluating optimization criteria and the candidate characterization signals; and
- after completion of the training phase, entering a normal operation phase including:
- receiving further candidate characterization signals from the plurality of previously trained sensing devices;
- creating a tentative final characterization signal;
- feeding back at least one control signal to at least one of the sensing devices, which control signal is adapted to cause a change in training and/or performance of the at least one of the sensing devices; and
- repeating the receiving of further candidate characterization signals, creating, and feeding back until a characterization criterion is met.
PCT/EP2001/013414 2000-11-22 2001-11-16 Candidate level multi-modal integration system WO2002042242A2 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
KR1020027009315A KR20020070491A (en) 2000-11-22 2001-11-16 Candidate level multi-modal integration system
EP01989488A EP1340187A2 (en) 2000-11-22 2001-11-16 Candidate level multi-modal integration system
JP2002544381A JP2004514970A (en) 2000-11-22 2001-11-16 Candidate level multimodal integration system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US71825500A 2000-11-22 2000-11-22
US09/718,255 2000-11-22

Publications (2)

Publication Number Publication Date
WO2002042242A2 (en) 2002-05-30
WO2002042242A3 WO2002042242A3 (en) 2002-11-28

Family

ID=24885400

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2001/013414 WO2002042242A2 (en) 2000-11-22 2001-11-16 Candidate level multi-modal integration system

Country Status (4)

Country Link
EP (1) EP1340187A2 (en)
JP (1) JP2004514970A (en)
KR (1) KR20020070491A (en)
WO (1) WO2002042242A2 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5586215A (en) * 1992-05-26 1996-12-17 Ricoh Corporation Neural network acoustic and visual speech recognition system
US6009199A (en) * 1996-07-12 1999-12-28 Lucent Technologies Inc. Classification technique using random decision forests
EP0921509A2 (en) * 1997-10-16 1999-06-09 Navigation Technologies Corporation System and method for updating, enhancing or refining a geographic database using feedback

Non-Patent Citations (10)

* Cited by examiner, † Cited by third party
Title
"DECODING OF A CONSISTENT MESSAGE USING BOTH SPEECH AND HANDWRITING RECOGNITION" IBM TECHNICAL DISCLOSURE BULLETIN, IBM CORP. NEW YORK, US, vol. 36, no. 1, 1993, pages 415-418, XP000333898 ISSN: 0018-8689 *
A. JAIN ET AL.: "A Multimodal Biometric System using Fingerprints, Face and Speech" 2ND INT. CONF. ON AUDIO- AND VIDEO-BASED BIOMETRIC PERSON AUTHENTICATION, [Online] 23 - 24 March 1999, pages 182-187, XP002211742 Washington D. C., USA Retrieved from the Internet: <URL:http://www.cse.msu.edu/publications/tech/TR/MSU-CPS-98-32.ps.Z> [retrieved on 2002-08-29] cited in the application *
A. ROSS ET AL: "Information Fusion in Biometrics" PROC. INT. CONF. ON AUDIO- AND VIDEO-BASED PERSON AUTHENTICATION (AVBPA), [Online] 6 June 2001 (2001-06-06) - 8 June 2001 (2001-06-08), pages 354-359, XP002211743 Halmstad, Sweden Retrieved from the Internet: <URL:http://www.cse.msu.edu/publications/tech/TR/MSU-CSE-01-18.ps> [retrieved on 2002-08-29] *
AALO V A ET AL: "Multilevel quantisation and fusion scheme for the decentralised detection of an unknown signal" IEE PROCEEDINGS: RADAR, SONAR & NAVIGATION, INSTITUTION OF ELECTRICAL ENGINEERS, GB, vol. 141, no. 1, 1 February 1994 (1994-02-01), pages 37-44, XP006002055 ISSN: 1350-2395 *
BORGHYS D ET AL: "MULTILEVEL DATA FUSION FOR THE DETECTION OF TARGETS USING MULTISPECTRAL IMAGE SEQUENCES" OPTICAL ENGINEERING, SOC. OF PHOTO-OPTICAL INSTRUMENTATION ENGINEERS. BELLINGHAM, US, vol. 37, no. 2, 1 February 1998 (1998-02-01), pages 477-484, XP000742662 ISSN: 0091-3286 *
DASARATHY B V: "FUSION STRATEGIES FOR ENHANCING DECISION RELIABILITY IN MULTISENSOR ENVIRONMENTS" OPTICAL ENGINEERING, SOC. OF PHOTO-OPTICAL INSTRUMENTATION ENGINEERS. BELLINGHAM, US, vol. 35, no. 3, 1 March 1996 (1996-03-01), pages 603-616, XP000597449 ISSN: 0091-3286 *
MCCULLOUGH C L ET AL: "Multi-level sensor fusion for improved target discrimination" DECISION AND CONTROL, 1996., PROCEEDINGS OF THE 35TH IEEE CONFERENCE ON KOBE, JAPAN 11-13 DEC. 1996, NEW YORK, NY, USA,IEEE, US, 11 December 1996 (1996-12-11), pages 3674-3675, XP010214092 ISBN: 0-7803-3590-2 *
PAVLOVIC V ET AL: "MULTIMODAL SPEAKER DETECTION USING ERROR FEEDBACK DYNAMIC BAYESIAN NETWORKS" PROCEEDINGS 2000 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION. CVPR 2000. HILTON HEAD ISLAND, SC, JUNE 13-15, 2000, PROCEEDINGS OF THE IEEE COMPUTER CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, LOS ALAMITOS, CA: IEEE COMP. SOC, vol. 2 OF 2, 13 June 2000 (2000-06-13), pages 34-41, XP001035625 ISBN: 0-7803-6527-5 *
SERPICO S B ET AL: "Structured neural networks for the classification of multisensor remote-sensing images" GEOSCIENCE AND REMOTE SENSING SYMPOSIUM, 1993. IGARSS '93. BETTER UNDERSTANDING OF EARTH ENVIRONMENT., INTERNATIONAL TOKYO, JAPAN 18-21 AUG. 1993, NEW YORK, NY, USA,IEEE, 18 August 1993 (1993-08-18), pages 907-909, XP010114470 ISBN: 0-7803-1240-6 *
TSE MIN CHEN ET AL: "A generalized look-ahead method for adaptive multiple sequential data fusion and decision making" MULTISENSOR FUSION AND INTEGRATION FOR INTELLIGENT SYSTEMS, 1999. MFI '99. PROCEEDINGS. 1999 IEEE/SICE/RSJ INTERNATIONAL CONFERENCE ON TAIPEI, TAIWAN 15-18 AUG. 1999, PISCATAWAY, NJ, USA,IEEE, US, 15 August 1999 (1999-08-15), pages 199-204, XP010366571 ISBN: 0-7803-5801-5 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007072425A2 (en) 2005-12-20 2007-06-28 Koninklijke Philips Electronics, N.V. Device for detecting and warning of a medical condition

Also Published As

Publication number Publication date
WO2002042242A3 (en) 2002-11-28
KR20020070491A (en) 2002-09-09
JP2004514970A (en) 2004-05-20
EP1340187A2 (en) 2003-09-03

Similar Documents

Publication Publication Date Title
Fierrez et al. Multiple classifiers in biometrics. Part 2: Trends and challenges
Al-Jarrah et al. Recognition of gestures in Arabic sign language using neuro-fuzzy systems
Chatzis et al. Multimodal decision-level fusion for person authentication
Raheja et al. Robust gesture recognition using Kinect: A comparison between DTW and HMM
Lee et al. Kinect-based Taiwanese sign-language recognition system
Kumar et al. Face and gait biometrics authentication system based on simplified deep neural networks
CN114764869A (en) Multi-object detection with single detection per object
CN115294658A (en) Personalized gesture recognition system and gesture recognition method for multiple application scenes
Li et al. Adaptive deep feature fusion for continuous authentication with data augmentation
El-Henawy et al. Online signature verification: state of the art
JP3998628B2 (en) Pattern recognition apparatus and method
Huang et al. Multimodal finger recognition based on asymmetric networks with fused similarity
Hiremath et al. Human age and gender prediction using machine learning algorithm
Bature et al. Boosted gaze gesture recognition using underlying head orientation sequence
Borgelt Objective functions for fuzzy clustering
Darwish et al. Hand gesture recognition for sign language: a new higher order fuzzy HMM approach
Sharma et al. Multimodal classification using feature level fusion and SVM
Khalifa et al. Multimodal biometric authentication using choquet integral and genetic algorithm
Gornale et al. Multimodal Biometrics Data Analysis for Gender Estimation Using Deep Learning
García et al. Dynamic facial landmarking selection for emotion recognition using Gaussian processes
WO2002042242A2 (en) Candidate level multi-modal integration system
Bodyanskiy et al. Kernel fuzzy kohonen’s clustering neural network and it’s recursive learning
Singh Review on multibiometrics: classifications, normalization and fusion levels
Li et al. Cross-people mobile-phone based airwriting character recognition
Nayak et al. Multimodal biometric face and fingerprint recognition using neural network

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): JP KR

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR

WWE Wipo information: entry into national phase

Ref document number: 2001989488

Country of ref document: EP

ENP Entry into the national phase

Ref country code: JP

Ref document number: 2002 544381

Kind code of ref document: A

Format of ref document f/p: F

WWE Wipo information: entry into national phase

Ref document number: 1020027009315

Country of ref document: KR

WWP Wipo information: published in national office

Ref document number: 1020027009315

Country of ref document: KR

121 Ep: the epo has been informed by wipo that ep was designated in this application
AK Designated states

Kind code of ref document: A3

Designated state(s): JP KR

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR

WWP Wipo information: published in national office

Ref document number: 2001989488

Country of ref document: EP

WWW Wipo information: withdrawn in national office

Ref document number: 2001989488

Country of ref document: EP