CN112861726A - D-S evidence theory multi-mode fusion man-machine interaction method based on rule intention voter - Google Patents

D-S evidence theory multi-mode fusion man-machine interaction method based on rule intention voter

Info

Publication number
CN112861726A
CN112861726A (application number CN202110179052.XA)
Authority
CN
China
Prior art keywords
label
voice
robot
intention
voter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110179052.XA
Other languages
Chinese (zh)
Inventor
李秀智
王珩
张祥银
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202110179052.XA priority Critical patent/CN112861726A/en
Publication of CN112861726A publication Critical patent/CN112861726A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Acoustics & Sound (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Signal Processing (AREA)
  • Manipulator (AREA)

Abstract

The invention discloses a D-S evidence theory multi-modal fusion human-computer interaction method based on a rule intention voter. The robot auditory system collects audio information, adjusts the robot's posture and performs hardware noise reduction; the vision system detects and recognizes dynamic gestures with a two-layer network and classifies the gesture actions. A fully connected layer is added to the speech and gesture recognition networks, and the robot's understanding of the interacting object's intention is output. The two modalities carry out the human-computer interaction dialogue in parallel and assist each other, so more information can be received and accurate intention understanding achieved; vision and hearing are modalities people accept readily, and the interaction mechanism is improved accordingly. The judgments output by the different modalities are matched against the current information input. The fusion result emphasizes the relations among deep-level information, resolves fusion across multiple modalities, can cope with evidence conflicts between different modalities, and focuses on a single result within the label set, which makes it better suited to human-computer interaction.

Description

D-S evidence theory multi-mode fusion man-machine interaction method based on rule intention voter
Technical Field
The present invention relates to the field of human-robot interaction (HRI) technology and multimodal fusion. Specifically: the robot auditory system determines the sound-source direction with the MUSIC algorithm and, after MFCC speech-feature preprocessing, recognizes the speech result with an end-to-end gated CNN; the vision system detects and recognizes dynamic gestures with a two-layer network, processes the temporal information with a deep CNN framework combining 3D CNN and LSTM, and classifies the gesture actions. A fully connected layer is added to the speech and gesture recognition networks, normalization is applied, the different modalities are fused with a D-S evidence theory algorithm based on the rule intention voter, and the robot's understanding of the interacting object's intention is output.
Background
Human-robot interaction is a core problem of service-robot research, and the perception modality plays a fundamental role in the communication between a human and a robot; people can interact with robots through gestures, speech, body posture, facial expression, touch and other modalities. Existing interaction modes fall into device-based interaction, single-modality interaction and multimodal interaction. Device-based interaction conveys information to the robot by having the interacting person wear an information-collecting device, but it limits the flexibility and comfort of the interaction. Human-robot interaction in a single modality can be disturbed by the surrounding environment, and the interacting person's own behaviour can limit the accuracy of robot recognition; moreover, a single perception modality restricts the diversity of content, making the interaction process monotonous and tedious and reducing comfort.
Multimodal interaction is the main current research direction. Information from different modalities is both redundant and complementary. Existing fusion schemes include probability-theoretic methods, neural-network methods and others; most of them focus on the complementarity between modalities and cannot handle uncertain information by resolving conflicts.
Evidence theory, also known as the Dempster rule of evidence combination, solves the multi-valued mapping problem with upper and lower probability bounds. It can directly express "uncertain" and "unknown" and is widely used in expert systems, information fusion and other fields.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a multimodal human-computer interaction mechanism based on sound perception and vision, which adopts a D-S evidence theory fusion algorithm based on a rule intention voter:
(1) The audio signal is preprocessed with the MFCC technique to extract speech features, which are sent to a trained end-to-end CNN network architecture for speech recognition and classification.
(2) A two-layer network structure is adopted to recognize dynamic gestures, consisting of two network modules. Detector: a lightweight CNN architecture that runs in real time to detect gestures. Classifier: a deep CNN architecture. A sliding-window approach is used on the input video stream, the detector queue always runs ahead of the classifier queue, and the classifier is activated only when gesture information is detected.
(3) A fully connected layer is added to the network outputs of the auditory and visual channels to output the confidence of each label. The results of the two networks are passed to the rule-based intention voter to judge the relation between the modalities. There are four cases in total: no modal response, single modality, the two modalities complementing each other, and the two modalities conflicting. In the first three cases a directed result is output; when the two modalities conflict, the conflict is resolved through an improved D-S evidence theory and the correct result is output according to the correlations.
The method comprises the following specific steps:
step1 speech recognition
Step1.1 The speech acquisition device of the invention is a six-microphone annular array, which adds spatial-domain and time-domain attributes to the audio input; while judging the azimuth of the speaking object it also performs hardware noise reduction and strengthens the speech input signal. The azimuth is determined with a high-resolution spectral estimation method. Let the spacing between microphones be d, the wavelength of the signal in space be λ, the wavefront of the k-th source signal arriving at the m-th microphone be f_k(t), and the noise received by the m-th microphone be n_m(t) (m = 1, 2, …, M). The signal received by the m-th microphone is expressed as
x_m(t) = Σ_{k=1}^{K} a_{mk} f_k(t) + n_m(t),
where θ_k is the direction of the k-th signal source and a_{mk} is the response of the m-th microphone to signal k, determined by the array geometry, d, λ and θ_k. Written in vector form, the microphone-array observation is X(t) = A F(t) + N(t). Because the noise at the different microphones is uncorrelated, the covariance matrix of the received data X(t) is S = E{X(t) X^H(t)}, where ^H denotes the conjugate transpose. The signal obtained by the microphone array consists of the source signals and noise; the subspace spanned by the eigenvectors of the smallest eigenvalues is called the noise subspace, and its orthogonal complement is the signal subspace:
span{v_{K+1}, v_{K+2}, …, v_M} ⊥ span{a(θ_1), …, a(θ_K)},
so that for a signal from direction θ_k, a^H(θ_k) v_i = 0 (i = K+1, …, M). Using the orthogonality between the signal subspace and the noise subspace, the spatial spectrum function is constructed and a spectral-peak search is performed:
P(θ) = 1 / ( a^H(θ) E_n E_n^H a(θ) ),   with E_n = [v_{K+1}, …, v_M].
The θ corresponding to the maximum of the spectral function is the estimate of the signal-source direction, i.e. the sound-source localization result. Initially the six microphones are weighted equally: V_x = α_0·x_0 + α_1·x_1 + … + α_5·x_5, where V_x is the total output audio signal of the microphone array, α_i is the weight of each microphone, α_0 + α_1 + … + α_5 = 1, the microphones differ only in their relative spatial positions, which correspond to the timing relations of the audio input signals, and α_0 = α_1 = … = α_5. Once the correlation matrix of the microphones has been analysed to determine the sound-source direction {θ}, the weight of the microphone x_i facing the sound-source direction is increased and the audio signals from the other directions are suppressed, with the weights still summing to one.
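As an illustration of the localization step, the following NumPy sketch computes the MUSIC pseudo-spectrum from microphone snapshots; the uniform-linear steering model, the function name and all numerical parameters are assumptions for demonstration rather than the patent's configuration (the patent uses a six-microphone annular array).

```python
import numpy as np

def music_spectrum(X, n_sources, d, wavelength, angles_deg):
    """MUSIC pseudo-spectrum for an M-element array.

    X: complex array of shape (M, T) with one row of snapshots per microphone.
    Returns P(theta) evaluated at angles_deg; its peak estimates the source azimuth.
    """
    M, T = X.shape
    S = (X @ X.conj().T) / T                      # sample covariance S = E{X X^H}
    eigvals, eigvecs = np.linalg.eigh(S)          # eigenvalues in ascending order
    En = eigvecs[:, :M - n_sources]               # noise subspace (smallest eigenvalues)
    P = []
    for theta in np.deg2rad(angles_deg):
        # steering vector under a uniform-linear approximation of the array geometry
        a = np.exp(-2j * np.pi * d * np.arange(M) * np.sin(theta) / wavelength)
        P.append(1.0 / np.abs(a.conj() @ En @ En.conj().T @ a))
    return np.asarray(P)

# Example use: theta_hat = angles[np.argmax(music_spectrum(X, 1, d, lam, angles))]
```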
Step1.2 In the invention, Mel-frequency cepstral coefficients (MFCC) are used to filter the speech input signal and reduce the influence of noise. The power-normalized spectrogram of the audio obtained through preprocessing, framing, windowing and fast Fourier transform, then filtered by a bank of triangular band-pass filters, serves as the input of the speech recognition network model; the signal energy output by each band-pass filter is used as a basic feature of the signal and fed into the speech recognition network.
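A minimal sketch of this preprocessing chain, assuming the librosa library (not named in the patent); the FFT size, hop length, window choice and filter-bank sizes below are illustrative values, not the patent's settings.

```python
import librosa
import numpy as np

def speech_features(wav_path, sr=16000, n_mels=40, n_mfcc=13):
    """Framing, windowing, FFT, triangular (mel) band-pass filtering and MFCC."""
    y, sr = librosa.load(wav_path, sr=sr)
    # power spectrogram: framing + Hamming window + FFT
    S = np.abs(librosa.stft(y, n_fft=512, hop_length=160, win_length=400,
                            window="hamming")) ** 2
    # triangular mel filter-bank energies (the per-band energies fed to the network)
    mel = librosa.feature.melspectrogram(S=S, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)
    # cepstral coefficients derived from the filter-bank output
    mfcc = librosa.feature.mfcc(S=log_mel, sr=sr, n_mfcc=n_mfcc)
    return log_mel, mfcc
```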
Step1.3 The invention focuses on the speed of speech recognition and designs an end-to-end network architecture based entirely on CNN, following Wav2letter, with 12 convolutional layers in total. The first layer of the model extracts key features from the MFCC-filtered speech and can be regarded as a nonlinear convolution, with a kernel width of 31280 and a stride of 320. A gated linear unit (GLU) is used as the activation function and the loss function is set to connectionist temporal classification (CTC), so no prior alignment of the speech data is required when the model makes predictions.
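The following PyTorch sketch illustrates the kind of all-convolutional, GLU-activated recognizer trained with CTC described above; the number of layers, channel widths and kernel sizes are placeholders rather than the patent's exact architecture.

```python
import torch
import torch.nn as nn

class GatedConvSpeechNet(nn.Module):
    """Wav2letter-style all-convolutional recognizer with GLU activations and CTC."""
    def __init__(self, n_features=40, n_classes=6):      # e.g. 5 command labels + CTC blank
        super().__init__()
        layers, channels = [], [n_features] + [200] * 5
        for cin, cout in zip(channels[:-1], channels[1:]):
            # each conv emits 2*cout channels; GLU gates them back down to cout
            layers += [nn.Conv1d(cin, 2 * cout, kernel_size=13, padding=6), nn.GLU(dim=1)]
        layers.append(nn.Conv1d(channels[-1], n_classes, kernel_size=1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):                    # x: (batch, n_features, time)
        logits = self.net(x)                 # (batch, n_classes, time)
        return logits.permute(2, 0, 1).log_softmax(dim=-1)   # (time, batch, classes)

ctc_loss = nn.CTCLoss(blank=0)               # CTC removes the need to pre-align audio and labels
```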
Step2 dynamic gesture recognition
Step2.1 A two-layer network structure is used to recognize dynamic gestures, consisting of two modules. Detector: a lightweight CNN architecture that runs in real time to detect gestures. Classifier: a deep CNN architecture. A sliding-window method is used on the input video stream; the detector queue always runs ahead of the classifier queue so that no gesture information is missed, the stride is s = 1, and the classifier is activated only when gesture information is detected.
Step2.2 The detector is structured as a ResNet, and the raw probabilities it predicts are added to a queue of length k (q_k); the queue size k is chosen to be 4, and these raw values are median-filtered to obtain the optimal result. The classifier is composed of 3D CNN and LSTM networks.
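A small sketch of the detector-side gating logic described above (length-4 queue, median filtering, classifier activation); the class name and the activation threshold are assumptions introduced for illustration.

```python
from collections import deque
import numpy as np

class GestureGate:
    """Detector-side gate: keep the detector's raw gesture probabilities in a
    length-k queue, median-filter them, and wake the classifier on a threshold."""
    def __init__(self, k=4, threshold=0.5):     # k = 4 as in the text; threshold assumed
        self.queue = deque(maxlen=k)
        self.threshold = threshold

    def update(self, p_gesture):
        self.queue.append(p_gesture)
        filtered = float(np.median(self.queue))  # median filter suppresses single-frame spikes
        return filtered >= self.threshold        # True -> activate the classifier

# gate = GestureGate()
# for p in detector_probs:        # sliding window over the video stream, stride s = 1
#     if gate.update(p):
#         run_classifier(clip)    # the heavier classifier (3D CNN + LSTM) runs only when gated on
```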
Step3 multimodal fusion strategy
Step3.1 A fully connected layer is added to the network outputs of the auditory and visual channels, and normalization is applied to obtain the confidences of all labels, {label(i), Con(V_i)} and {label(j), Con(A_j)} (i, j = 0, 1, …, n). The invention designs a rule-based intention voter that focuses on the links between the different modalities, including complementarity and conflict. The results of the two networks are output to the rule-based intention voter, where each data group in T contains a label and the corresponding gesture or speech confidence:
S = {label(i, j), Con(V_i), Con(A_j)}  (i, j = 0, 1, …, n),
where T is the container designed to store the data, and the output results of the two modalities under the same label correspond to each other.
Step3.2 Upper and lower thresholds ULN and UCL are set, which characterize the strength of the model's prediction for an event; they are set to 80% and 20% respectively. A flag bit is set to indicate the information relation between the two modalities; the intention voter performs a logical operation on the current prediction results of the two modalities and outputs a flag value representing the relation between them. There are four cases: no modal response (flag = 0), single modality (flag = 1), bimodal complementarity (flag = 2) and conflict between the two modalities (flag = 3).
Step3.2.1 When there is no modal response, i.e. neither the visual nor the auditory system detects an input signal of a corresponding template, only label(0) has a definite value and the output confidences of all other labels are below UCL; the robot then takes no action:
Con(V_i) < UCL and Con(A_j) < UCL for all i, j ≠ 0  →  flag = 0.
the single mode of Step3.2.2 means that only one mode exists, visual or auditory action, and the other mode does not detect an input interaction signal, and at the moment, the robot operates to a single mode mechanism, and an output result is a recognition result of the mode:
Figure BDA0002940921350000041
Step3.2.3 Bimodal complementarity is the most common multimodal situation: vision and hearing simultaneously recognize and detect input signals of the same label, which effectively reinforces the understanding of the interacting object's intention:
max_i Con(V_i) ≥ ULN, max_j Con(A_j) ≥ ULN and argmax_i Con(V_i) = argmax_j Con(A_j)  →  flag = 2.
step3.2.4 conflict between the two modes, which is an uncertain event of seed production by a multi-mode fusion machine, and the prediction results of the two modes are different label values.
Figure BDA0002940921350000043
Step3.3 The current working mode of the robot and the relation between the two modalities are judged from the current value of the flag bit. When flag = 0 there is no input, or the input signal is outside the robot's range of understanding, and the robot does not act. When flag = 1 or flag = 2 the information in the single-modality or multimodal working mode is complementary, and the intention voter outputs a unique, definite value. When flag = 3 the robot is working in the multimodal mechanism and the recognition results of the different modalities conflict; for this situation the invention introduces D-S evidence theory and improves it to better suit the multimodal conflict problem of human-computer interaction.
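The flag logic of the intention voter can be sketched as follows; the function signature and the handling of borderline confidences are assumptions, and only the four flag values and the two thresholds come from the text.

```python
def intention_vote(con_v, con_a, ULN=0.8, UCL=0.2):
    """Rule-based intention voter: con_v / con_a are the per-label confidences of the
    vision and speech networks (index 0 = 'no command'). Returns (flag, label or None)."""
    i = max(range(1, len(con_v)), key=lambda k: con_v[k])    # best visual label
    j = max(range(1, len(con_a)), key=lambda k: con_a[k])    # best auditory label

    if con_v[i] < UCL and con_a[j] < UCL:
        return 0, None                      # no modal response: robot does nothing
    v_on, a_on = con_v[i] >= ULN, con_a[j] >= ULN
    if v_on and not a_on:
        return 1, i                         # single modality: vision only
    if a_on and not v_on:
        return 1, j                         # single modality: speech only
    if v_on and a_on and i == j:
        return 2, i                         # bimodal complement: same label
    # remaining cases (different labels, or borderline confidences) go to D-S fusion
    return 3, None
```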
Step3.4 The labels set above are mutually independent and satisfy the prior condition of D-S evidence theory; all values of label form the frame of discernment Θ. When an uncertain event occurs, the normalized outputs of vision and hearing serve as the basic probability assignment (BPA for short), i.e. {Con(V_i), Con(A_j)} is converted into {m_V(i), m_A(j)}. For any pair of independent labels {label(i), label(j)} whose prediction probabilities differ only slightly, satisfying |Con(i) − Con(j)| ≤ ε = 0.2 (i, j = 1, 2, …, n), a union element label(i, j+1) is added, and the union of the initial labels {label(1), label(2), …, label(n)} is also added, so that the total number of label categories satisfies n < sum(label) < 2^n. In this way the Zadeh paradox is effectively mitigated without affecting the prediction result. The BPA on the frame of discernment Θ satisfies
m(∅) = 0  and  Σ_{A⊆Θ} m(A) = 1.
Any A with m(A) > 0 is called a focal element. Evidence theory introduces belief functions to study the union, intersection, complement and inclusion of evidence from the viewpoint of set theory. The belief function Bel and the plausibility function Pl based on the BPA m on the frame Θ are
Bel(A) = Σ_{B⊆A} m(B),   Pl(A) = Σ_{B∩A≠∅} m(B).
The belief function Bel(A) and the plausibility function Pl(A) form the belief interval [Bel(A), Pl(A)], which expresses the degree of certainty about a hypothesis. The key part of evidence theory is the evidence combination formula; the same frame may carry several basic probability assignments from different data sources (two are used here because visual and auditory modalities are fused). The combination formula takes the orthogonal sum of several basic probability assignments; the combination rule is
m(A) = (1 / (1 − K)) Σ_{B∩C=A} m_1(B) m_2(C)  for A ≠ ∅,   m(∅) = 0,
where K is the normalization constant expressing the degree of conflict between the different pieces of evidence:
K = Σ_{B∩C=∅} m_1(B) m_2(C).
In traditional D-S evidence theory, the computation explodes exponentially when the number of labels is large; moreover, to make the fusion algorithm suitable for human-robot interaction, only a single instruction may be output to the robot, and the robot must not be confused by an output instruction set. The Dempster combination rule is therefore improved. Since the Dempster combination formula degenerates to the Bayes formula, only the single elements of the frame of discernment are considered during combination, the subsets containing several hypotheses are ignored, and the single elements are updated with the Bayesian approximation of the mass function:
m̂({θ_i}) = Σ_{θ_i ∈ A} m(A) / Σ_{B⊆Θ} |B| m(B),   m̂(A) = 0 for |A| > 1.
All other hypotheses are set directly to 0; after the mass-function values have been updated, the combination result is computed with the Dempster rule, and the fused output probabilities of all labels are output to guide the robot's actions.
Compared with the prior art, the invention has the following advantages:
(1) The invention starts from the robot's vision and hearing; the two modalities carry out the human-computer interaction dialogue in parallel and assist each other, which overcomes the limitation of a single modality and allows more information to be received and accurate intention understanding to be made. Vision and hearing are more easily accepted by people, and the improvement of the interaction mechanism can significantly increase the comfort of the interaction process.
(2) Addressing the redundancy and complementarity of information from different modalities in the multimodal fusion process, the invention designs the rule-based intention voter, which matches the judgments output by the different modalities one-to-one against the current information input and uses the flag bit to indicate the relation between them; if they conflict, information fusion is performed with the improved D-S evidence theory. The fusion result emphasizes the relations among deep-level information, handles the fusion of multiple modalities well, can cope with evidence conflicts between different modalities, and focuses on a single result within the label set, which makes it better suited to human-computer interaction.
Drawings
FIG. 1 is a flow chart of a multi-modal human-machine interaction technology based on vision and hearing;
FIG. 2 is a diagram of an end-to-end voice recognition network architecture;
FIG. 3 is a flow chart of a pre-processing and attention mechanism of a speech signal;
FIG. 4 is a diagram of a dynamic gesture recognition architecture formed by 3D CNN and LSTM;
FIG. 5 is a flow chart of the present invention.
Detailed Description
The specific experiments of the invention are carried out on a robot platform developed in-house by the laboratory, equipped with a six-microphone annular array and a depth camera. The upper computer is an NVIDIA TX2 high-performance processor, the operating system is Ubuntu, and a conda environment is used to configure the deep CNNs under the PyTorch framework to complete the speech and gesture recognition tasks. The whole program runs under the distributed robot control framework ROS, and the experimental scene is indoors. The embodiment of the present invention is described in detail with reference to FIG. 1 and FIG. 5.
Step1 Pre-training and fine-tuning of the network models
Step1.1 The pre-training data set of the speech recognition network is the public Chinese data set THCHS-30; after pre-training, training continues on a self-defined data set, and 5 labels are set. The input end of the network receives the speech-feature output of the MFCC stage, and the speech components of the high and low frequency bands are removed by filtering to meet the communication requirement.
Step1.2 Two-layer network for dynamic gesture recognition: the data set is the public dynamic gesture data set EgoGesture; after training, the network is fine-tuned on the data set, with data augmentation by cropping, scaling, video-frame cycling and similar operations. The detector outputs two labels, gesture or no gesture; the classifier uses the 5 labels corresponding to the speech labels and takes 112×112 images as input.
Step1.3 A fully connected layer is added to the speech recognition and dynamic gesture recognition networks, and normalization is applied to obtain the confidences of all labels.
Step2 Data acquisition
Step2.1 Acquisition of the speech signal: the speech input under the six-microphone annular array is V_x = α_0·x_0 + α_1·x_1 + … + α_5·x_5, where V_x is the total output audio signal, α_i is the weight of each microphone, α_0 + α_1 + … + α_5 = 1 and initially α_0 = α_1 = … = α_5. After the wake-up word is detected, the high-resolution spectral estimation method is applied and the correlation matrix of the microphones is analysed to determine the sound-source direction {θ}; the microphones are opened and continuous speech input is received for speech recognition. The weight of the microphone x_i facing the sound-source direction is increased and the audio signals from the other directions are suppressed, achieving the effect of hardware noise reduction.
Step2.2 The robot locates the sound source and adjusts its own posture according to the sound-source direction angle so that it directly faces the interacting object; it acquires the interacting object's field of view and gesture information so that the vision system can work, while the comfort of the interaction is also improved.
Step2.3 The camera sensor starts working and the detector and classifier queues are synchronized. To avoid missing a gesture, the detector runs ahead of the classifier queue and slides over the video frames with stride s = 1; the raw probabilities predicted by the detector are added to a queue of length k (q_k), the queue size k is chosen to be 4, and these raw values are median-filtered to obtain the optimal result. When the probability of a detected gesture exceeds the set threshold, the classification network is activated.
Step3 multi-modal recognition
Step3.1 After a speech signal is input, the audio is filtered through the MFCC stage; the power-normalized spectrogram obtained by preprocessing, framing, windowing and fast Fourier transform, then filtered by the triangular band-pass filters, is the input of the speech recognition network model, and the signal energy output by each band-pass filter is used as a basic feature of the signal.
Step3.2 The extracted speech features are sent into the end-to-end CNN for recognition; GLU is used as the activation function and the loss function is CTC.
Step3.3 Dynamic gesture recognition on the video: when the detector recognizes a gesture, the classifier starts working. The size of the images input to the classifier network is 112×112, and the start and end frames are the minimum between the frame indices at which the detector recognizes the gesture appearing and disappearing and the set maximum number of frames. The optimizer is stochastic gradient descent (SGD) with a damping factor of 0.9 and a weight decay of 0.001.
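For reference, a possible PyTorch optimizer setup matching these hyper-parameters is shown below; the learning rate and the stand-in classifier module are assumptions, and the "damping factor" of 0.9 is interpreted here as SGD momentum.

```python
import torch
import torch.nn as nn

classifier = nn.Linear(512, 5)     # stand-in for the 3D CNN + LSTM classifier head
# lr is an assumption; momentum=0.9 and weight_decay=0.001 follow the values above
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.001)
```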
Step4 A fully connected layer is added to the speech and gesture recognition networks, and the prediction probability values of all labels are output after normalization: {label(i), Con(V_i)}, {label(j), Con(A_j)} (i, j = 0, 1, …, 5).
Step5 multimodal fusion
Step5.1 The results of the two networks are output to the rule-based intention voter, where each data group in T contains a label and the corresponding gesture or speech confidence: S = {label(i, j), Con(V_i), Con(A_j)} (i, j = 0, 1, …, 5), and the output results of the two modalities under the same label correspond to each other one by one.
Step5.2 The upper threshold ULN and the lower threshold UCL are set to 80% and 20% respectively, a flag bit is set to indicate the information relation between the two modalities, and the intention voter performs a logical operation on the current prediction results of the two modalities and outputs a flag value representing the relation between them.
Step6 When the flag value is 0, there is no corresponding input and no corresponding output, and the robot does not act.
Step7 When the flag value is 1, a single modality acts; when it is 2, the two modalities complement each other and point to the same result.
Step8 When the flag value is 3, the two modalities conflict; for this uncertain event the improved D-S evidence theory is needed to compute the fusion result.
Step8.1 {Con(V_i), Con(A_j)} is converted into {m_V(i), m_A(j)}. For any pair of independent labels {label(i), label(j)} whose prediction probabilities differ only slightly, satisfying |Con(i) − Con(j)| ≤ ε = 0.2 (i, j = 1, 2, …, 5), a union element label(i, j+1) is added, and the union of the initial labels {label(1), label(2), …, label(5)} is also added, so that the total number of label categories satisfies 5 < sum(label) < 2^5, forming the frame of discernment Θ.
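The frame construction of this step can be sketched as follows; the patent does not state how much mass the added union elements receive, so the "share" transfer and the small mass assigned to the full union below are purely illustrative, as are the function name and parameters.

```python
def build_bpa(con, eps=0.2, share=0.1):
    """Build a BPA over the labels: singletons keep most of their normalized
    confidence; any pair whose confidences differ by at most eps also gets a union
    focal element, and the union of all labels receives a small residual mass."""
    m = {frozenset({i}): c for i, c in enumerate(con)}
    for i in range(len(con)):
        for j in range(i + 1, len(con)):
            if abs(con[i] - con[j]) <= eps:
                a, b = frozenset({i}), frozenset({j})
                moved = share * (m[a] + m[b])          # illustrative mass transfer
                m[a] -= share * m[a]
                m[b] -= share * m[b]
                m[frozenset({i, j})] = m.get(frozenset({i, j}), 0.0) + moved
    theta = frozenset(range(len(con)))
    m[theta] = m.get(theta, 0.0) + 1e-3                # union of all labels (Theta)
    total = sum(m.values())
    return {A: v / total for A, v in m.items()}        # normalize so the BPA sums to 1
```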
Step8.2 The update of a single element is computed with the Bayesian approximation of the mass function:
m̂({θ_i}) = Σ_{θ_i ∈ A} m(A) / Σ_{B⊆Θ} |B| m(B),
and all other unions output 0. After the mass-function values have been updated, the combination result is computed with the Dempster rule, and the fused output probabilities of all labels are output.
Step9 The output result of the multimodal fusion is passed to the robot; the robot outputs parameters to the motors through the ROS interface and performs the corresponding action.

Claims (4)

1. A D-S evidence theory multi-modal fusion man-machine interaction method based on a rule intention voter, characterized in that: first, the robot auditory system collects audio information with a six-microphone annular array and determines the sound-source direction with the MUSIC algorithm; it then adjusts its own posture, performs hardware noise reduction, carries out MFCC speech-feature preprocessing and recognizes the speech result with an end-to-end gated CNN; the vision system detects and recognizes dynamic gestures with a two-layer network, the detector and the classifier act on the video frames with a sliding-window method with stride s = 1, temporal information is processed with a deep CNN framework combining 3D CNN and LSTM, and the gesture actions are classified; finally, a fully connected layer is added to the speech and gesture recognition networks, normalization is applied, the different modalities are fused with the D-S evidence theory algorithm based on the rule intention voter, and the robot's understanding of the interacting object's intention is output.
2. The D-S evidence theory multi-modal fusion human-computer interaction method based on the rule intention voter of claim 1, wherein the six-microphone annular array is the speech acquisition device, characterized in that: spatial-domain and time-domain attributes are added to the audio input, hardware noise reduction is performed while the azimuth of the speaking object is judged, and the speech input signal is strengthened; the azimuth is determined with a high-resolution spectral estimation method, and the θ corresponding to the maximum of the spectral function is the estimate of the signal-source direction;
the six microphones initially have the same weight: V_x = α_0·x_0 + α_1·x_1 + … + α_5·x_5, where V_x is the total output audio signal of the microphone array, α_i is the weight of each microphone, α_0 + α_1 + … + α_5 = 1, the microphones differ only in their relative spatial positions, which correspond to the timing relations of the audio input signals, and α_0 = α_1 = … = α_5; once the correlation matrix of the microphones has been analysed to determine the sound-source direction {θ}, the weight of the microphone x_i facing the sound-source direction is increased and the audio signals from the other directions are suppressed;
Mel-frequency cepstral coefficients (MFCC) are used to filter the speech input signal and reduce the influence of noise; the power-normalized spectrogram obtained through preprocessing, framing, windowing and fast Fourier transform, then filtered by a bank of triangular band-pass filters, serves as the input of the speech recognition network model, and the signal energy output by each band-pass filter is used as a basic feature of the signal and fed into the speech recognition network;
focusing on the speed of speech recognition, an end-to-end network architecture based entirely on CNN is designed, with 12 convolutional layers in total; a gated linear unit GLU is used as the activation function, the loss function is set to CTC, and no prior alignment of the speech data is required when the model makes predictions.
3. The D-S evidence theory multi-modal fusion human-computer interaction method based on the rule intention voter of claim 1, wherein the six-microphone annular array is the speech acquisition device, characterized in that: in the dynamic gesture recognition, a two-layer network structure is used to recognize dynamic gestures, the detector and the classifier use a sliding-window method on the input video stream, and the stride is s = 1;
the raw probabilities predicted by the detector are added to a queue (q_k) whose size k is chosen to be 4, and these raw values are median-filtered to obtain the optimal result; the classifier is activated only when gesture information is detected; the classifier network framework is composed of 3D CNN and LSTM networks.
4. The D-S evidence theory multi-modal fusion human-computer interaction method based on the rule intention voter of claim 1, wherein the six-microphone annular array is the speech acquisition device, characterized in that: in the multimodal fusion strategy, a fully connected layer is added to the network outputs of the auditory and visual channels and normalization is applied to obtain the confidences of all labels, {label(i), Con(V_i)} and {label(j), Con(A_j)} (i, j = 0, 1, …, n); a rule-based intention voter is designed, to which the results of the two networks are output; each data group in T contains a label and the corresponding gesture or speech confidence: S = {label(i, j), Con(V_i), Con(A_j)} (i, j = 0, 1, …, n), and the output results of the two modalities under the same label correspond to each other; upper and lower thresholds ULN and UCL are set to 80% and 20% respectively, a flag bit is set to indicate the information relation between the two modalities, and the intention voter performs a logical operation on the current prediction results of the two modalities and outputs a flag value representing the relation between them: no modal response flag = 0, single modality flag = 1, bimodal complementarity flag = 2, conflict between the two modalities flag = 3; the current working mode of the robot and the relation between the two modalities are judged from the current value of the flag bit: when flag = 0 there is no input, or the input signal is outside the robot's range of understanding, and the robot does not act; when flag = 1 or 2 the information is complementary in the single-modality or multimodal working mode, and the intention voter outputs a unique, definite value; when flag = 3 the robot works in the multimodal mechanism and the recognition results of the different modalities conflict, and for this situation D-S evidence theory is introduced and improved to better suit the multimodal conflict problem of human-computer interaction; the improved D-S evidence theory algorithm: {Con(V_i), Con(A_j)} is converted into {m_V(i), m_A(j)}; for any pair of independent labels {label(i), label(j)} whose prediction probabilities differ only slightly, satisfying |Con(i) − Con(j)| ≤ ε = 0.2 (i, j = 1, 2, …, n), a union element label(i, j+1) is added, and the union of the initial labels {label(1), label(2), …, label(n)} is also added, forming the frame of discernment Θ with a total number of label categories satisfying n < sum(label) < 2^n; only a single value is needed in the human-computer interaction process, and the single elements are updated with the Bayesian approximation of the mass function:
m̂({θ_i}) = Σ_{θ_i ∈ A} m(A) / Σ_{B⊆Θ} |B| m(B),   m̂(A) = 0 for |A| > 1;
all other hypotheses are set directly to 0; after the mass-function values have been updated, the combination result is computed with the Dempster rule, and the fused output probabilities of all labels are output to guide the robot's actions.
CN202110179052.XA 2021-02-09 2021-02-09 D-S evidence theory multi-mode fusion man-machine interaction method based on rule intention voter Pending CN112861726A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110179052.XA CN112861726A (en) 2021-02-09 2021-02-09 D-S evidence theory multi-mode fusion man-machine interaction method based on rule intention voter

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110179052.XA CN112861726A (en) 2021-02-09 2021-02-09 D-S evidence theory multi-mode fusion man-machine interaction method based on rule intention voter

Publications (1)

Publication Number Publication Date
CN112861726A true CN112861726A (en) 2021-05-28

Family

ID=75988103

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110179052.XA Pending CN112861726A (en) 2021-02-09 2021-02-09 D-S evidence theory multi-mode fusion man-machine interaction method based on rule intention voter

Country Status (1)

Country Link
CN (1) CN112861726A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114029963A (en) * 2022-01-12 2022-02-11 北京具身智能科技有限公司 Robot operation method based on visual and auditory fusion
CN114265498A (en) * 2021-12-16 2022-04-01 中国电子科技集团公司第二十八研究所 Method for combining multi-modal gesture recognition and visual feedback mechanism
CN117718969A (en) * 2024-01-18 2024-03-19 浙江孚宝智能科技有限公司 Household robot control system and method based on visual and auditory fusion
CN117718969B (en) * 2024-01-18 2024-05-31 浙江孚宝智能科技有限公司 Household robot control system and method based on visual and auditory fusion

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107150347A (en) * 2017-06-08 2017-09-12 华南理工大学 Robot perception and understanding method based on man-machine collaboration
CN111680620A (en) * 2020-06-05 2020-09-18 中国人民解放军空军工程大学 Human-computer interaction intention identification method based on D-S evidence theory
CN112083806A (en) * 2020-09-16 2020-12-15 华南理工大学 Self-learning emotion interaction method based on multi-modal recognition

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107150347A (en) * 2017-06-08 2017-09-12 华南理工大学 Robot perception and understanding method based on man-machine collaboration
CN111680620A (en) * 2020-06-05 2020-09-18 中国人民解放军空军工程大学 Human-computer interaction intention identification method based on D-S evidence theory
CN112083806A (en) * 2020-09-16 2020-12-15 华南理工大学 Self-learning emotion interaction method based on multi-modal recognition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
阳平; 陈香; 李云; 王文会; 杨基海: "A sign language gesture recognition method based on the fusion of multi-sensor information", Space Medicine & Medical Engineering, No. 04 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114265498A (en) * 2021-12-16 2022-04-01 中国电子科技集团公司第二十八研究所 Method for combining multi-modal gesture recognition and visual feedback mechanism
CN114265498B (en) * 2021-12-16 2023-10-27 中国电子科技集团公司第二十八研究所 Method for combining multi-mode gesture recognition and visual feedback mechanism
CN114029963A (en) * 2022-01-12 2022-02-11 北京具身智能科技有限公司 Robot operation method based on visual and auditory fusion
CN117718969A (en) * 2024-01-18 2024-03-19 浙江孚宝智能科技有限公司 Household robot control system and method based on visual and auditory fusion
CN117718969B (en) * 2024-01-18 2024-05-31 浙江孚宝智能科技有限公司 Household robot control system and method based on visual and auditory fusion

Similar Documents

Publication Publication Date Title
US11335347B2 (en) Multiple classifications of audio data
US20190354797A1 (en) Recurrent multimodal attention system based on expert gated networks
CN101101752B (en) Monosyllabic language lip-reading recognition system based on vision character
CN108804453B (en) Video and audio recognition method and device
EP3716266B1 (en) Artificial intelligence device and method of operating artificial intelligence device
CN110826466A (en) Emotion identification method, device and storage medium based on LSTM audio-video fusion
KR102281504B1 (en) Voice sythesizer using artificial intelligence and operating method thereof
CN112861726A (en) D-S evidence theory multi-mode fusion man-machine interaction method based on rule intention voter
CN110674483B (en) Identity recognition method based on multi-mode information
US20230068798A1 (en) Active speaker detection using image data
KR20210155401A (en) Speech synthesis apparatus for evaluating the quality of synthesized speech using artificial intelligence and method of operation thereof
CN113095249A (en) Robust multi-mode remote sensing image target detection method
CN112183107A (en) Audio processing method and device
KR20210153165A (en) An artificial intelligence device that provides a voice recognition function, an operation method of the artificial intelligence device
KR20210058152A (en) Control Method of Intelligent security devices
CN114242066A (en) Speech processing method, speech processing model training method, apparatus and medium
US10540972B2 (en) Speech recognition device, speech recognition method, non-transitory recording medium, and robot
KR102265874B1 (en) Method and Apparatus for Distinguishing User based on Multimodal
Feng et al. DAMUN: A domain adaptive human activity recognition network based on multimodal feature fusion
EP4030352A1 (en) Task-specific text generation based on multimodal inputs
KR20210048271A (en) Apparatus and method for performing automatic audio focusing to multiple objects
Wang et al. Multimodal human-robot interaction on service robot
US11681364B1 (en) Gaze prediction
CN114611546A (en) Multi-mobile sound source positioning method and system based on space and frequency spectrum time sequence information modeling
Aarabi et al. Robust speech processing using multi-sensor multi-source information fusion––an overview of the state of the art

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination