CN112861726A - D-S evidence theory multi-mode fusion man-machine interaction method based on rule intention voter - Google Patents

D-S evidence theory multi-mode fusion man-machine interaction method based on rule intention voter

Info

Publication number
CN112861726A
CN112861726A (application number CN202110179052.XA)
Authority
CN
China
Prior art keywords
label
voice
robot
intention
voter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110179052.XA
Other languages
Chinese (zh)
Inventor
李秀智
王珩
张祥银
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202110179052.XA priority Critical patent/CN112861726A/en
Publication of CN112861726A publication Critical patent/CN112861726A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Acoustics & Sound (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Signal Processing (AREA)
  • Manipulator (AREA)

Abstract

The invention discloses a D-S evidence theory multi-modal fusion human-computer interaction method based on a rule intention voter. The robot auditory system collects audio information, adjusts the robot's posture and performs hardware noise reduction; the vision system detects and recognizes dynamic gestures with a two-layer network and classifies the gesture actions. A fully connected layer is added to the speech and gesture recognition networks, and the robot's understanding of the interacting object's intention is output. The two modalities carry out the human-computer interaction dialogue in parallel and assist each other, so more information can be received and accurate intention understanding achieved; vision and hearing are modalities people accept readily, and the interaction mechanism is improved accordingly. The judgments output by the different modalities are matched against the current information input. The fusion result emphasizes the relations among deep-level information, resolves fusion across multiple modalities, can cope with evidence conflicts between different modalities, and focuses on a single result within the label set, which makes it better suited to human-computer interaction.

Description

D-S evidence theory multi-mode fusion man-machine interaction method based on rule intention voter
Technical Field
The present invention relates to the field of human-robot interaction (HRI) technology and multimodal fusion. Specifically: the robot auditory system determines the sound-source direction with the MUSIC algorithm and, after MFCC speech-feature preprocessing, recognizes the speech result with an end-to-end gated CNN; the vision system detects and recognizes dynamic gestures with a two-layer network, processes the temporal information with a deep CNN framework combining 3D CNN and LSTM, and classifies the gesture actions. A fully connected layer is added to the speech and gesture recognition networks, normalization is applied, the different modalities are fused with a D-S evidence theory algorithm based on the rule intention voter, and the robot's understanding of the interacting object's intention is output.
Background
Human-robot interaction is a core problem of service-robot research, and the perception modality plays a fundamental role in the communication between a human and a robot; people can interact with robots through gestures, speech, body posture, facial expression, touch and other modalities. Existing interaction modes fall into device-based interaction, single-modality interaction and multimodal interaction. Device-based interaction conveys information to the robot by having the interacting person wear an information-collecting device, but it limits the flexibility and comfort of the interaction. Human-robot interaction in a single modality can be disturbed by the surrounding environment, and the interacting person's own behaviour can limit the accuracy of robot recognition; moreover, a single perception modality restricts the diversity of content, making the interaction process monotonous and tedious and reducing comfort.
Multimodal interaction is the main current research direction. Information from different modalities is both redundant and complementary. Existing fusion schemes include probability-theoretic methods, neural-network methods and others; most of them focus on the complementarity between modalities and cannot handle uncertain information by resolving conflicts.
Evidence theory, also known as the Dempster rule of evidence combination, solves the multi-valued mapping problem with upper and lower probability bounds. It can directly express "uncertain" and "unknown" and is widely used in expert systems, information fusion and other fields.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a multimodal human-computer interaction mechanism based on sound perception and vision, which adopts a D-S evidence theory fusion algorithm based on a rule intention voter:
(1) The audio signal is preprocessed with the MFCC technique to extract speech features, which are sent to a trained end-to-end CNN network architecture for speech recognition and classification.
(2) A two-layer network structure is adopted to recognize dynamic gestures, consisting of two network modules. Detector: a lightweight CNN architecture that runs in real time to detect gestures. Classifier: a deep CNN architecture. A sliding-window approach is used on the input video stream, the detector queue always runs ahead of the classifier queue, and the classifier is activated only when gesture information is detected.
(3) A fully connected layer is added to the network outputs of the auditory and visual channels to output the confidence of each label. The results of the two networks are passed to the rule-based intention voter to judge the relation between the modalities. There are four cases in total: no modal response, single modality, the two modalities complementing each other, and the two modalities conflicting. In the first three cases a directed result is output; when the two modalities conflict, the conflict is resolved through an improved D-S evidence theory and the correct result is output according to the correlations.
The method comprises the following specific steps:
step1 speech recognition
Step1.1 The speech acquisition device of the invention is a six-microphone annular array, which adds spatial-domain and time-domain attributes to the audio input; while judging the azimuth of the speaking object it also performs hardware noise reduction and strengthens the speech input signal. The azimuth is determined with a high-resolution spectral estimation method. Let the spacing between microphones be d, the wavelength of the signal in space be λ, the wavefront of the k-th source signal arriving at the m-th microphone be f_k(t), and the noise received by the m-th microphone be n_m(t) (m = 1, 2, …, M). The signal received by the m-th microphone is expressed as
x_m(t) = Σ_{k=1}^{K} a_{mk} f_k(t) + n_m(t),
where θ_k is the direction of the k-th signal source and a_{mk} is the response of the m-th microphone to signal k, determined by the array geometry, d, λ and θ_k. Written in vector form, the microphone-array observation is X(t) = A F(t) + N(t). Because the noise at the different microphones is uncorrelated, the covariance matrix of the received data X(t) is S = E{X(t) X^H(t)}, where ^H denotes the conjugate transpose. The signal obtained by the microphone array consists of the source signals and noise; the subspace spanned by the eigenvectors of the smallest eigenvalues is called the noise subspace, and its orthogonal complement is the signal subspace:
span{v_{K+1}, v_{K+2}, …, v_M} ⊥ span{a(θ_1), …, a(θ_K)},
so that for a signal from direction θ_k, a^H(θ_k) v_i = 0 (i = K+1, …, M). Using the orthogonality between the signal subspace and the noise subspace, the spatial spectrum function is constructed and a spectral-peak search is performed:
P(θ) = 1 / ( a^H(θ) E_n E_n^H a(θ) ),   with E_n = [v_{K+1}, …, v_M].
The θ corresponding to the maximum of the spectral function is the estimate of the signal-source direction, i.e. the sound-source localization result. Initially the six microphones are weighted equally: V_x = α_0·x_0 + α_1·x_1 + … + α_5·x_5, where V_x is the total output audio signal of the microphone array, α_i is the weight of each microphone, α_0 + α_1 + … + α_5 = 1, the microphones differ only in their relative spatial positions, which correspond to the timing relations of the audio input signals, and α_0 = α_1 = … = α_5. Once the correlation matrix of the microphones has been analysed to determine the sound-source direction {θ}, the weight of the microphone x_i facing the sound-source direction is increased and the audio signals from the other directions are suppressed, with the weights still summing to one.
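As an illustration of the localization step, the following NumPy sketch computes the MUSIC pseudo-spectrum from microphone snapshots; the uniform-linear steering model, the function name and all numerical parameters are assumptions for demonstration rather than the patent's configuration (the patent uses a six-microphone annular array).

```python
import numpy as np

def music_spectrum(X, n_sources, d, wavelength, angles_deg):
    """MUSIC pseudo-spectrum for an M-element array.

    X: complex array of shape (M, T) with one row of snapshots per microphone.
    Returns P(theta) evaluated at angles_deg; its peak estimates the source azimuth.
    """
    M, T = X.shape
    S = (X @ X.conj().T) / T                      # sample covariance S = E{X X^H}
    eigvals, eigvecs = np.linalg.eigh(S)          # eigenvalues in ascending order
    En = eigvecs[:, :M - n_sources]               # noise subspace (smallest eigenvalues)
    P = []
    for theta in np.deg2rad(angles_deg):
        # steering vector under a uniform-linear approximation of the array geometry
        a = np.exp(-2j * np.pi * d * np.arange(M) * np.sin(theta) / wavelength)
        P.append(1.0 / np.abs(a.conj() @ En @ En.conj().T @ a))
    return np.asarray(P)

# Example use: theta_hat = angles[np.argmax(music_spectrum(X, 1, d, lam, angles))]
```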
Step1.2 In the invention, Mel-frequency cepstral coefficients (MFCC) are used to filter the speech input signal and reduce the influence of noise. The power-normalized spectrogram of the audio obtained through preprocessing, framing, windowing and fast Fourier transform, then filtered by a bank of triangular band-pass filters, serves as the input of the speech recognition network model; the signal energy output by each band-pass filter is used as a basic feature of the signal and fed into the speech recognition network.
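A minimal sketch of this preprocessing chain, assuming the librosa library (not named in the patent); the FFT size, hop length, window choice and filter-bank sizes below are illustrative values, not the patent's settings.

```python
import librosa
import numpy as np

def speech_features(wav_path, sr=16000, n_mels=40, n_mfcc=13):
    """Framing, windowing, FFT, triangular (mel) band-pass filtering and MFCC."""
    y, sr = librosa.load(wav_path, sr=sr)
    # power spectrogram: framing + Hamming window + FFT
    S = np.abs(librosa.stft(y, n_fft=512, hop_length=160, win_length=400,
                            window="hamming")) ** 2
    # triangular mel filter-bank energies (the per-band energies fed to the network)
    mel = librosa.feature.melspectrogram(S=S, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)
    # cepstral coefficients derived from the filter-bank output
    mfcc = librosa.feature.mfcc(S=log_mel, sr=sr, n_mfcc=n_mfcc)
    return log_mel, mfcc
```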
Step1.3 The invention focuses on the speed of speech recognition and designs an end-to-end network architecture based entirely on CNN, following Wav2letter, with 12 convolutional layers in total. The first layer of the model extracts key features from the MFCC-filtered speech and can be regarded as a nonlinear convolution, with a kernel width of 31280 and a stride of 320. A gated linear unit (GLU) is used as the activation function and the loss function is set to connectionist temporal classification (CTC), so no prior alignment of the speech data is required when the model makes predictions.
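The following PyTorch sketch illustrates the kind of all-convolutional, GLU-activated recognizer trained with CTC described above; the number of layers, channel widths and kernel sizes are placeholders rather than the patent's exact architecture.

```python
import torch
import torch.nn as nn

class GatedConvSpeechNet(nn.Module):
    """Wav2letter-style all-convolutional recognizer with GLU activations and CTC."""
    def __init__(self, n_features=40, n_classes=6):      # e.g. 5 command labels + CTC blank
        super().__init__()
        layers, channels = [], [n_features] + [200] * 5
        for cin, cout in zip(channels[:-1], channels[1:]):
            # each conv emits 2*cout channels; GLU gates them back down to cout
            layers += [nn.Conv1d(cin, 2 * cout, kernel_size=13, padding=6), nn.GLU(dim=1)]
        layers.append(nn.Conv1d(channels[-1], n_classes, kernel_size=1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):                    # x: (batch, n_features, time)
        logits = self.net(x)                 # (batch, n_classes, time)
        return logits.permute(2, 0, 1).log_softmax(dim=-1)   # (time, batch, classes)

ctc_loss = nn.CTCLoss(blank=0)               # CTC removes the need to pre-align audio and labels
```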
Step2 dynamic gesture recognition
Step2.1 A two-layer network structure is used to recognize dynamic gestures, consisting of two modules. Detector: a lightweight CNN architecture that runs in real time to detect gestures. Classifier: a deep CNN architecture. A sliding-window method is used on the input video stream; the detector queue always runs ahead of the classifier queue so that no gesture information is missed, the stride is s = 1, and the classifier is activated only when gesture information is detected.
Step2.2 The detector is structured as a ResNet, and the raw probabilities it predicts are added to a queue of length k (q_k); the queue size k is chosen to be 4, and these raw values are median-filtered to obtain the optimal result. The classifier is composed of 3D CNN and LSTM networks.
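A small sketch of the detector-side gating logic described above (length-4 queue, median filtering, classifier activation); the class name and the activation threshold are assumptions introduced for illustration.

```python
from collections import deque
import numpy as np

class GestureGate:
    """Detector-side gate: keep the detector's raw gesture probabilities in a
    length-k queue, median-filter them, and wake the classifier on a threshold."""
    def __init__(self, k=4, threshold=0.5):     # k = 4 as in the text; threshold assumed
        self.queue = deque(maxlen=k)
        self.threshold = threshold

    def update(self, p_gesture):
        self.queue.append(p_gesture)
        filtered = float(np.median(self.queue))  # median filter suppresses single-frame spikes
        return filtered >= self.threshold        # True -> activate the classifier

# gate = GestureGate()
# for p in detector_probs:        # sliding window over the video stream, stride s = 1
#     if gate.update(p):
#         run_classifier(clip)    # the heavier classifier (3D CNN + LSTM) runs only when gated on
```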
Step3 multimodal fusion strategy
Step3.1 A fully connected layer is added to the network outputs of the auditory and visual channels, and normalization is applied to obtain the confidences of all labels, {label(i), Con(V_i)} and {label(j), Con(A_j)} (i, j = 0, 1, …, n). The invention designs a rule-based intention voter that focuses on the links between the different modalities, including complementarity and conflict. The results of the two networks are output to the rule-based intention voter, where each data group in T contains a label and the corresponding gesture or speech confidence:
S = {label(i, j), Con(V_i), Con(A_j)}  (i, j = 0, 1, …, n),
where T is the container designed to store the data, and the output results of the two modalities under the same label correspond to each other.
Step3.2 Upper and lower thresholds ULN and UCL are set, which characterize the strength of the model's prediction for an event; they are set to 80% and 20% respectively. A flag bit is set to indicate the information relation between the two modalities; the intention voter performs a logical operation on the current prediction results of the two modalities and outputs a flag value representing the relation between them. There are four cases: no modal response (flag = 0), single modality (flag = 1), bimodal complementarity (flag = 2) and conflict between the two modalities (flag = 3).
Step3.2.1 When there is no modal response, i.e. neither the visual nor the auditory system detects an input signal of a corresponding template, only label(0) has a definite value and the output confidences of all other labels are below UCL; the robot then takes no action:
Con(V_i) < UCL and Con(A_j) < UCL for all i, j ≠ 0  →  flag = 0.
the single mode of Step3.2.2 means that only one mode exists, visual or auditory action, and the other mode does not detect an input interaction signal, and at the moment, the robot operates to a single mode mechanism, and an output result is a recognition result of the mode:
Figure BDA0002940921350000041
Step3.2.3 Bimodal complementarity is the most common multimodal situation: vision and hearing simultaneously recognize and detect input signals of the same label, which effectively reinforces the understanding of the interacting object's intention:
max_i Con(V_i) ≥ ULN, max_j Con(A_j) ≥ ULN and argmax_i Con(V_i) = argmax_j Con(A_j)  →  flag = 2.
step3.2.4 conflict between the two modes, which is an uncertain event of seed production by a multi-mode fusion machine, and the prediction results of the two modes are different label values.
Figure BDA0002940921350000043
Step3.3 The current working mode of the robot and the relation between the two modalities are judged from the current value of the flag bit. When flag = 0 there is no input, or the input signal is outside the robot's range of understanding, and the robot does not act. When flag = 1 or flag = 2 the information in the single-modality or multimodal working mode is complementary, and the intention voter outputs a unique, definite value. When flag = 3 the robot is working in the multimodal mechanism and the recognition results of the different modalities conflict; for this situation the invention introduces D-S evidence theory and improves it to better suit the multimodal conflict problem of human-computer interaction.
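The flag logic of the intention voter can be sketched as follows; the function signature and the handling of borderline confidences are assumptions, and only the four flag values and the two thresholds come from the text.

```python
def intention_vote(con_v, con_a, ULN=0.8, UCL=0.2):
    """Rule-based intention voter: con_v / con_a are the per-label confidences of the
    vision and speech networks (index 0 = 'no command'). Returns (flag, label or None)."""
    i = max(range(1, len(con_v)), key=lambda k: con_v[k])    # best visual label
    j = max(range(1, len(con_a)), key=lambda k: con_a[k])    # best auditory label

    if con_v[i] < UCL and con_a[j] < UCL:
        return 0, None                      # no modal response: robot does nothing
    v_on, a_on = con_v[i] >= ULN, con_a[j] >= ULN
    if v_on and not a_on:
        return 1, i                         # single modality: vision only
    if a_on and not v_on:
        return 1, j                         # single modality: speech only
    if v_on and a_on and i == j:
        return 2, i                         # bimodal complement: same label
    # remaining cases (different labels, or borderline confidences) go to D-S fusion
    return 3, None
```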
Step3.4 The labels set above are mutually independent and satisfy the prior condition of D-S evidence theory; all values of label form the frame of discernment Θ. When an uncertain event occurs, the normalized outputs of vision and hearing serve as the basic probability assignment (BPA for short), i.e. {Con(V_i), Con(A_j)} is converted into {m_V(i), m_A(j)}. For any pair of independent labels {label(i), label(j)} whose prediction probabilities differ only slightly, satisfying |Con(i) − Con(j)| ≤ ε = 0.2 (i, j = 1, 2, …, n), a union element label(i, j+1) is added, and the union of the initial labels {label(1), label(2), …, label(n)} is also added, so that the total number of label categories satisfies n < sum(label) < 2^n. In this way the Zadeh paradox is effectively mitigated without affecting the prediction result. The BPA on the frame of discernment Θ satisfies
m(∅) = 0  and  Σ_{A⊆Θ} m(A) = 1.
Any A with m(A) > 0 is called a focal element. Evidence theory introduces belief functions to study the union, intersection, complement and inclusion of evidence from the viewpoint of set theory. The belief function Bel and the plausibility function Pl based on the BPA m on the frame Θ are
Bel(A) = Σ_{B⊆A} m(B),   Pl(A) = Σ_{B∩A≠∅} m(B).
The belief function Bel(A) and the plausibility function Pl(A) form the belief interval [Bel(A), Pl(A)], which expresses the degree of certainty about a hypothesis. The key part of evidence theory is the evidence combination formula; the same frame may carry several basic probability assignments from different data sources (two are used here because visual and auditory modalities are fused). The combination formula takes the orthogonal sum of several basic probability assignments; the combination rule is
m(A) = (1 / (1 − K)) Σ_{B∩C=A} m_1(B) m_2(C)  for A ≠ ∅,   m(∅) = 0,
where K is the normalization constant expressing the degree of conflict between the different pieces of evidence:
K = Σ_{B∩C=∅} m_1(B) m_2(C).
In traditional D-S evidence theory, the computation explodes exponentially when the number of labels is large; moreover, to make the fusion algorithm suitable for human-robot interaction, only a single instruction may be output to the robot, and the robot must not be confused by an output instruction set. The Dempster combination rule is therefore improved. Since the Dempster combination formula degenerates to the Bayes formula, only the single elements of the frame of discernment are considered during combination, the subsets containing several hypotheses are ignored, and the single elements are updated with the Bayesian approximation of the mass function:
m̂({θ_i}) = Σ_{θ_i ∈ A} m(A) / Σ_{B⊆Θ} |B| m(B),   m̂(A) = 0 for |A| > 1.
All other hypotheses are set directly to 0; after the mass-function values have been updated, the combination result is computed with the Dempster rule, and the fused output probabilities of all labels are output to guide the robot's actions.
Compared with the prior art, the invention has the following advantages:
(1) The invention starts from the robot's vision and hearing; the two modalities carry out the human-computer interaction dialogue in parallel and assist each other, which overcomes the limitation of a single modality and allows more information to be received and accurate intention understanding to be made. Vision and hearing are more easily accepted by people, and the improvement of the interaction mechanism can significantly increase the comfort of the interaction process.
(2) Addressing the redundancy and complementarity of information from different modalities in the multimodal fusion process, the invention designs the rule-based intention voter, which matches the judgments output by the different modalities one-to-one against the current information input and uses the flag bit to indicate the relation between them; if they conflict, information fusion is performed with the improved D-S evidence theory. The fusion result emphasizes the relations among deep-level information, handles the fusion of multiple modalities well, can cope with evidence conflicts between different modalities, and focuses on a single result within the label set, which makes it better suited to human-computer interaction.
Drawings
FIG. 1 is a flow chart of a multi-modal human-machine interaction technology based on vision and hearing;
FIG. 2 is a diagram of an end-to-end voice recognition network architecture;
FIG. 3 is a flow chart of a pre-processing and attention mechanism of a speech signal;
FIG. 4 is a diagram of a dynamic gesture recognition architecture formed by 3D CNN and LSTM;
FIG. 5 is a flow chart of the present invention.
Detailed Description
The specific experiments of the invention are carried out on a robot platform developed in-house by the laboratory, equipped with a six-microphone annular array and a depth camera. The upper computer is an NVIDIA TX2 high-performance processor, the operating system is Ubuntu, and a conda environment is used to configure the deep CNNs under the PyTorch framework to complete the speech and gesture recognition tasks. The whole program runs under the distributed robot control framework ROS, and the experimental scene is indoors. The embodiment of the present invention is described in detail with reference to FIG. 1 and FIG. 5.
Step1 Pre-training and fine-tuning of the network models
Step1.1 The pre-training data set of the speech recognition network is the public Chinese data set THCHS-30; after pre-training, training continues on a self-defined data set, and 5 labels are set. The input end of the network receives the speech-feature output of the MFCC stage, and the speech components of the high and low frequency bands are removed by filtering to meet the communication requirement.
Step1.2 Two-layer network for dynamic gesture recognition: the data set is the public dynamic gesture data set EgoGesture; after training, the network is fine-tuned on the data set, with data augmentation by cropping, scaling, video-frame cycling and similar operations. The detector outputs two labels, gesture or no gesture; the classifier uses the 5 labels corresponding to the speech labels and takes 112×112 images as input.
Step1.3 A fully connected layer is added to the speech recognition and dynamic gesture recognition networks, and normalization is applied to obtain the confidences of all labels.
Step2 Data acquisition
Step2.1 Acquisition of the speech signal: the speech input under the six-microphone annular array is V_x = α_0·x_0 + α_1·x_1 + … + α_5·x_5, where V_x is the total output audio signal, α_i is the weight of each microphone, α_0 + α_1 + … + α_5 = 1 and initially α_0 = α_1 = … = α_5. After the wake-up word is detected, the high-resolution spectral estimation method is applied and the correlation matrix of the microphones is analysed to determine the sound-source direction {θ}; the microphones are opened and continuous speech input is received for speech recognition. The weight of the microphone x_i facing the sound-source direction is increased and the audio signals from the other directions are suppressed, achieving the effect of hardware noise reduction.
Step2.2 The robot locates the sound source and adjusts its own posture according to the sound-source direction angle so that it directly faces the interacting object; it acquires the interacting object's field of view and gesture information so that the vision system can work, while the comfort of the interaction is also improved.
Step2.3 The camera sensor starts working and the detector and classifier queues are synchronized. To avoid missing a gesture, the detector runs ahead of the classifier queue and slides over the video frames with stride s = 1; the raw probabilities predicted by the detector are added to a queue of length k (q_k), the queue size k is chosen to be 4, and these raw values are median-filtered to obtain the optimal result. When the probability of a detected gesture exceeds the set threshold, the classification network is activated.
Step3 multi-modal recognition
Step3.1 After a speech signal is input, the audio is filtered through the MFCC stage; the power-normalized spectrogram obtained by preprocessing, framing, windowing and fast Fourier transform, then filtered by the triangular band-pass filters, is the input of the speech recognition network model, and the signal energy output by each band-pass filter is used as a basic feature of the signal.
Step3.2 The extracted speech features are sent into the end-to-end CNN for recognition; GLU is used as the activation function and the loss function is CTC.
Step3.3 Dynamic gesture recognition on the video: when the detector recognizes a gesture, the classifier starts working. The size of the images input to the classifier network is 112×112, and the start and end frames are the minimum between the frame indices at which the detector recognizes the gesture appearing and disappearing and the set maximum number of frames. The optimizer is stochastic gradient descent (SGD) with a damping factor of 0.9 and a weight decay of 0.001.
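For reference, a possible PyTorch optimizer setup matching these hyper-parameters is shown below; the learning rate and the stand-in classifier module are assumptions, and the "damping factor" of 0.9 is interpreted here as SGD momentum.

```python
import torch
import torch.nn as nn

classifier = nn.Linear(512, 5)     # stand-in for the 3D CNN + LSTM classifier head
# lr is an assumption; momentum=0.9 and weight_decay=0.001 follow the values above
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.001)
```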
Step4 A fully connected layer is added to the speech and gesture recognition networks, and the prediction probability values of all labels are output after normalization: {label(i), Con(V_i)}, {label(j), Con(A_j)} (i, j = 0, 1, …, 5).
Step5 multimodal fusion
Step5.1 The results of the two networks are output to the rule-based intention voter, where each data group in T contains a label and the corresponding gesture or speech confidence: S = {label(i, j), Con(V_i), Con(A_j)} (i, j = 0, 1, …, 5), and the output results of the two modalities under the same label correspond to each other one by one.
Step5.2 The upper threshold ULN and the lower threshold UCL are set to 80% and 20% respectively, a flag bit is set to indicate the information relation between the two modalities, and the intention voter performs a logical operation on the current prediction results of the two modalities and outputs a flag value representing the relation between them.
Step6 When the flag value is 0, there is no corresponding input and no corresponding output, and the robot does not act.
Step7 When the flag value is 1, a single modality acts; when it is 2, the two modalities complement each other and point to the same result.
Step8 When the flag value is 3, the two modalities conflict; for this uncertain event the improved D-S evidence theory is needed to compute the fusion result.
Step8.1 {Con(V_i), Con(A_j)} is converted into {m_V(i), m_A(j)}. For any pair of independent labels {label(i), label(j)} whose prediction probabilities differ only slightly, satisfying |Con(i) − Con(j)| ≤ ε = 0.2 (i, j = 1, 2, …, 5), a union element label(i, j+1) is added, and the union of the initial labels {label(1), label(2), …, label(5)} is also added, so that the total number of label categories satisfies 5 < sum(label) < 2^5, forming the frame of discernment Θ.
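The frame construction of this step can be sketched as follows; the patent does not state how much mass the added union elements receive, so the "share" transfer and the small mass assigned to the full union below are purely illustrative, as are the function name and parameters.

```python
def build_bpa(con, eps=0.2, share=0.1):
    """Build a BPA over the labels: singletons keep most of their normalized
    confidence; any pair whose confidences differ by at most eps also gets a union
    focal element, and the union of all labels receives a small residual mass."""
    m = {frozenset({i}): c for i, c in enumerate(con)}
    for i in range(len(con)):
        for j in range(i + 1, len(con)):
            if abs(con[i] - con[j]) <= eps:
                a, b = frozenset({i}), frozenset({j})
                moved = share * (m[a] + m[b])          # illustrative mass transfer
                m[a] -= share * m[a]
                m[b] -= share * m[b]
                m[frozenset({i, j})] = m.get(frozenset({i, j}), 0.0) + moved
    theta = frozenset(range(len(con)))
    m[theta] = m.get(theta, 0.0) + 1e-3                # union of all labels (Theta)
    total = sum(m.values())
    return {A: v / total for A, v in m.items()}        # normalize so the BPA sums to 1
```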
Step8.2 The update of a single element is computed with the Bayesian approximation of the mass function:
m̂({θ_i}) = Σ_{θ_i ∈ A} m(A) / Σ_{B⊆Θ} |B| m(B),
and all other unions output 0. After the mass-function values have been updated, the combination result is computed with the Dempster rule, and the fused output probabilities of all labels are output.
Step9 The output result of the multimodal fusion is passed to the robot; the robot outputs parameters to the motors through the ROS interface and performs the corresponding action.

Claims (4)

1. A D-S evidence theory multi-modal fusion man-machine interaction method based on a rule intention voter, characterized in that: first, the robot auditory system collects audio information with a six-microphone annular array and determines the sound-source direction with the MUSIC algorithm; it then adjusts its own posture, performs hardware noise reduction, carries out MFCC speech-feature preprocessing and recognizes the speech result with an end-to-end gated CNN; the vision system detects and recognizes dynamic gestures with a two-layer network, the detector and the classifier act on the video frames with a sliding-window method with stride s = 1, temporal information is processed with a deep CNN framework combining 3D CNN and LSTM, and the gesture actions are classified; finally, a fully connected layer is added to the speech and gesture recognition networks, normalization is applied, the different modalities are fused with the D-S evidence theory algorithm based on the rule intention voter, and the robot's understanding of the interacting object's intention is output.
2. The D-S evidence theory multi-modal fusion human-computer interaction method based on the rule intention voter of claim 1, wherein the six-microphone annular array is the speech acquisition device, characterized in that: spatial-domain and time-domain attributes are added to the audio input, hardware noise reduction is performed while the azimuth of the speaking object is judged, and the speech input signal is strengthened; the azimuth is determined with a high-resolution spectral estimation method, and the θ corresponding to the maximum of the spectral function is the estimate of the signal-source direction;
the six microphones initially have the same weight: V_x = α_0·x_0 + α_1·x_1 + … + α_5·x_5, where V_x is the total output audio signal of the microphone array, α_i is the weight of each microphone, α_0 + α_1 + … + α_5 = 1, the microphones differ only in their relative spatial positions, which correspond to the timing relations of the audio input signals, and α_0 = α_1 = … = α_5; once the correlation matrix of the microphones has been analysed to determine the sound-source direction {θ}, the weight of the microphone x_i facing the sound-source direction is increased and the audio signals from the other directions are suppressed;
Mel-frequency cepstral coefficients (MFCC) are used to filter the speech input signal and reduce the influence of noise; the power-normalized spectrogram obtained through preprocessing, framing, windowing and fast Fourier transform, then filtered by a bank of triangular band-pass filters, serves as the input of the speech recognition network model, and the signal energy output by each band-pass filter is used as a basic feature of the signal and fed into the speech recognition network;
focusing on the speed of speech recognition, an end-to-end network architecture based entirely on CNN is designed, with 12 convolutional layers in total; a gated linear unit GLU is used as the activation function, the loss function is set to CTC, and no prior alignment of the speech data is required when the model makes predictions.
3. The D-S evidence theory multi-modal fusion human-computer interaction method based on the rule intention voter of claim 1, wherein the six-microphone annular array is the speech acquisition device, characterized in that: in the dynamic gesture recognition, a two-layer network structure is used to recognize dynamic gestures, the detector and the classifier use a sliding-window method on the input video stream, and the stride is s = 1;
the raw probabilities predicted by the detector are added to a queue (q_k) whose size k is chosen to be 4, and these raw values are median-filtered to obtain the optimal result; the classifier is activated only when gesture information is detected; the classifier network framework is composed of 3D CNN and LSTM networks.
4. The D-S evidence theory multi-modal fusion human-computer interaction method based on the rule intention voter of claim 1, wherein the six-microphone annular array is the speech acquisition device, characterized in that: in the multimodal fusion strategy, a fully connected layer is added to the network outputs of the auditory and visual channels and normalization is applied to obtain the confidences of all labels, {label(i), Con(V_i)} and {label(j), Con(A_j)} (i, j = 0, 1, …, n); a rule-based intention voter is designed, to which the results of the two networks are output; each data group in T contains a label and the corresponding gesture or speech confidence: S = {label(i, j), Con(V_i), Con(A_j)} (i, j = 0, 1, …, n), and the output results of the two modalities under the same label correspond to each other; upper and lower thresholds ULN and UCL are set to 80% and 20% respectively, a flag bit is set to indicate the information relation between the two modalities, and the intention voter performs a logical operation on the current prediction results of the two modalities and outputs a flag value representing the relation between them: no modal response flag = 0, single modality flag = 1, bimodal complementarity flag = 2, conflict between the two modalities flag = 3; the current working mode of the robot and the relation between the two modalities are judged from the current value of the flag bit: when flag = 0 there is no input, or the input signal is outside the robot's range of understanding, and the robot does not act; when flag = 1 or 2 the information is complementary in the single-modality or multimodal working mode, and the intention voter outputs a unique, definite value; when flag = 3 the robot works in the multimodal mechanism and the recognition results of the different modalities conflict, and for this situation D-S evidence theory is introduced and improved to better suit the multimodal conflict problem of human-computer interaction; the improved D-S evidence theory algorithm: {Con(V_i), Con(A_j)} is converted into {m_V(i), m_A(j)}; for any pair of independent labels {label(i), label(j)} whose prediction probabilities differ only slightly, satisfying |Con(i) − Con(j)| ≤ ε = 0.2 (i, j = 1, 2, …, n), a union element label(i, j+1) is added, and the union of the initial labels {label(1), label(2), …, label(n)} is also added, forming the frame of discernment Θ with a total number of label categories satisfying n < sum(label) < 2^n; only a single value is needed in the human-computer interaction process, and the single elements are updated with the Bayesian approximation of the mass function:
m̂({θ_i}) = Σ_{θ_i ∈ A} m(A) / Σ_{B⊆Θ} |B| m(B),   m̂(A) = 0 for |A| > 1;
all other hypotheses are set directly to 0; after the mass-function values have been updated, the combination result is computed with the Dempster rule, and the fused output probabilities of all labels are output to guide the robot's actions.
CN202110179052.XA 2021-02-09 2021-02-09 D-S evidence theory multi-mode fusion man-machine interaction method based on rule intention voter Pending CN112861726A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110179052.XA CN112861726A (en) 2021-02-09 2021-02-09 D-S evidence theory multi-mode fusion man-machine interaction method based on rule intention voter

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110179052.XA CN112861726A (en) 2021-02-09 2021-02-09 D-S evidence theory multi-mode fusion man-machine interaction method based on rule intention voter

Publications (1)

Publication Number Publication Date
CN112861726A true CN112861726A (en) 2021-05-28

Family

ID=75988103

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110179052.XA Pending CN112861726A (en) 2021-02-09 2021-02-09 D-S evidence theory multi-mode fusion man-machine interaction method based on rule intention voter

Country Status (1)

Country Link
CN (1) CN112861726A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114029963A (en) * 2022-01-12 2022-02-11 北京具身智能科技有限公司 Robot operation method based on visual and auditory fusion
CN114265498A (en) * 2021-12-16 2022-04-01 中国电子科技集团公司第二十八研究所 Method for combining multi-modal gesture recognition and visual feedback mechanism
CN117718969A (en) * 2024-01-18 2024-03-19 浙江孚宝智能科技有限公司 Household robot control system and method based on visual and auditory fusion
CN117718969B (en) * 2024-01-18 2024-05-31 浙江孚宝智能科技有限公司 Household robot control system and method based on visual and auditory fusion

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107150347A (en) * 2017-06-08 2017-09-12 华南理工大学 Robot perception and understanding method based on man-machine collaboration
CN111680620A (en) * 2020-06-05 2020-09-18 中国人民解放军空军工程大学 Human-computer interaction intention identification method based on D-S evidence theory
CN112083806A (en) * 2020-09-16 2020-12-15 华南理工大学 Self-learning emotion interaction method based on multi-modal recognition

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107150347A (en) * 2017-06-08 2017-09-12 华南理工大学 Robot perception and understanding method based on man-machine collaboration
CN111680620A (en) * 2020-06-05 2020-09-18 中国人民解放军空军工程大学 Human-computer interaction intention identification method based on D-S evidence theory
CN112083806A (en) * 2020-09-16 2020-12-15 华南理工大学 Self-learning emotion interaction method based on multi-modal recognition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
阳平; 陈香; 李云; 王文会; 杨基海: "A sign language gesture recognition method based on the fusion of multi-sensor information", Space Medicine & Medical Engineering, No. 04 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114265498A (en) * 2021-12-16 2022-04-01 中国电子科技集团公司第二十八研究所 Method for combining multi-modal gesture recognition and visual feedback mechanism
CN114265498B (en) * 2021-12-16 2023-10-27 中国电子科技集团公司第二十八研究所 Method for combining multi-mode gesture recognition and visual feedback mechanism
CN114029963A (en) * 2022-01-12 2022-02-11 北京具身智能科技有限公司 Robot operation method based on visual and auditory fusion
CN117718969A (en) * 2024-01-18 2024-03-19 浙江孚宝智能科技有限公司 Household robot control system and method based on visual and auditory fusion
CN117718969B (en) * 2024-01-18 2024-05-31 浙江孚宝智能科技有限公司 Household robot control system and method based on visual and auditory fusion

Similar Documents

Publication Publication Date Title
US11335347B2 (en) Multiple classifications of audio data
US20190354797A1 (en) Recurrent multimodal attention system based on expert gated networks
CN101101752B (en) Monosyllabic language lip-reading recognition system based on vision character
CN108804453B (en) Video and audio recognition method and device
EP3716266B1 (en) Artificial intelligence device and method of operating artificial intelligence device
CN110826466A (en) Emotion identification method, device and storage medium based on LSTM audio-video fusion
KR102281504B1 (en) Voice sythesizer using artificial intelligence and operating method thereof
CN112861726A (en) D-S evidence theory multi-mode fusion man-machine interaction method based on rule intention voter
CN110674483B (en) Identity recognition method based on multi-mode information
US20230068798A1 (en) Active speaker detection using image data
KR20210155401A (en) Speech synthesis apparatus for evaluating the quality of synthesized speech using artificial intelligence and method of operation thereof
CN113095249A (en) Robust multi-mode remote sensing image target detection method
CN112183107A (en) Audio processing method and device
KR20210153165A (en) An artificial intelligence device that provides a voice recognition function, an operation method of the artificial intelligence device
KR20210058152A (en) Control Method of Intelligent security devices
CN114242066A (en) Speech processing method, speech processing model training method, apparatus and medium
US10540972B2 (en) Speech recognition device, speech recognition method, non-transitory recording medium, and robot
KR102265874B1 (en) Method and Apparatus for Distinguishing User based on Multimodal
Feng et al. DAMUN: A domain adaptive human activity recognition network based on multimodal feature fusion
EP4030352A1 (en) Task-specific text generation based on multimodal inputs
KR20210048271A (en) Apparatus and method for performing automatic audio focusing to multiple objects
Wang et al. Multimodal human-robot interaction on service robot
US11681364B1 (en) Gaze prediction
CN114611546A (en) Multi-mobile sound source positioning method and system based on space and frequency spectrum time sequence information modeling
Aarabi et al. Robust speech processing using multi-sensor multi-source information fusion––an overview of the state of the art

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination