CN112462940A - Intelligent home multi-mode man-machine natural interaction system and method thereof - Google Patents

Intelligent home multi-mode man-machine natural interaction system and method thereof

Info

Publication number
CN112462940A
CN112462940A (application CN202011339808.4A)
Authority
CN
China
Prior art keywords
module
model
gesture
data set
voice recognition
Prior art date
Legal status
Pending
Application number
CN202011339808.4A
Other languages
Chinese (zh)
Inventor
奚雪峰
邵帮丽
崔志明
付保川
杨敬晶
Current Assignee
Suzhou Golden Bit Information Technology Co ltd
Suzhou University of Science and Technology
Original Assignee
Suzhou Golden Bit Information Technology Co ltd
Suzhou University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Suzhou Golden Bit Information Technology Co ltd, Suzhou University of Science and Technology filed Critical Suzhou Golden Bit Information Technology Co ltd
Priority to CN202011339808.4A
Priority to PCT/CN2021/078420 (published as WO2022110564A1)
Publication of CN112462940A
Legal status: Pending

Classifications

    • G06F3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • G06F18/259 Fusion by voting
    • G06N3/045 Combinations of networks
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06T5/30 Erosion or dilatation, e.g. thinning
    • G06T5/40 Image enhancement or restoration using histogram techniques
    • G06T5/70 Denoising; Smoothing
    • G06T7/11 Region-based segmentation
    • G06T7/136 Segmentation; Edge detection involving thresholding
    • G06T7/90 Determination of colour characteristics
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G10L15/063 Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • G06T2207/10024 Color image
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • Y02P90/02 Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Probability & Statistics with Applications (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention relates to an intelligent home multi-mode man-machine natural interaction system and method. A gesture recognition model pre-training module trains a built network model on a gesture data set matching the scene and saves the trained gesture recognition model; a voice recognition model pre-training module trains an acoustic model and a language model in sequence on a Chinese voice data set and saves the trained voice recognition model; a gesture recognition module predicts acquired gestures with the saved gesture recognition model; a voice recognition module calls the saved voice recognition model to recognize the collected audio; and a multi-mode fusion module fuses the results of the two modalities from the gesture recognition module and the voice recognition module to obtain the final instruction. By fusing the gesture recognition and voice recognition modalities, household devices can receive instructions in multiple forms, which improves instruction accuracy.

Description

Intelligent home multi-mode man-machine natural interaction system and method thereof
Technical Field
The invention relates to an intelligent home multi-mode man-machine natural interaction system and method, and belongs to the field of intelligent home man-machine interaction.
Background
Multi-modal fusion combines models across different modalities: an overall model outputs the information features obtained from several information channels. Because it learns from multiple modalities, the model obtains more comprehensive feature information and can keep working and produce correct output even when one modality fails or is lost, which greatly improves its robustness. Since the fused models are independent of one another, their individual errors do not interact, so errors do not accumulate.
The goal of gesture recognition research is a system that can be driven by gestures and reacts differently as the gesture changes. Gesture detection and segmentation are the primary tasks: the conventional approach detects hand motion from a combination of visual cues such as skin color, shape, pixel values and motion, then tracks the gesture to obtain frame-to-frame coordinates of the hand or fingers, producing a motion trajectory for the subsequent recognition stage. The ultimate aim of gesture recognition is to interpret the semantics the gesture is meant to express.
Speech recognition is, at its core, statistical pattern recognition and relies on two models: an acoustic model, which maps between words and pinyin, and a language model, which gives the probability of a word appearing in a whole sentence. The acoustic model classifies the acoustic features of the speech and maps them to phoneme-like units; the language model splices the phonemes produced by the acoustic model into a complete sentence; finally, some text post-processing of the recognition result yields the final output.
Smart homes have reached a certain level of development, but existing systems still have problems in human-computer interaction. Infrared remote control via a remote controller or a mobile phone relies on keys or a touch screen and requires a third-party mobile device, which is inconvenient. Controlling household devices with a voice assistant draws on a single input source, makes no use of the flexibility of the user's limbs, and cannot cope with ambiguous input. The development of gesture recognition, speech recognition and multi-modal techniques provides a solution to these problems.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides an intelligent home multi-mode man-machine natural interaction system and method.
The purpose of the invention is realized by the following technical scheme:
the intelligent home multi-mode man-machine natural interaction system is characterized in that: the system comprises a gesture recognition model pre-training module, a voice recognition model pre-training module, a gesture recognition module, a voice recognition module and a multi-mode fusion module, wherein the gesture recognition model pre-training module is used for training the built network model by utilizing a gesture data set and storing the trained gesture recognition model; the voice recognition model pre-training module loads a Chinese voice data set, trains an acoustic model and a language model in sequence, and stores the trained voice recognition model; the gesture recognition module predicts the acquired gestures by utilizing the gesture recognition model stored by the gesture recognition model pre-training module; the voice recognition module calls the voice recognition model stored by the voice recognition model pre-training module to recognize the collected audio; and the multi-mode fusion module fuses two mode results of the gesture recognition module and the voice recognition module to obtain a final instruction.
Furthermore, in the intelligent home multi-mode man-machine natural interaction system, the gesture recognition model pre-training module comprises a data set building module, a data preprocessing module, a model building module and a model training module. The data set building module presets five types of labels, namely close, open, up, down and nothing, collects an equal number of gesture pictures for each, and enlarges the data scale with data enhancement methods to provide data support for gesture recognition model training. The data preprocessing module obtains the standardized model input through denoising, skin color segmentation, binarization, morphological processing and contour extraction. The model building module builds the network model that extracts features from the gesture pictures. The model training module feeds the data set from the data set building module into the network model of the model building module in batches, updates the model parameters with the backpropagation algorithm, and saves the trained gesture recognition model.
Further, in the intelligent home multi-mode man-machine natural interaction system, the data set building module uses the camera to collect self-defined pictures for the five instructions and expands the data set with data enhancement methods: adding salt-and-pepper noise, adding Gaussian noise, reducing picture brightness, raising picture brightness, and rotating and flipping at random angles. The data preprocessing module performs denoising, skin color segmentation, binarization, morphological processing and contour extraction. Denoising uses Gaussian filtering: a convolution template scans each pixel in the image, and the weighted average gray value of the pixels in its neighborhood replaces the value of the central pixel. For a two-dimensional template of size m×n, the point (x, y) on the convolution template satisfies:
G(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{(x - m/2)^2 + (y - n/2)^2}{2\sigma^2}}
where σ is the standard deviation of the normal distribution (the smaller its value, the sharper the image), and m and n are the dimensions of the convolution template;
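In OpenCV this smoothing is a single call. A minimal sketch follows, assuming a Python/OpenCV pipeline; the file name and the 5×5 kernel are illustrative choices, not values from the disclosure:

```python
import cv2

frame = cv2.imread("gesture.jpg")       # hypothetical captured frame
# 5x5 Gaussian kernel; sigma=0 lets OpenCV derive it from the kernel size.
denoised = cv2.GaussianBlur(frame, (5, 5), 0)
```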
the first of the two skin color segmentation methods is skin color segmentation based on an adaptive threshold method, and a gray level histogram is calculated and normalized; then calculating the mean value of the gray levels; then, calculating a zero order moment u [ i ] and a first order moment v [ i ] according to the histogram; then, the maximum inter-class variance f [ i ] is calculated, and at this time, the gray value of this variance is the adaptive threshold, and the formula is as follows:
f[i] = \frac{\left(v_{\text{total}}\, u[i] - v[i]\right)^2}{u[i]\left(1 - u[i]\right)}

where v_{\text{total}} is the first-order moment of the entire histogram (the global mean gray level).
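This maximum between-class variance search is Otsu's method, which OpenCV implements directly. A minimal sketch, with the input path assumed:

```python
import cv2

gray = cv2.imread("gesture.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical input
# Otsu's method scans every candidate threshold and keeps the one that
# maximises the between-class variance f[i]; that threshold is returned as t.
t, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
```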
the other is based on the skin color segmentation of an HSV color space, and the operation of the SkinMask mode is to acquire a gesture block diagram and convert the gesture block diagram into the HSV color space; obtaining HSV values of all pixel points of the picture, namely splitting a two-dimensional matrix into three two-dimensional matrices; finally, according to the skin color range, defining a mask with H, S, V values, setting a judgment condition, and setting the mask to be black within the skin color range; after the skin color segmentation is finished, carrying out binarization processing operation on the selected image, wherein a binarization algorithm is calculated by using the following formula, wherein T is a threshold value:
g(x, y) = \begin{cases} 0, & f(x, y) < T \\ 255, & f(x, y) \geq T \end{cases}
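A minimal sketch of the SkinMask segmentation in OpenCV; the H/S/V bounds are assumed for illustration, since the disclosure does not list exact values:

```python
import cv2
import numpy as np

frame = cv2.imread("gesture.jpg")                 # hypothetical input
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
# Assumed skin-colour bounds; the patent does not give the exact range.
lower = np.array([0, 48, 80], dtype=np.uint8)
upper = np.array([20, 255, 255], dtype=np.uint8)
mask = cv2.inRange(hsv, lower, upper)             # marks pixels in the skin range
```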
morphological processing then removes the black specks left by skin color segmentation and the white specks left on the background, using erosion and dilation, where dilation takes a local maximum and erosion takes a local minimum;
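A minimal sketch of the two morphological operations, continuing from the mask above; the 3×3 kernel is an assumed choice:

```python
import cv2
import numpy as np

kernel = np.ones((3, 3), np.uint8)
# Erosion = local minimum (removes stray white specks);
# dilation = local maximum (fills small black holes).
cleaned = cv2.erode(mask, kernel, iterations=1)
cleaned = cv2.dilate(cleaned, kernel, iterations=1)
```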
the gesture contour is then extracted from the skin color region: after the preprocessed image is obtained, pseudo-contours are removed and the contour with the largest area is located; the moments of each order, perimeter, area, centroid, shortest and longest path lengths and circumscribed rectangle of each contour are computed; the outer envelope and the set of defect points of each contour are acquired; after a second removal of false contours, the feature vector of the contour is computed relative to the centroid; finally, the points in the contour that may be fingers are located in sequence.
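This contour pipeline maps naturally onto OpenCV primitives. A minimal sketch, assuming the cleaned binary mask from the previous steps:

```python
import cv2

contours, _ = cv2.findContours(cleaned, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
hand = max(contours, key=cv2.contourArea)          # largest-area contour = hand
m = cv2.moments(hand)                              # moments of each order
cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]  # centroid
perimeter = cv2.arcLength(hand, True)
x, y, w, h = cv2.boundingRect(hand)                # circumscribed rectangle
hull = cv2.convexHull(hand, returnPoints=False)    # outer envelope
defects = cv2.convexityDefects(hand, hull)         # defect points: finger candidates
```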
Further, in the intelligent home multi-mode man-machine natural interaction system, the voice recognition model pre-training module comprises a data set loading module, an acoustic model building module, a language model building module and a model training module. The data set loading module downloads a Chinese voice data set and specifies the file path. The acoustic model building module builds a deep convolutional neural network modeled on VGG, based on the Keras and TensorFlow frameworks; combined with CTC decoding, it merges consecutive identical symbols into one and removes the blank separator marker to obtain the actual pinyin symbol sequence of the speech. The language model building module converts the pinyin sequence produced by the acoustic model building module into the final text result and outputs it. The model training module feeds the data from the data set loading module through the acoustic model building module and the language model building module in sequence for training, and saves the trained voice recognition model.
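The CTC post-processing described here (merge consecutive identical symbols, then drop the blank separator) can be sketched as a greedy decode; the function below is an illustration, not the patent's own code:

```python
import numpy as np

def ctc_greedy_decode(logits, blank=0):
    """Greedy CTC decoding: take the per-frame argmax, merge consecutive
    identical symbols into one, then drop the blank separator, leaving
    the pinyin symbol sequence."""
    best = np.argmax(logits, axis=-1)
    merged = [s for i, s in enumerate(best) if i == 0 or s != best[i - 1]]
    return [s for s in merged if s != blank]
```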
Further, in the multi-modal man-machine natural interaction system for smart home, the gesture recognition module includes a gesture collection module, a model calling module and a visualization module, and the gesture collection module is used for obtaining a new single gesture input; the model calling module calls a model trained by the gesture recognition model pre-training module, and takes the gesture acquired by the gesture acquisition module as input to obtain a gesture prediction result; and the visualization module displays the prediction result in a new window.
Further, in the intelligent home multi-mode man-machine natural interaction system, the voice recognition module comprises a recording module, a model calling module and a text mapping module; the recording module collects audio within a limited time and saves it as a wav file; the model calling module calls the model file saved by the voice recognition model pre-training module and takes the wav file saved by the recording module as new model input, obtaining the text recognized from the voice; and the text mapping module computes the similarity between that text and the Chinese corresponding to each label preset in the gesture recognition model pre-training module, selecting the label with the largest similarity value as the instruction result of the voice recognition.
Further, in the intelligent home multi-mode man-machine natural interaction system, the multi-mode fusion module fuses the two modal results from the gesture recognition module and the voice recognition module, predicting the class with the highest probability across the gesture recognition and voice recognition classifiers by a voting method to obtain the final instruction.
The invention discloses a multi-mode man-machine natural interaction method for smart home, which comprises the following steps:
a) First, gesture pictures are collected with OpenCV, the data set is expanded with data enhancement methods, and the pictures in the data set are preprocessed into standardized input. The CNN model used by the gesture recognition part is built from twelve layers, and the Resnet50 model packaged in Keras is also called; the two network models are trained separately on the preprocessed data set, and the trained gesture recognition models are saved;
b) Next, the acoustic model is built: a deep convolutional neural network based on the Keras and TensorFlow frameworks, combined with CTC decoding. The language model is a bigram model (a toy sketch follows these steps). The acoustic model and language model are trained on the THCHS30 Chinese voice data set, and the trained voice recognition model is saved;
c) The user's current gesture picture is acquired and processed in sequence: Gaussian denoising; skin color segmentation in either the Binary mode based on the adaptive threshold method or the SkinMask mode based on the HSV color space; binarization, which separates the target from the background and noise regions of the image; erosion and dilation; and finally extraction of the gesture contour from the skin color region. The processed picture is fed to the CNN and Resnet50 models to obtain each model's predicted instruction for the current gesture;
d) The user's audio is collected and saved as a wav file; framing and windowing the wav file yields a spectrogram (a sketch follows these steps), which is fed to the trained acoustic model; CTC decoding produces a Chinese pinyin sequence, which is fed to the language model to obtain the text corresponding to the pinyin sequence, i.e. the voice recognition result;
e) The similarity between the text result of voice recognition and each gesture recognition label is computed, mapping the voice result onto a gesture label; weighted voting over the gesture recognition results and the mapped voice recognition result then yields the category with the highest probability as the final instruction.
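Two of the speech-side steps above can be sketched briefly. First, the bigram language model of step b); the stand-in corpus is illustrative, not THCHS30:

```python
from collections import defaultdict

# Toy corpus standing in for real transcripts (illustrative only).
corpus = [["turn", "off", "the", "air", "conditioner"],
          ["turn", "on", "the", "light"]]

counts = defaultdict(lambda: defaultdict(int))
for sent in corpus:
    for w1, w2 in zip(sent, sent[1:]):
        counts[w1][w2] += 1            # bigram counts -> P(w2 | w1)

def best_next(prev, candidates):
    # Among candidate words (e.g. characters sharing one pinyin), pick the
    # most likely continuation of the previous word under the bigram counts.
    return max(candidates, key=lambda w: counts[prev][w])
```

Second, the framing-and-windowing of step d) that turns a wav file into a spectrogram; the file name, 25 ms frame and 10 ms hop are assumed values, and a mono recording is assumed:

```python
import numpy as np
from scipy.io import wavfile

rate, signal = wavfile.read("command.wav")              # hypothetical recording
frame_len, hop = int(0.025 * rate), int(0.010 * rate)   # 25 ms frames, 10 ms hop
window = np.hamming(frame_len)
frames = np.array([signal[i:i + frame_len] * window
                   for i in range(0, len(signal) - frame_len + 1, hop)])
spectrogram = np.abs(np.fft.rfft(frames, axis=1))       # magnitude per frame
```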
Furthermore, in the intelligent home multi-mode man-machine natural interaction method, the data enhancement methods adopted in step a) include adding salt-and-pepper noise, adding Gaussian noise, reducing picture brightness, raising picture brightness, and rotating and flipping at random angles to expand the data set; the pictures in the data set are denoised with Gaussian filtering, skin color is segmented with the Binary mode based on the adaptive threshold method and the SkinMask mode based on the HSV color space, binarization and morphological processing by erosion and dilation are applied, and the gesture contour is extracted from the skin color region to complete the data preprocessing.
Compared with the prior art, the invention has obvious advantages and beneficial effects, and is embodied in the following aspects:
the intelligent home multi-mode man-machine natural interaction system and the method thereof utilize the gestures and voice of people to control home equipment in a multi-instruction mode, overcome the defect of low accuracy of single modality, improve the accuracy of instructions and enable man-machine interaction to be more natural;
starting from human perception, the household devices can receive several kinds of instruction and the user can control them in several ways, removing the dependence on traditional keys and achieving contact-free control;
fusing the voice recognition and gesture recognition modalities overcomes the limitations that gesture recognition is easily affected by illumination and voice recognition by environmental noise; errors between the modalities do not stack and do not interfere with each other, and when one modality fails the household devices still work.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1: schematic diagram of the system of the present invention;
FIG. 2: the invention is a schematic structure diagram of the system;
FIG. 3: a schematic diagram of the structural principle of a gesture recognition model pre-training module;
FIG. 4 a: a predefined gesture (open) schematic;
FIG. 4 b: a predefined gesture (raise) diagram;
FIG. 4 c: a predefined gesture (turn down) diagram;
FIG. 4 d: a predefined gesture (close) schematic;
FIG. 5: a flow diagram of a data preprocessing module;
FIG. 6: a schematic diagram of the structural principle of a pre-training module of the speech recognition model;
FIG. 7: a schematic diagram of a gesture recognition module architecture principle;
FIG. 8: a schematic diagram of the speech recognition module architecture principle;
FIG. 9: and the multi-modal fusion module is a schematic diagram.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present invention, the directional terms and the sequence terms, etc. are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
Addressing the limitations of existing contact-based household device control methods (for example, adjustment is inconvenient when the fingers are wet or dirty), and considering the maturity of gesture recognition and voice recognition technology and the importance of smart home man-machine interaction, the method is applied to household device control in the smart home field, taking an air conditioner as an example. A contact-free approach is adopted, multi-modal fusion uses decision-level fusion, the models involved do not affect one another, and the application requirements are met.
As shown in fig. 1-2, the intelligent home multi-modal man-machine natural interaction system comprises a gesture recognition model pre-training module 1, a voice recognition model pre-training module 2, a gesture recognition module 3, a voice recognition module 4 and a multi-modal fusion module 5; the gesture recognition model pre-training module 1 and the voice recognition model pre-training module 2 respectively construct two pre-training models of gesture recognition and voice recognition, the gesture recognition module 3 and the voice recognition module 4 call the pre-training models to perform on-site acquisition and prediction, and the multi-mode fusion module 5 fuses results of the two modes according to a weighted voting method.
The gesture recognition model pre-training module 1 comprises a data set building module 101, a data preprocessing module 102, a model building module 103 and a model training module 104. The data set building module 101 presets five types of labels, namely close, open, up, down and nothing, collects an equal number of gesture pictures for each, and enlarges the data scale with data enhancement methods to provide data support for gesture recognition model training; the data preprocessing module 102 obtains the standardized model input through denoising, skin color segmentation, binarization, morphological processing and contour extraction; the model building module 103 builds the network model that extracts features from the gesture pictures; the model training module 104 feeds the data set from the data set building module 101 into the network model of the model building module 103 in batches, updates the model parameters with the backpropagation algorithm, and saves the trained gesture recognition model;
the flow of the gesture recognition model pre-training module 1 is shown in fig. 3. The data set construction module 101 first builds the gesture data set, collecting self-defined gestures with the camera as shown in figs. 4a to 4d: an "OK" sign corresponds to open (fig. 4a), a "V" sign to up (fig. 4b), a fist to down (fig. 4c), and a vertical palm to close (fig. 4d). A further class, "nothing", is defined for interfering pictures that match none of the above four gestures. Data enhancement, namely adding salt-and-pepper noise, adding Gaussian noise, reducing picture brightness, raising picture brightness, and rotating and flipping at random angles, then expands the data set (the augmentations are sketched below). The final data set contains 28,105 gesture pictures across the five classes, 5,621 per gesture, providing data support for model training;
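The listed augmentations might look as follows in OpenCV/NumPy; every parameter value here is an assumption, since the patent does not specify them:

```python
import cv2
import numpy as np

def augment(img):
    """Illustrative versions of the augmentations listed above; all
    parameter values are assumptions (a colour input image is expected)."""
    h, w = img.shape[:2]
    gauss = np.clip(img + np.random.normal(0, 15, img.shape), 0, 255).astype(np.uint8)
    dark = cv2.convertScaleAbs(img, alpha=0.6)               # lower brightness
    bright = cv2.convertScaleAbs(img, alpha=1.4)             # raise brightness
    M = cv2.getRotationMatrix2D((w / 2, h / 2), np.random.uniform(-25, 25), 1.0)
    rotated = cv2.warpAffine(img, M, (w, h))                 # random-angle rotation
    flipped = cv2.flip(img, 1)                               # horizontal flip
    pepper = img.copy()                                      # salt-and-pepper noise
    ys = np.random.randint(0, h, 1000)
    xs = np.random.randint(0, w, 1000)
    pepper[ys, xs] = np.random.choice([0, 255], size=(1000, 1))
    return [gauss, dark, bright, rotated, flipped, pepper]
```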
the data preprocessing module 102 preprocesses the data from the data set construction module 101 into standardized input, as shown in fig. 5; preprocessing covers denoising, skin color segmentation, binarization, morphological processing and contour extraction. Denoising is first achieved with Gaussian filtering, which scans each pixel in the image with a convolution template and replaces the value of the central pixel with the weighted average gray value of the pixels in its neighborhood. Assuming a two-dimensional template of size m×n, the point (x, y) on the convolution template satisfies:
G(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{(x - m/2)^2 + (y - n/2)^2}{2\sigma^2}}
where σ is the standard deviation of the normal distribution (the smaller its value, the sharper the image), and m and n are the dimensions of the convolution template.
Skin color segmentation screens out, detects and separates the pixel region of human skin in the image; two methods are used. The first is skin color segmentation based on the adaptive threshold method: compute and normalize the gray-level histogram; compute the mean gray level; compute the zero-order moment u[i] and the first-order moment v[i] from the histogram; then compute the maximum between-class variance f[i], and the gray value at which it peaks is the adaptive threshold, with the formula:
f[i] = \frac{\left(v_{\text{total}}\, u[i] - v[i]\right)^2}{u[i]\left(1 - u[i]\right)}

where v_{\text{total}} is the first-order moment of the entire histogram (the global mean gray level).
the SkinMask mode is based on the HSV color space: capture the gesture frame and convert it to HSV; obtain the HSV values of every pixel, i.e. split the matrix into three two-dimensional matrices; finally, define a mask over the H, S and V values according to the skin color range, with a judgment condition that sets the mask to black within the skin color range. The HSV model shows that as white increases, V stays constant while S decreases, so the method is very effective when lighting is sufficient. The selected image is then binarized, dividing its pixels into two classes by gray value; the binarization algorithm is computed by the following formula:
g(x, y) = \begin{cases} 0, & f(x, y) < T \\ 255, & f(x, y) \geq T \end{cases}
specifically, a threshold T is set in advance and the image pixels are divided against it: a pixel whose gray level is below T is rendered black, and one whose gray level is greater than or equal to T is rendered white.
Morphological processing consists of two operations, erosion and dilation: dilation takes a local maximum, erosion a local minimum.
Adopting a method of extracting a gesture outline from skin color, removing a pseudo outline and positioning the maximum outline of an area after obtaining a preprocessed image; then calculating the characteristics of each order moment, perimeter, area, mass center, shortest and longest path length and circumscribed rectangle of each contour; then acquiring an outer envelope of each contour and a set of defect points; secondly, calculating a feature vector of the contour based on the centroid after removing the false contour for the second time; finally, sequentially positioning points which may be fingers in the outline;
then the model building module 103 builds the network model that extracts picture features. The CNN model consists of twelve layers: two convolution layers, a pooling layer, fully-connected layers, two dropout layers to mitigate overfitting, a flatten layer connecting the convolution and fully-connected layers, and four activation functions; the CNN model is trained for 15 rounds. In addition, the Resnet50 model packaged in Keras is called directly, with 50 network layers and the input size adjusted to 200 × 200, and trained for 10 rounds on the preprocessed picture data. The model training module 104 holds out 20% of the data set in the data set building module 101 as a test set and another 20% as a validation set, leaving 17,987 pictures for training, and saves the two trained models. A sketch of the two models follows.
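One plausible reading of the two models in Keras; filter counts, dropout rates and channel counts are assumptions, and only the 200 × 200 input, the five labels and the overall layer types come from the description:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

# Twelve-layer CNN as described (filter counts and dropout rates assumed).
cnn = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(200, 200, 1)),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Dropout(0.25),                    # first dropout against overfitting
    layers.Flatten(),                        # bridges conv and dense layers
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),                     # second dropout layer
    layers.Dense(5, activation="softmax"),   # five gesture labels
])
cnn.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# The packaged ResNet50 (50 layers), retargeted to five labels at 200x200.
base = ResNet50(weights=None, include_top=False,
                input_shape=(200, 200, 3), pooling="avg")
resnet = models.Sequential([base, layers.Dense(5, activation="softmax")])
resnet.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```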
The voice recognition model pre-training module 2 is shown in fig. 6. The data set loading module 201 downloads and loads the voice data set: THCHS30 contains more than 10,000 Chinese speech files with a total duration of over 30 hours, a sampling rate of 16 kHz and a sample size of 16 bits. The acoustic model building module 202 builds a deep convolutional neural network modeled on VGG, based on the Keras and TensorFlow frameworks, to obtain the actual pinyin symbol sequence of the speech. The language model building module 203 uses a statistical language model to obtain the most probable character for each pinyin, converting the pinyin produced by the acoustic model building module 202 into the final recognized text and outputting it. The model training module 204 feeds the data from the data set loading module 201 through the acoustic model building module 202 and the language model building module 203 in sequence for training, and saves the trained models.
The gesture prediction process of the gesture recognition module 3 is shown in fig. 7. Suppose the gesture captured by the camera in the gesture capture module 301 is a fist, as in fig. 4c. The mask mode is a new capture mode: a background image is shot so the background content can be deleted, and the gesture is captured by subtracting the background from the new frame content of the ROI window. Once prediction mode starts, the labels from model training are displayed for comparison against the gesture captured by the camera; the user may change gesture, for example to the open-palm gesture of fig. 4d, or keep the fist unchanged. The model calling module 302 directly calls the model trained by the gesture recognition model pre-training module 1; the gesture collected by the gesture collection module 301 passes through Gaussian denoising, skin color segmentation, binarization, morphological processing and contour extraction before serving as model input, and the category given the highest probability by the ResNet50 and CNN models is displayed in the visualization module 303.
The voice recognition process of the voice recognition module 4 is shown in fig. 8. The recording module 401 collects audio within a limited time; suppose the collected audio is "turn off the air conditioner", saved as a wav file. The model calling module 402 calls the model files saved by the voice recognition model pre-training module 2 and takes the wav file saved by the recording module 401 as new model input: the acoustic model and CTC decoding yield the pinyin sequence "guan1 bi4 kong1 tiao2", and the language model yields the corresponding text, "turn off the air conditioner". Cosine similarity is then computed between this text and the text corresponding to each of the five gesture labels preset in the gesture recognition model pre-training module 1, and the label with the largest similarity value is selected as the result; a sketch of this mapping follows;
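A minimal sketch of the text-to-label mapping; the Chinese label strings and the bag-of-characters representation are illustrative assumptions, not taken from the patent:

```python
import numpy as np

# Assumed Chinese strings for the preset labels ("nothing" has no spoken
# counterpart here); these exact strings are illustrative.
LABELS = {"close": "关闭空调", "open": "打开空调", "up": "调高空调", "down": "调低空调"}

def bag_of_chars(text, vocab):
    return np.array([text.count(ch) for ch in vocab], dtype=float)

def map_to_label(recognized):
    vocab = sorted(set(recognized) | set("".join(LABELS.values())))
    v = bag_of_chars(recognized, vocab)
    def cos(u, w):
        return np.dot(u, w) / (np.linalg.norm(u) * np.linalg.norm(w) + 1e-9)
    return max(LABELS, key=lambda k: cos(v, bag_of_chars(LABELS[k], vocab)))

print(map_to_label("关闭空调"))   # -> "close"
```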
as shown in fig. 9, the multi-modal fusion module 5 fuses the two modal results from the gesture recognition module 3 and the voice recognition module 4 by voting: the Resnet50 result carries weight 0.5, the CNN model result weight 0.3, and the voice recognition result mapped onto a gesture label weight 0.2. The three results are weighted and summed, and the class with the highest probability is selected as the final air-conditioner instruction. Here both gesture recognition and voice recognition predict the close category, so the final instruction is close; if the results differ, each result is multiplied by its weight, the probability values of identical labels are added, and the class with the highest probability is selected as the final instruction.
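The weighted vote itself is only a few lines; the weights are the ones stated above, and the label names follow the preset classes:

```python
# Decision-level fusion by weighted voting, using the weights stated above.
WEIGHTS = {"resnet50": 0.5, "cnn": 0.3, "voice": 0.2}

def fuse(predictions):
    """predictions maps each source to its predicted label, e.g.
    {"resnet50": "close", "cnn": "close", "voice": "close"}."""
    scores = {}
    for source, label in predictions.items():
        scores[label] = scores.get(label, 0.0) + WEIGHTS[source]
    return max(scores, key=scores.get)   # label with the highest summed weight

print(fuse({"resnet50": "close", "cnn": "open", "voice": "close"}))  # -> "close"
```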
In conclusion, the intelligent home multi-mode man-machine natural interaction system and method control household devices with multiple forms of instruction using human gestures and voice, overcoming the low accuracy of a single modality and making man-machine interaction more natural. Starting from human perception, the household devices can receive several kinds of instruction and the user can control them in several ways, removing the dependence on traditional keys and achieving contact-free control. Fusing the voice recognition and gesture recognition modalities overcomes the limitations that gesture recognition is easily affected by illumination and voice recognition by environmental noise; errors between the modalities do not stack and do not interfere with each other, and when one modality fails the household devices still work. Applying multi-modal fusion to household device control improves the correctness of instructions.
The interaction process thus adopts a contact-free fusion of the voice recognition and gesture recognition modalities for smart home man-machine interaction.
The above description is only a preferred embodiment of the present invention and is not intended to limit it; various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its protection scope.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and shall be covered by the scope of the present invention.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another without necessarily requiring or implying any actual such relationship or order between them. Also, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements includes not only those elements but possibly also other elements not expressly listed or inherent to it. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article or apparatus that comprises it.

Claims (9)

1. An intelligent home multi-mode man-machine natural interaction system, characterized in that: it comprises a gesture recognition model pre-training module (1), a voice recognition model pre-training module (2), a gesture recognition module (3), a voice recognition module (4) and a multi-mode fusion module (5); the gesture recognition model pre-training module (1) trains the built network model on a gesture data set and saves the trained gesture recognition model; the voice recognition model pre-training module (2) loads a Chinese voice data set, trains an acoustic model and a language model in sequence, and saves the trained voice recognition model; the gesture recognition module (3) predicts acquired gestures with the gesture recognition model saved by the gesture recognition model pre-training module (1); the voice recognition module (4) calls the voice recognition model saved by the voice recognition model pre-training module (2) to recognize the collected audio; and the multi-mode fusion module (5) fuses the results of the two modalities from the gesture recognition module (3) and the voice recognition module (4) to obtain the final instruction.
2. The smart home multi-modal human-computer natural interaction system of claim 1, wherein: the gesture recognition model pre-training module (1) comprises a data set building module (101), a data preprocessing module (102), a model building module (103) and a model training module (104); the data set building module (101) presets five types of labels, namely close, open, up, down and nothing, collects an equal number of gesture pictures for each, and enlarges the data scale with data enhancement methods to provide data support for gesture recognition model training; the data preprocessing module (102) obtains the standardized model input through denoising, skin color segmentation, binarization, morphological processing and contour extraction; the model building module (103) builds the network model that extracts features from the gesture pictures; and the model training module (104) feeds the data set from the data set building module (101) into the network model of the model building module (103) in batches, updates the model parameters with the backpropagation algorithm, and saves the trained gesture recognition model.
3. The smart home multi-modal human-computer natural interaction system of claim 2, wherein: the data set building module (101) uses the camera to collect self-defined pictures for the five instructions and expands the data set with data enhancement methods, namely adding salt-and-pepper noise, adding Gaussian noise, reducing picture brightness, raising picture brightness, and rotating and flipping at random angles, completing construction of the data set; the data preprocessing module (102) performs denoising, skin color segmentation, binarization, morphological processing and contour extraction, with denoising realized by Gaussian filtering: a convolution template scans each pixel in the image, and the weighted average gray value of the pixels in its neighborhood replaces the value of the central pixel; for a two-dimensional template of size m×n, the point (x, y) on the convolution template satisfies:
G(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{(x - m/2)^2 + (y - n/2)^2}{2\sigma^2}}
where σ is the standard deviation of the normal distribution (the smaller its value, the sharper the image), and m and n are the dimensions of the convolution template;
the first of the two skin color segmentation methods is skin color segmentation based on an adaptive threshold method, and a gray level histogram is calculated and normalized; then calculating the mean value of the gray levels; then, calculating a zero order moment u [ i ] and a first order moment v [ i ] according to the histogram; then, the maximum inter-class variance f [ i ] is calculated, at this time, the gray value of the obtained variance is the adaptive threshold, and the formula is as follows:
f[i] = \frac{\left(v_{\text{total}}\, u[i] - v[i]\right)^2}{u[i]\left(1 - u[i]\right)}

where v_{\text{total}} is the first-order moment of the entire histogram (the global mean gray level);
the other is skin color segmentation based on the HSV color space. The SkinMask mode captures the gesture frame and converts it to the HSV color space; obtains the HSV values of every pixel, i.e. splits the matrix into three two-dimensional matrices; and finally defines a mask over the H, S and V values according to the skin color range, setting a judgment condition so that the mask is set to black within the skin color range. After skin color segmentation, the selected image is binarized; the binarization algorithm is computed by the following formula, where T is the threshold:
g(x, y) = \begin{cases} 0, & f(x, y) < T \\ 255, & f(x, y) \geq T \end{cases}
morphological processing then removes the black specks left by skin color segmentation and the white specks left on the background, using erosion and dilation, where dilation takes a local maximum and erosion takes a local minimum;
the gesture contour is then extracted from the skin color region: after the preprocessed image is obtained, pseudo-contours are removed and the contour with the largest area is located; the moments of each order, perimeter, area, centroid, shortest and longest path lengths and circumscribed rectangle of each contour are computed; the outer envelope and the set of defect points of each contour are acquired; after a second removal of false contours, the feature vector of the contour is computed relative to the centroid; finally, the points in the contour that may be fingers are located in sequence.
4. The smart home multi-modal human-computer natural interaction system of claim 1, wherein: the voice recognition model pre-training module (2) comprises a data set loading module (201), an acoustic model building module (202), a language model building module (203) and a model training module (204); the data set loading module (201) downloads a Chinese voice data set and specifies the file path; the acoustic model building module (202) obtains the actual pinyin symbol sequence of the speech; the language model building module (203) converts the pinyin sequence obtained by the acoustic model building module (202) into the final text result and outputs it; and the model training module (204) feeds the data from the data set loading module (201) through the acoustic model building module (202) and the language model building module (203) in sequence for training, and saves the trained voice recognition model.
5. The smart home multi-modal human-computer natural interaction system of claim 1, wherein: the gesture recognition module (3) comprises a gesture acquisition module (301), a model calling module (302) and a visualization module (303), wherein the gesture acquisition module (301) is used for acquiring a new single gesture input; the model calling module (302) calls the model trained by the gesture recognition model pre-training module (1), and takes the gesture acquired by the gesture acquisition module (301) as input to obtain a gesture prediction result; and the visualization module (303) displays the prediction result in a new window.
6. The smart home multi-modal human-computer natural interaction system of claim 1, wherein: the voice recognition module (4) comprises a recording module (401), a model calling module (402) and a text mapping module (403); the recording module (401) collects audio within a limited time and saves it as a wav file; the model calling module (402) calls the model file saved by the voice recognition model pre-training module (2) and takes the wav file saved by the recording module (401) as the new model input, obtaining the text recognized from the voice; and the text mapping module (403) computes the similarity between that text and the Chinese corresponding to each label preset in the gesture recognition model pre-training module (1), selecting the label with the largest similarity value as the instruction result of the voice recognition.
7. The smart home multi-modal human-computer natural interaction system of claim 1, wherein: the multi-mode fusion module (5) fuses the two modal results from the gesture recognition module (3) and the voice recognition module (4), predicting the class with the highest probability across the gesture recognition and voice recognition classifiers by a voting method to obtain the final instruction.
8. The intelligent home multi-mode man-machine natural interaction method is characterized by comprising the following steps: the method comprises the following steps:
a) First, gesture pictures are collected with OpenCV, the data set is expanded with data enhancement methods, and the pictures in the data set are preprocessed into standardized input; the CNN model used by the gesture recognition part is built from twelve layers, and the Resnet50 model packaged in Keras is also called; the two network models are trained separately on the preprocessed data set, and the trained gesture recognition models are saved;
b) Next, the acoustic model is built: a deep convolutional neural network based on the Keras and TensorFlow frameworks, combined with CTC decoding; the language model is a bigram model; the acoustic model and language model are trained on the THCHS30 Chinese voice data set, and the trained voice recognition model is saved;
c) acquiring the user's current gesture picture and, in sequence, applying Gaussian denoising, performing skin-color segmentation either in Binary mode based on an adaptive threshold method or in SkinMask mode based on the HSV color space, then binarizing to separate the target from the background and noise regions of the image, applying erosion and dilation, and finally extracting the gesture contour from the skin region (a preprocessing sketch follows step e) below); the processed picture is fed to the CNN and ResNet50 models respectively as input to obtain the instruction each model predicts for the current gesture;
d) collecting the user's audio and storing it as a wav file; framing and windowing the wav file to obtain a spectrogram; feeding the spectrogram to the trained acoustic model and decoding with CTC to obtain a Chinese pinyin sequence; feeding that sequence to the language model to obtain the character combination corresponding to the pinyin sequence, i.e. the voice recognition result;
e) calculating the similarity between the text result of the voice recognition and each label of the gesture recognition, mapping the voice result onto a gesture label, and then performing weighted voting over the gesture recognition results and the mapped voice recognition result; the class with the highest probability is taken as the final instruction.
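As referenced in step c), the following is a minimal sketch of the preprocessing chain in OpenCV, following the SkinMask (HSV) route; the adaptive-threshold Binary route is analogous. The HSV skin range and file names are common heuristics and hypothetical placeholders, not values taken from the patent.

```python
# Minimal sketch of step c): denoise, segment skin, binarize, erode/dilate,
# and extract the gesture contour before feeding the CNN / ResNet50 models.
import cv2
import numpy as np

frame = cv2.imread("gesture.jpg")                      # hypothetical input picture
blur = cv2.GaussianBlur(frame, (5, 5), 0)              # Gaussian denoising

hsv = cv2.cvtColor(blur, cv2.COLOR_BGR2HSV)
lower, upper = np.array([0, 30, 60]), np.array([20, 150, 255])
mask = cv2.inRange(hsv, lower, upper)                  # skin-color segmentation (binary mask)

kernel = np.ones((3, 3), np.uint8)
mask = cv2.erode(mask, kernel, iterations=1)           # erosion
mask = cv2.dilate(mask, kernel, iterations=2)          # dilation

contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
if contours:
    hand = max(contours, key=cv2.contourArea)          # gesture outline
    out = cv2.drawContours(frame.copy(), [hand], -1, (0, 255, 0), 2)
    cv2.imwrite("gesture_contour.jpg", out)            # processed picture for the models
```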
9. The smart home multi-modal human-computer natural interaction method according to claim 8, characterized in that: in step a), the data enhancement methods used to expand the data set comprise adding salt-and-pepper noise, adding Gaussian noise, decreasing picture brightness, increasing picture brightness, random-angle rotation, and flipping; the pictures in the data set are denoised with Gaussian filtering, skin-color segmentation is performed in Binary mode based on an adaptive threshold method and in SkinMask mode based on the HSV color space, binarization and the morphological operations of erosion and dilation are applied, and preprocessing of the data is completed by extracting the gesture contour from the skin region.
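A minimal sketch of the enhancement menu of claim 9 in OpenCV/NumPy follows; all parameter values (noise amount, sigma, brightness factors, rotation range) are illustrative assumptions.

```python
# Minimal sketch of the data enhancement methods of claim 9.
import cv2
import numpy as np

def salt_pepper(img, amount=0.01):
    """Set a random fraction of pixels to white (salt) or black (pepper)."""
    out = img.copy()
    n = int(amount * img.shape[0] * img.shape[1])
    ys = np.random.randint(0, img.shape[0], n)
    xs = np.random.randint(0, img.shape[1], n)
    out[ys[: n // 2], xs[: n // 2]] = 255
    out[ys[n // 2:], xs[n // 2:]] = 0
    return out

def gaussian_noise(img, sigma=10):
    noise = np.random.normal(0, sigma, img.shape)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)

def brightness(img, factor):
    """factor < 1 darkens the picture, factor > 1 brightens it."""
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)

def rotate(img, angle):
    h, w = img.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(img, m, (w, h))

img = cv2.imread("gesture.jpg")  # hypothetical source picture
augmented = [salt_pepper(img), gaussian_noise(img), brightness(img, 0.6),
             brightness(img, 1.4), rotate(img, np.random.uniform(-30, 30)),
             cv2.flip(img, 1)]   # 1 = horizontal flip
```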
CN202011339808.4A 2020-11-25 2020-11-25 Intelligent home multi-mode man-machine natural interaction system and method thereof Pending CN112462940A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011339808.4A CN112462940A (en) 2020-11-25 2020-11-25 Intelligent home multi-mode man-machine natural interaction system and method thereof
PCT/CN2021/078420 WO2022110564A1 (en) 2020-11-25 2021-03-01 Smart home multi-modal human-machine natural interaction system and method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011339808.4A CN112462940A (en) 2020-11-25 2020-11-25 Intelligent home multi-mode man-machine natural interaction system and method thereof

Publications (1)

Publication Number Publication Date
CN112462940A true CN112462940A (en) 2021-03-09

Family

ID=74808312

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011339808.4A Pending CN112462940A (en) 2020-11-25 2020-11-25 Intelligent home multi-mode man-machine natural interaction system and method thereof

Country Status (2)

Country Link
CN (1) CN112462940A (en)
WO (1) WO2022110564A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115329578A (en) * 2022-08-19 2022-11-11 南京邮电大学 Three-dimensional modeling system and modeling method based on multi-mode fusion
CN116258655B (en) * 2022-12-13 2024-03-12 合肥工业大学 Real-time image enhancement method and system based on gesture interaction
CN116434027A (en) * 2023-06-12 2023-07-14 深圳星寻科技有限公司 Artificial intelligent interaction system based on image recognition
CN117316158B (en) * 2023-11-28 2024-04-12 科大讯飞股份有限公司 Interaction method, device, control equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100281435A1 (en) * 2009-04-30 2010-11-04 At&T Intellectual Property I, L.P. System and method for multimodal interaction using robust gesture processing
CN102339129B (en) * 2011-09-19 2013-12-25 北京航空航天大学 Multichannel human-computer interaction method based on voice and gestures
CN102824092A (en) * 2012-08-31 2012-12-19 华南理工大学 Intelligent gesture and voice control system of curtain and control method thereof
CN104965592A (en) * 2015-07-08 2015-10-07 苏州思必驰信息科技有限公司 Voice and gesture recognition based multimodal non-touch human-machine interaction method and system
CN111709295A (en) * 2020-05-18 2020-09-25 武汉工程大学 SSD-MobileNet-based real-time gesture detection and recognition method and system

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107342076A (en) * 2017-07-11 2017-11-10 华南理工大学 A kind of intelligent home control system and method for the abnormal voice of compatibility
CN109902577A (en) * 2019-01-25 2019-06-18 华中科技大学 A kind of construction method of lightweight gestures detection convolutional neural networks model and application
CN109814722A (en) * 2019-02-25 2019-05-28 苏州长风航空电子有限公司 A kind of multi-modal man-machine interactive system and exchange method
CN110554774A (en) * 2019-07-22 2019-12-10 济南大学 AR-oriented navigation type interactive normal form system
CN110362210A (en) * 2019-07-24 2019-10-22 济南大学 The man-machine interaction method and device of eye-tracking and gesture identification are merged in Virtual assemble
CN111158491A (en) * 2019-12-31 2020-05-15 苏州莱孚斯特电子科技有限公司 Gesture recognition man-machine interaction method applied to vehicle-mounted HUD
CN111554279A (en) * 2020-04-27 2020-08-18 天津大学 Multi-mode man-machine interaction system based on Kinect

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CUI, Jiali; XIE, Wei; WANG, Yiding; JIA, Ruiming: "Static Hand Gesture Digit Recognition Based on Shape Features", Journal of North China University of Technology, no. 03 *
YANG, Huanzheng: "Design and Implementation of a Chinese Speech Recognition Model Based on Deep Learning", Journal of Hunan Post and Telecommunication College, pages 24-27 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113190107A (en) * 2021-03-16 2021-07-30 青岛小鸟看看科技有限公司 Gesture recognition method and device and electronic equipment
CN113190107B (en) * 2021-03-16 2023-04-14 青岛小鸟看看科技有限公司 Gesture recognition method and device and electronic equipment
CN113311939A (en) * 2021-04-01 2021-08-27 江苏理工学院 Intelligent sound box control system based on gesture recognition
CN113299132A (en) * 2021-06-08 2021-08-24 上海松鼠课堂人工智能科技有限公司 Student speech skill training method and system based on virtual reality scene
CN113095446A (en) * 2021-06-09 2021-07-09 中南大学 Abnormal behavior sample generation method and system
CN113849068A (en) * 2021-09-28 2021-12-28 中国科学技术大学 Gesture multi-mode information fusion understanding and interacting method and system
CN113849068B (en) * 2021-09-28 2024-03-29 中国科学技术大学 Understanding and interaction method and system for multi-modal information fusion of gestures
CN114610157A (en) * 2022-03-23 2022-06-10 北京拙河科技有限公司 Gesture interaction based method and system
CN115145402A (en) * 2022-09-01 2022-10-04 深圳市复米健康科技有限公司 Intelligent toy system with network interaction function and control method
CN117718969B (en) * 2024-01-18 2024-05-31 浙江孚宝智能科技有限公司 Household robot control system and method based on visual and auditory fusion

Also Published As

Publication number Publication date
WO2022110564A1 (en) 2022-06-02

Similar Documents

Publication Publication Date Title
CN112462940A (en) Intelligent home multi-mode man-machine natural interaction system and method thereof
CN111931701B (en) Gesture recognition method and device based on artificial intelligence, terminal and storage medium
CN111339990B (en) Face recognition system and method based on dynamic update of face features
CN111709310B (en) Gesture tracking and recognition method based on deep learning
CN104049754B (en) Real time hand tracking, posture classification and Interface Control
CN110796018B (en) Hand motion recognition method based on depth image and color image
CN108537147A (en) A kind of gesture identification method based on deep learning
CN111126280B (en) Gesture recognition fusion-based aphasia patient auxiliary rehabilitation training system and method
Vishwakarma et al. An efficient interpretation of hand gestures to control smart interactive television
WO2021208617A1 (en) Method and apparatus for recognizing station entering and exiting, terminal, and storage medium
CN112198966B (en) Stroke identification method and system based on FMCW radar system
CN112001394A (en) Dictation interaction method, system and device based on AI vision
CN109558855B (en) A kind of space gesture recognition methods combined based on palm contour feature with stencil matching method
Raees et al. Image based recognition of Pakistan sign language
CN114937179A (en) Junk image classification method and device, electronic equipment and storage medium
CN114445853A (en) Visual gesture recognition system recognition method
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN114639150A (en) Emotion recognition method and device, computer equipment and storage medium
CN113177531A (en) Speaking identification method, system, equipment and medium based on video analysis
CN111582382B (en) State identification method and device and electronic equipment
Nath et al. Embedded sign language interpreter system for deaf and dumb people
CN117173677A (en) Gesture recognition method, device, equipment and storage medium
Gaikwad et al. Recognition of American sign language using image processing and machine learning
US20230074386A1 (en) Method and apparatus for performing identity recognition on to-be-recognized object, device and medium
Jindal et al. Sign Language Detection using Convolutional Neural Network (CNN)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination