WO2022110564A1 - Smart home multi-modal human-machine natural interaction system and method therefor - Google Patents

Smart home multi-modal human-machine natural interaction system and method therefor

Info

Publication number
WO2022110564A1
Authority
WO
WIPO (PCT)
Prior art keywords
module
model
gesture
speech recognition
data set
Prior art date
Application number
PCT/CN2021/078420
Other languages
English (en)
French (fr)
Inventor
奚雪峰
邵帮丽
崔志明
付保川
杨敬晶
Original Assignee
苏州科技大学
苏州金比特信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州科技大学, 苏州金比特信息科技有限公司
Publication of WO2022110564A1 publication Critical patent/WO2022110564A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/254Fusion techniques of classification results, e.g. of results related to same input data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/20Image enhancement or restoration using local operators
    • G06T5/30Erosion or dilatation, e.g. thinning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/40Image enhancement or restoration using histogram techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/70Denoising; Smoothing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/136Segmentation; Edge detection involving thresholding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/90Determination of colour characteristics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/259Fusion by voting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223Execution procedure of a spoken command
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Definitions

  • The invention relates to a multi-modal human-machine natural interaction system and method for the smart home, and belongs to the field of smart home human-machine interaction.
  • Multi-modal fusion is mainly used to achieve model fusion between different modalities.
  • The purpose is to use one overall model to output the information features obtained from multiple information channels; because the model learns the information of several modalities, it obtains more comprehensive information, and it can keep working normally and produce correct output even when one modality fails or is missing, which greatly improves the robustness of the model. Because the models used for fusion are often unrelated, their individual errors do not affect one another, so errors do not accumulate.
  • The goal of gesture recognition research is to design a system that can be driven solely by gestures and responds differently as the gesture changes.
  • Gesture detection and segmentation are the primary tasks.
  • The conventional method is to detect hand movements through a combination of visual features of the hand, such as skin color, shape, pixel values and motion, and then perform gesture tracking to provide the frame-to-frame coordinates of the hand or finger position.
  • The trajectory of the hand movement is thus generated for the subsequent recognition stage.
  • The final goal of gesture recognition is to interpret the semantics of the gesture.
  • The essence of speech recognition is statistical pattern recognition. It relies on two models, an acoustic model and a language model: the former handles the correspondence between text and pinyin, and the latter gives the probability of a word appearing in the whole sentence.
  • The acoustic model classifies the acoustic features of the speech and maps them to phoneme-like units, while the language model splices the phonemes obtained by the acoustic model into a complete sentence; some text-processing operations on the recognition result then give the final output.
  • The smart home has developed to a certain extent, but existing smart home human-computer interaction still has some problems.
  • Infrared remote control by means of a remote control or a mobile phone, operated through buttons or a touch screen, requires a third-party mobile device and is not convenient enough; relying on a voice assistant to control home equipment means a single source of input data, fails to make full use of the flexibility of the human body, and cannot solve the problem of receiving ambiguous input. Gesture recognition, speech recognition and the development of multimodal technology provide a solution for this.
  • The purpose of the present invention is to overcome the deficiencies of the prior art and to provide a smart home multi-modal human-machine natural interaction system and a method thereof.
  • The smart home multi-modal human-computer natural interaction system is characterized by including a gesture recognition model pre-training module, a speech recognition model pre-training module, a gesture recognition module, a speech recognition module and a multi-modal fusion module.
  • The gesture recognition model pre-training module uses a gesture data set to train the constructed network model and saves the trained gesture recognition model.
  • The speech recognition model pre-training module loads a Chinese speech data set, trains the acoustic model and the language model in turn, and saves the trained speech recognition model.
  • The gesture recognition module uses the gesture recognition model saved by the gesture recognition model pre-training module to predict the collected gestures.
  • The speech recognition module calls the speech recognition model saved by the speech recognition model pre-training module to recognize the collected audio.
  • The multimodal fusion module fuses the results of the two modalities from the gesture recognition module and the speech recognition module to obtain the final command.
  • The gesture recognition model pre-training module includes a data set construction module, a data preprocessing module, a model construction module and a model training module. The data set construction module presets five classes of labels (close, open, up, down, nothing), collects the same number of gesture pictures for each, and uses data augmentation to expand the data scale, providing data support for training the gesture recognition model.
  • The data preprocessing module obtains the standardized model input after denoising, skin color segmentation, binarization, morphological processing and contour extraction.
  • The model construction module builds a network model for extracting gesture picture features.
  • The model training module feeds the data sets of the data set construction module into the network model of the model construction module in batches, updates the model parameters with the back-propagation algorithm, and saves the trained gesture recognition model.
  • The data set construction module uses a camera to collect pictures of the five self-defined instructions, and uses data augmentation (adding salt-and-pepper noise, adding Gaussian noise, reducing picture brightness, increasing picture brightness, rotating at random angles, and flipping) to expand the data set.
  • Denoising uses Gaussian filtering: a convolution template scans each pixel and replaces the centre pixel with the weighted average gray value of its neighbourhood. For a two-dimensional template of size m×n, the point (x, y) on the convolution template satisfies

    f(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{(x - m/2)^2 + (y - n/2)^2}{2\sigma^2}}

    where σ is the standard deviation of the normal distribution (the smaller its value, the sharper the image), and m and n are the dimensions of the convolution template.
  • The first skin color segmentation method is based on the adaptive threshold method.
  • The grayscale histogram is calculated and normalized; then the mean gray level is calculated; then the zero-order moment u[i] and the first-order moment v[i] are calculated from the histogram; finally the maximum between-class variance f[i] is calculated, and the gray value at which this variance is maximal is the adaptive threshold.
  • The other is skin color segmentation based on the HSV color space.
  • The SkinMask mode first obtains the gesture bounding box and converts it to HSV space; it then obtains the HSV value of each pixel of the picture, i.e. one two-dimensional matrix is split into three two-dimensional matrices; finally, masks for the H, S and V values are defined according to the skin color range, judgment conditions are set, and pixels outside the skin color range are masked black. After skin color segmentation, the selected image is binarized.
  • The binarization algorithm is calculated with the following formula, where T is the threshold:

    g(x, y) = \begin{cases} 255, & f(x, y) \ge T \\ 0, & f(x, y) < T \end{cases}

  • Morphological processing applies erosion and dilation to the black spots remaining after skin color segmentation, or to the white spots left on the background.
  • Dilation is the operation of finding a local maximum.
  • Erosion is the operation of finding a local minimum.
  • In the method of extracting the gesture contour by skin color, after obtaining the preprocessed image, the pseudo-contours are first removed and the contour with the largest area is located; then each contour's moments of every order, perimeter, area, centroid, shortest and longest diameters, and circumscribed rectangle are calculated; then the outer envelope of each contour and the set of defect points are obtained; then pseudo-contours are removed a second time and the centroid-based feature vector of the contour is calculated; finally, the points of the contour that may be fingers are located in turn.
  • The speech recognition model pre-training module includes a data set loading module, an acoustic model construction module, a language model construction module and a model training module. The data set loading module downloads the Chinese speech data set and specifies the file path.
  • The acoustic model construction module, based on the Keras and TensorFlow frameworks, builds a deep convolutional neural network with reference to VGG and, combined with CTC decoding, merges consecutive identical symbols into the same symbol and then removes the silence separator markers to obtain the actual pinyin symbol sequence.
  • The language model construction module converts the pinyin sequence obtained by the acoustic model construction module into the final text result and outputs it.
  • The model training module feeds the data obtained by the data set loading module into the acoustic model construction module and the language model construction module in turn for training, and saves the trained speech recognition model.
  • The gesture recognition module includes a gesture acquisition module, a model invocation module and a visualization module. The gesture acquisition module is used to obtain a new single gesture input.
  • The model invocation module invokes the model trained by the gesture recognition model pre-training module and uses the gesture collected by the gesture acquisition module as input to obtain a gesture prediction result.
  • The visualization module displays the prediction result in a new window.
  • The speech recognition module includes a recording module, a model calling module and a text mapping module. The recording module collects audio within a time limit and saves it as a wav file.
  • The model calling module calls the model file saved by the speech recognition model pre-training module and uses the saved wav file as the new model input to obtain the result of speech recognized into text.
  • The text mapping module computes the similarity between the text result and the Chinese text corresponding to each label preset in the gesture recognition model pre-training module, and selects the label with the largest similarity value as the instruction result corresponding to the speech recognition.
  • The multi-modal fusion module fuses the results of the two modalities from the gesture recognition module and the speech recognition module, selecting the class with the highest probability across the gesture recognition and speech recognition classifiers to produce the final instruction.
  • The smart home multi-modal human-machine natural interaction method of the present invention includes the following steps:
  • c) Collect the user's current gesture picture and perform, in turn, Gaussian denoising and skin color segmentation in the Binary mode based on the adaptive threshold method or the SkinMask mode based on the HSV color space; then binarize to extract the target from the background and noise regions of the image; after erosion and dilation, extract the gesture contour by skin color; feed the processed pictures into the CNN and Resnet50 models respectively, and obtain the instruction corresponding to the current gesture predicted by the two models.
  • The data augmentation methods adopted are adding salt-and-pepper noise, adding Gaussian noise, reducing picture brightness, increasing picture brightness, rotating at random angles, and flipping, used to expand the data set. The pictures in the data set are denoised with Gaussian filtering; skin color segmentation is then performed with the Binary mode based on the adaptive threshold method and the SkinMask mode based on the HSV color space; binarization and the morphological operations of erosion and dilation follow; finally the gesture contour is extracted by skin color, completing the data preprocessing.
  • Compared with the prior art, the present invention has significant advantages and beneficial effects, embodied in the following aspects:
  • The smart home multi-modal human-computer natural interaction system and method of the present invention use human gestures and voice to control home equipment with multiple instructions, overcome the low accuracy of a single modality, improve the accuracy of instructions, and make human-machine interaction more natural.
  • Home equipment can accept a variety of instructions and the user can control it in a variety of ways, shedding dependence on traditional buttons and achieving contactless control.
  • Figure 1: Schematic diagram of the principle of the system of the present invention;
  • Figure 2: Schematic diagram of the architecture of the system of the present invention;
  • Figure 3: Schematic diagram of the architecture of the gesture recognition model pre-training module;
  • Figure 4a: Schematic diagram of a predefined gesture (open);
  • Figure 4b: Schematic diagram of a predefined gesture (up);
  • Figure 4c: Schematic diagram of a predefined gesture (down);
  • Figure 4d: Schematic diagram of a predefined gesture (close);
  • Figure 5: Schematic flow diagram of the data preprocessing module;
  • Figure 6: Schematic diagram of the architecture of the speech recognition model pre-training module;
  • Figure 7: Schematic diagram of the architecture of the gesture recognition module;
  • Figure 8: Schematic diagram of the architecture of the speech recognition module;
  • Figure 9: Schematic diagram of the principle of the multimodal fusion module.
  • For home equipment control in the smart home field, taking the air conditioner as an example, the present invention adopts a contactless approach and uses decision fusion for multi-modal fusion.
  • The models involved in the fusion do not affect each other and meet the application requirements.
  • The smart home multi-modal human-computer natural interaction system includes gesture recognition model pre-training module 1, speech recognition model pre-training module 2, gesture recognition module 3, speech recognition module 4 and multi-modal fusion module 5.
  • Gesture recognition model pre-training module 1 and speech recognition model pre-training module 2 respectively construct the two pre-trained models for gesture recognition and speech recognition, and gesture recognition module 3 and speech recognition module 4 call the pre-trained models for on-site collection and prediction.
  • Multimodal fusion module 5 fuses the results of the two modalities according to the weighted voting method.
  • Gesture recognition model pre-training module 1 includes data set construction module 101, data preprocessing module 102, model construction module 103 and model training module 104.
  • Data set construction module 101 presets five classes of labels (close, open, up, down, nothing), collects the same number of gesture pictures for each, and uses data augmentation to expand the data scale, providing data support for gesture recognition model training.
  • Data preprocessing module 102 obtains the standardized model input after denoising, skin color segmentation, binarization, morphological processing and contour extraction.
  • Model construction module 103 builds a network model for extracting gesture picture features.
  • Model training module 104 feeds the data set of data set construction module 101 into the network model of model construction module 103 in batches, updates the model parameters with the back-propagation algorithm, and saves the trained gesture recognition model.
  • The process of gesture recognition model pre-training module 1 is shown in Figure 3.
  • Data set construction module 101 builds the gesture data set, using the camera to collect custom gestures as shown in Figures 4a to 4d: "ok" corresponds to open (Figure 4a), "V" corresponds to up (Figure 4b), "clenched fist" corresponds to down (Figure 4c), and "vertical palm" corresponds to the close command (Figure 4d). An additional "nothing" class is defined, i.e. interference pictures that match none of the above four gestures. Data augmentation (adding salt-and-pepper noise, adding Gaussian noise, reducing picture brightness, increasing picture brightness, rotating at random angles, and flipping) is then used to expand the data set.
  • The final data set includes 28105 gesture pictures, five gestures in total with 5621 pictures for each gesture, providing data support for model training.
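As an illustration of this augmentation step, the following Python sketch (OpenCV and NumPy; the noise levels, brightness offsets and rotation range are assumed example values, not taken from the patent) produces the six variants listed above for one image:

    import cv2
    import numpy as np

    def augment(img):
        """Return the six augmented variants described above (illustrative parameters)."""
        out = []
        # salt-and-pepper noise: flip about 2% of the pixels to black or white
        sp = img.copy()
        mask = np.random.rand(*img.shape[:2])
        sp[mask < 0.01] = 0
        sp[mask > 0.99] = 255
        out.append(sp)
        # additive Gaussian noise (assumed sigma = 10)
        noisy = np.clip(img.astype(np.float64) + np.random.normal(0, 10, img.shape), 0, 255)
        out.append(noisy.astype(np.uint8))
        # reduced and increased brightness (assumed offsets -60 and +60)
        out.append(cv2.convertScaleAbs(img, beta=-60))
        out.append(cv2.convertScaleAbs(img, beta=60))
        # rotation by a random angle about the image centre
        h, w = img.shape[:2]
        M = cv2.getRotationMatrix2D((w / 2, h / 2), np.random.uniform(-30, 30), 1.0)
        out.append(cv2.warpAffine(img, M, (w, h)))
        # horizontal flip
        out.append(cv2.flip(img, 1))
        return out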
  • Data preprocessing module 102 preprocesses the data in data set construction module 101 to obtain standardized input. As shown in Figure 5, the data preprocessing includes denoising, skin color segmentation, binarization, morphological processing and contour extraction.
  • Gaussian filtering is used to achieve denoising.
  • The specific operation of Gaussian filtering is: scan each pixel in the image with a convolution template and replace the centre pixel with the weighted average gray value of the pixels in its neighbourhood. If the size of the two-dimensional template is m×n, the point (x, y) on the convolution template satisfies

    f(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{(x - m/2)^2 + (y - n/2)^2}{2\sigma^2}}

    where σ is the standard deviation of the normal distribution (the smaller its value, the sharper the image), and m and n are the dimensions of the convolution template.
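A minimal sketch of this denoising step with OpenCV (the 5×5 kernel and σ = 1.0 are assumed example values):

    import cv2

    img = cv2.imread("gesture.jpg")  # any BGR gesture frame
    # 5x5 Gaussian convolution template (m = n = 5); a smaller sigma keeps
    # more detail, a larger sigma smooths away more noise
    denoised = cv2.GaussianBlur(img, (5, 5), sigmaX=1.0)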
  • Skin color segmentation screens, detects and separates the pixel regions of the image where human skin is located.
  • One method is skin color segmentation based on the adaptive threshold method.
  • The specific operation is: first calculate the grayscale histogram and normalize it; then calculate the mean gray level; then calculate the zero-order moment u[i] and the first-order moment v[i] from the histogram; then calculate the maximum between-class variance f[i]; the gray value at which this variance is maximal is the adaptive threshold, with the following formula:

    f[i] = \frac{(v[255] \cdot u[i] - v[i])^2}{u[i] (1 - u[i])}
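These steps amount to Otsu's method; a direct NumPy transcription (a hypothetical helper mirroring u[i], v[i] and f[i] above) might look like the sketch below, and cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU) computes the same threshold in one call:

    import numpy as np

    def adaptive_threshold(gray):
        """Adaptive threshold from the normalized histogram, as described above."""
        hist, _ = np.histogram(gray, bins=256, range=(0, 256))
        p = hist / hist.sum()                # normalized grayscale histogram
        u = np.cumsum(p)                     # zero-order moments u[i]
        v = np.cumsum(p * np.arange(256))    # first-order moments v[i]
        mean = v[-1]                         # overall mean gray level
        denom = u * (1.0 - u)
        denom[denom == 0] = 1e-12            # avoid division by zero at the ends
        f = (mean * u - v) ** 2 / denom      # between-class variance f[i]
        return int(np.argmax(f))             # gray value maximizing f[i]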
  • The SkinMask mode first obtains the gesture bounding box and converts it to HSV space; it then obtains the HSV value of each pixel of the picture, i.e. one two-dimensional matrix is split into three two-dimensional matrices; finally, masks for the H, S and V values are defined according to the skin color range, judgment conditions are set, and pixels outside the skin color range are masked black. The HSV model shows that as white is added, the parameter V stays the same while the parameter S keeps decreasing; this mode is very effective when there is sufficient light. The selected image is then binarized, dividing the pixels in the image into two classes according to gray value.
  • The binarization algorithm is calculated with the following formula:

    g(x, y) = \begin{cases} 255, & f(x, y) \ge T \\ 0, & f(x, y) < T \end{cases}

  • The specific method is to set a threshold T in advance and divide the pixels of the image against this threshold.
  • When the gray level of a pixel is less than the threshold T, it is rendered black; when the gray level is greater than or equal to the threshold T, it is rendered white.
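A sketch of the SkinMask segmentation followed by thresholding (the HSV skin range and the threshold T = 127 below are assumed example values; the patent gives no numeric bounds):

    import cv2
    import numpy as np

    roi = cv2.imread("gesture_roi.jpg")            # gesture bounding-box image (BGR)
    hsv = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)     # split into H, S, V planes
    lower = np.array([0, 48, 50], dtype=np.uint8)  # assumed skin-color range
    upper = np.array([20, 255, 255], dtype=np.uint8)
    mask = cv2.inRange(hsv, lower, upper)          # outside the range -> 0 (black)
    skin = cv2.bitwise_and(roi, roi, mask=mask)

    # fixed-threshold binarization: gray >= T -> 255 (white), gray < T -> 0 (black)
    gray = cv2.cvtColor(skin, cv2.COLOR_BGR2GRAY)
    T = 127
    _, binary = cv2.threshold(gray, T, 255, cv2.THRESH_BINARY)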
  • Morphological processing involves two operations, erosion and dilation: dilation is the operation of finding a local maximum, and erosion is the operation of finding a local minimum.
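Continuing the sketch above, the morphological cleanup can be written with a small assumed 3×3 kernel:

    import cv2
    import numpy as np

    kernel = np.ones((3, 3), np.uint8)
    # erosion (local minimum) removes isolated white specks on the background
    eroded = cv2.erode(binary, kernel, iterations=1)
    # dilation (local maximum) refills small black holes inside the hand region
    cleaned = cv2.dilate(eroded, kernel, iterations=1)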
  • In the method of extracting the gesture contour by skin color, after obtaining the preprocessed image, the pseudo-contours are first removed and the contour with the largest area is located; then each contour's moments of every order, perimeter, area, centroid, shortest and longest diameters, and circumscribed rectangle are calculated; then the outer envelope of each contour and the set of defect points are obtained; then pseudo-contours are removed a second time and the centroid-based feature vector of the contour is calculated; finally, the points of the contour that may be fingers are located in turn.
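A sketch of this contour stage with standard OpenCV calls; pseudo-contour removal is reduced here to an assumed minimum-area filter, and the convexity-defect points serve as finger candidates:

    import cv2

    contours, _ = cv2.findContours(cleaned, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contours = [c for c in contours if cv2.contourArea(c) > 1000]  # drop pseudo-contours
    hand = max(contours, key=cv2.contourArea)          # contour with the largest area

    m = cv2.moments(hand)                              # moments of each order
    cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]  # centroid
    perimeter = cv2.arcLength(hand, closed=True)
    area = cv2.contourArea(hand)
    x, y, w, h = cv2.boundingRect(hand)                # circumscribed rectangle

    hull = cv2.convexHull(hand, returnPoints=False)    # outer envelope (as indices)
    defects = cv2.convexityDefects(hand, hull)         # defect-point set; finger candidates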
  • Model construction module 103 builds the network models for extracting picture features.
  • The CNN model consists of two convolutional layers, one pooling layer, two fully connected layers, two dropout layers for alleviating overfitting, and one flatten layer used to connect the convolutional layers and the fully connected layers, plus four activation functions, twelve layers in total.
  • The CNN model is trained for 15 rounds.
  • The Resnet50 model encapsulated by keras is called directly, with 50 network layers and the input size adjusted to 200*200; the preprocessed image data is used as input for 10 rounds of training. Model training module 104 takes 20% of the data set in data set construction module 101 as the test set, then extracts another 20% as the validation set, finally obtaining a data set of 17987 pictures for training, and saves the two trained models.
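A Keras sketch consistent with that twelve-layer description and the ResNet50 call; the filter counts, dropout rates and the 64×64 CNN input are assumed values not given in the patent:

    from tensorflow.keras import layers, models
    from tensorflow.keras.applications import ResNet50

    cnn = models.Sequential([                          # twelve layers in total
        layers.Conv2D(32, 3, input_shape=(64, 64, 1)), # 1  convolution
        layers.Activation("relu"),                     # 2  activation
        layers.Conv2D(64, 3),                          # 3  convolution
        layers.Activation("relu"),                     # 4  activation
        layers.MaxPooling2D(2),                        # 5  pooling
        layers.Dropout(0.25),                          # 6  dropout (overfitting)
        layers.Flatten(),                              # 7  flatten: conv -> dense
        layers.Dense(128),                             # 8  fully connected
        layers.Activation("relu"),                     # 9  activation
        layers.Dropout(0.5),                           # 10 dropout
        layers.Dense(5),                               # 11 fully connected (5 labels)
        layers.Activation("softmax"),                  # 12 activation
    ])
    cnn.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    # cnn.fit(x_train, y_train, epochs=15, validation_data=(x_val, y_val))

    # ResNet50 as packaged by Keras: 50 layers, input resized to 200x200, 10 rounds
    resnet = ResNet50(weights=None, input_shape=(200, 200, 3), classes=5)
    resnet.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    # resnet.fit(x_train_rgb, y_train, epochs=10)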
  • Speech recognition model pre-training module 2 downloads and loads the speech data set.
  • THCHS30 contains more than 10,000 Chinese speech files with a total duration of more than 30 hours; the sampling frequency is 16kHz and the sample size is 16 bits.
  • Acoustic model construction module 202, in order to obtain the actual pinyin symbol sequence, builds a deep convolutional neural network with reference to VGG, based on the Keras and TensorFlow frameworks. Language model construction module 203 uses a statistical language model to obtain the maximum-probability character corresponding to each pinyin, converting the pinyin obtained by acoustic model construction module 202 into the final recognized text and outputting it. Model training module 204 feeds the data obtained by data set loading module 201 into acoustic model construction module 202 and language model construction module 203 in turn for training, and saves the trained models.
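The CTC post-processing described earlier (merge consecutive identical symbols, then drop the blank/silence separator) can be sketched as a greedy best-path decoder; blank_id and the pinyin lookup table are assumptions for illustration:

    import numpy as np

    def ctc_greedy_decode(log_probs, id_to_pinyin, blank_id=0):
        """log_probs: (time_steps, vocab) network output; returns a pinyin list."""
        best = np.argmax(log_probs, axis=1)  # best symbol per frame
        merged = [s for i, s in enumerate(best) if i == 0 or s != best[i - 1]]
        return [id_to_pinyin[s] for s in merged if s != blank_id]  # drop blanks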
  • The gesture prediction process of gesture recognition module 3 is shown in Figure 7. Suppose the gesture captured by the camera in gesture acquisition module 301 is "clenched fist", as shown in Figure 4c. Mask mode is a new capture method: a background image is shot so that its content can be removed, and the gesture is captured by subtracting the background content from the new frame content of the ROI window. Once prediction mode is turned on, the labels used during model training appear for comparison with the gesture captured by the camera.
  • Model calling module 302 directly calls the model trained by gesture recognition model pre-training module 1.
  • The gestures collected by gesture acquisition module 301 undergo Gaussian denoising, skin color segmentation and binarization, morphological processing, and contour extraction, and serve as model input.
  • From the results of the ResNet50 model and the CNN model, the category with the highest probability is selected.
  • The result is displayed by visualization module 303.
  • The speech recognition process of speech recognition module 4 is shown in Figure 8.
  • Recording module 401 collects audio within a time limit (suppose "turn off the air conditioner" is collected at this point) and saves it as a wav file.
  • Model calling module 402 calls the model file saved in speech recognition model pre-training module 2 and takes the wav file saved by recording module 401 as the new model input.
  • After the acoustic model and CTC decoding, the pinyin sequence "guan1 bi4 kong1 tiao2" is obtained, and the language model then yields the corresponding text result "turn off the air conditioner". The cosine similarity between the text result and the text corresponding to the five gesture labels preset in gesture recognition model pre-training module 1 is computed, and the label with the largest similarity value is selected as the result.
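The text-to-label mapping can be sketched as a character-level cosine similarity between the recognized text and a preset phrase for each label; the phrases below are assumed examples, not the patent's actual mapping table:

    from collections import Counter
    import math

    LABEL_TEXT = {"close": "关闭空调", "open": "打开空调",  # assumed preset phrases
                  "up": "调高空调", "down": "调低空调", "nothing": "无"}

    def cosine(a, b):
        va, vb = Counter(a), Counter(b)
        dot = sum(va[ch] * vb[ch] for ch in va)
        na = math.sqrt(sum(v * v for v in va.values()))
        nb = math.sqrt(sum(v * v for v in vb.values()))
        return dot / (na * nb) if na and nb else 0.0

    def map_to_label(text):
        return max(LABEL_TEXT, key=lambda lab: cosine(text, LABEL_TEXT[lab]))

    # map_to_label("关闭空调") -> "close"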
  • Multimodal fusion module 5 fuses the results of the two modalities from gesture recognition module 3 and speech recognition module 4. Based on the voting method, the Resnet50 result in gesture recognition has weight 0.5, the CNN model result has weight 0.3, and the result mapped from speech recognition onto the gesture labels has weight 0.2; the three results are weighted and summed, and the class with the highest probability is selected as the final air-conditioning instruction. In this example both the gesture recognition and speech recognition predicted categories are "close", so the final command is the close command. If the results differ, the three results are multiplied by their respective weights, the probability values of the same label are added, and the category with the highest probability is finally selected as the final command.
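A sketch of that weighted vote with the stated weights (0.5 / 0.3 / 0.2); the per-label probability dictionaries are illustrative inputs:

    LABELS = ["close", "open", "up", "down", "nothing"]

    def fuse(resnet_probs, cnn_probs, speech_probs, weights=(0.5, 0.3, 0.2)):
        """Weighted sum of per-label probabilities from the three classifiers."""
        scores = {lab: weights[0] * resnet_probs.get(lab, 0.0)
                     + weights[1] * cnn_probs.get(lab, 0.0)
                     + weights[2] * speech_probs.get(lab, 0.0)
                  for lab in LABELS}
        return max(scores, key=scores.get)  # class with the highest probability

    # e.g. all three classifiers favour "close" -> final instruction "close"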
  • The smart home multi-modal human-computer natural interaction system and method of the present invention use human gestures and voice to control household equipment with multiple instructions, overcoming the low accuracy of a single modality and making human-computer interaction more natural. Starting from the way humans perceive, home equipment can accept a variety of instructions and users can control it in a variety of ways, shedding dependence on traditional buttons and achieving contactless control. Combining the two modalities of voice recognition and gesture recognition overcomes the limitations that gesture recognition is easily affected by light and speech recognition is easily affected by environmental noise; errors between modalities do not stack or interfere with each other, and when one modality fails the home equipment can still work. Applying multimodal fusion to the control of home equipment improves the correctness of instructions.
  • The two modalities of speech recognition and gesture recognition are used for the interaction process, and the contactless multi-modal fusion method is used for smart home human-computer interaction.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Probability & Statistics with Applications (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention relates to a smart home multi-modal human-machine natural interaction system and method. A gesture recognition model pre-training module trains the constructed network model with a gesture data set suited to the scene and saves the trained gesture recognition model; a speech recognition model pre-training module trains an acoustic model and a language model in turn with a Chinese speech data set and saves the trained speech recognition model; a gesture recognition module uses the saved gesture recognition model to predict collected gestures; a speech recognition module calls the saved speech recognition model to recognize collected audio; a multimodal fusion module fuses the results of the two modalities from the gesture recognition module and the speech recognition module to obtain the final instruction. Fusing the two modalities of gesture recognition and speech recognition allows home equipment to receive instructions in multiple forms, improving the correctness of the instructions.

Description

Smart home multi-modal human-machine natural interaction system and method therefor

Technical Field

The invention relates to a smart home multi-modal human-machine natural interaction system and a method thereof, belonging to the field of smart home human-machine interaction.

Background Art

Multi-modal fusion is mainly used to achieve model fusion between different modalities. The aim is to use one overall model to output the information features obtained from multiple information channels; because it learns the information of several modalities, the model can obtain more comprehensive feature information, and it can keep working normally and produce correct output even when a modality fails or is missing, greatly improving the robustness of the model. Since the models used for fusion are often unrelated, their individual errors do not affect one another, so no accumulation of errors occurs.

The research goal of gesture recognition is to design a system that can be driven purely by gestures and reacts differently as the gesture changes. Gesture detection and segmentation are the primary tasks; the conventional method detects hand movements through a combination of visual features such as the skin color, shape, pixel values and motion of the hand, and then performs gesture tracking to provide the frame-to-frame coordinates of the hand or finger position, producing a trajectory of the hand movement for the subsequent recognition stage. The final goal of gesture recognition is to interpret the semantics the gesture is intended to express.

Speech recognition is in essence statistical pattern recognition and relies on two models, an acoustic model and a language model: the former handles the correspondence between text and pinyin, while the latter gives the probability of a word appearing in the whole sentence. The acoustic model classifies the acoustic features of the speech and maps them to phoneme-like units, and the language model splices the phonemes obtained by the acoustic model into a complete sentence; finally, some text-processing operations on the recognition result yield the final output.

The smart home has developed to a certain extent, but existing smart home human-machine interaction still has some problems. Infrared remote control by means of a remote control or a mobile phone, operated through buttons or a touch screen, requires a third-party mobile device and is not convenient enough; relying on a voice assistant to control home equipment means a single source of input data, fails to make full use of the flexibility of the human body, and cannot solve the problem of receiving ambiguous input. The development of gesture recognition, speech recognition and multimodal technology provides a solution for this.
Summary of the Invention

The purpose of the present invention is to overcome the deficiencies of the prior art and provide a smart home multi-modal human-machine natural interaction system and a method thereof.

The purpose of the invention is achieved through the following technical solution:

A smart home multi-modal human-machine natural interaction system, characterized by comprising a gesture recognition model pre-training module, a speech recognition model pre-training module, a gesture recognition module, a speech recognition module and a multi-modal fusion module. The gesture recognition model pre-training module trains the constructed network model with a gesture data set and saves the trained gesture recognition model; the speech recognition model pre-training module loads a Chinese speech data set, trains an acoustic model and a language model in turn, and saves the trained speech recognition model; the gesture recognition module uses the gesture recognition model saved by the gesture recognition model pre-training module to predict the collected gestures; the speech recognition module calls the speech recognition model saved by the speech recognition model pre-training module to recognize the collected audio; the multi-modal fusion module fuses the results of the two modalities from the gesture recognition module and the speech recognition module to obtain the final instruction.

Further, in the above smart home multi-modal human-machine natural interaction system, the gesture recognition model pre-training module comprises a data set construction module, a data preprocessing module, a model construction module and a model training module. The data set construction module presets five classes of labels (close, open, up, down, nothing), collects the same number of gesture pictures for each, and uses data augmentation to expand the data scale, providing data support for training the gesture recognition model; the data preprocessing module obtains the standardized model input after denoising, skin color segmentation, binarization, morphological processing and contour extraction; the model construction module builds a network model for extracting gesture picture features; the model training module feeds the data set of the data set construction module into the network model of the model construction module in batches, updates the model parameters with the back-propagation algorithm, and saves the trained gesture recognition model.
Further, in the above smart home multi-modal human-machine natural interaction system, the data set construction module uses a camera to collect pictures of the five self-defined instructions and expands the data set by data augmentation (adding salt-and-pepper noise, adding Gaussian noise, reducing picture brightness, increasing picture brightness, rotating at random angles, and flipping), thereby completing the construction of the data set. The data preprocessing module performs denoising, skin color segmentation and binarization, morphological processing and contour extraction. Denoising is achieved with Gaussian filtering: a convolution template scans every pixel in the image and the weighted average gray value of the pixels in its neighbourhood replaces the value of the pixel at the centre. If the two-dimensional template size is m×n, the point (x, y) on the convolution template satisfies:

f(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{(x - m/2)^2 + (y - n/2)^2}{2\sigma^2}}

where σ is the standard deviation of the normal distribution (the smaller its value, the sharper the image), and m and n are the dimensions of the convolution template.

Of the two skin color segmentation methods, the first is based on the adaptive threshold method: first calculate the grayscale histogram and normalize it; then calculate the mean gray level; then calculate the zero-order moment u[i] and the first-order moment v[i] from the histogram; then calculate the maximum between-class variance f[i]; the gray value at which this variance is maximal is the adaptive threshold, with the following formula:

f[i] = \frac{(v[255] \cdot u[i] - v[i])^2}{u[i] (1 - u[i])}

The other is skin color segmentation based on the HSV color space. The SkinMask mode first obtains the gesture bounding box and converts it to HSV space; it then obtains the HSV value of each pixel of the picture, i.e. one two-dimensional matrix is split into three two-dimensional matrices; finally, masks for the H, S and V values are defined according to the skin color range, judgment conditions are set, and pixels outside the skin color range are masked black. After skin color segmentation, the selected image is binarized; the binarization algorithm is calculated with the following formula, where T is the threshold:

g(x, y) = \begin{cases} 255, & f(x, y) \ge T \\ 0, & f(x, y) < T \end{cases}

Morphological processing applies erosion and dilation to the black spots remaining after skin color segmentation, or to the white spots left on the background; dilation is the operation of finding a local maximum, and erosion is the operation of finding a local minimum.

In the method of extracting the gesture contour by skin color, after obtaining the preprocessed image, the pseudo-contours are first removed and the contour with the largest area is located; then each contour's moments of every order, perimeter, area, centroid, shortest and longest diameters, and circumscribed rectangle are calculated; then the outer envelope of each contour and the set of defect points are obtained; then pseudo-contours are removed a second time and the centroid-based feature vector of the contour is calculated; finally, the points of the contour that may be fingers are located in turn.
Further, in the above smart home multi-modal human-machine natural interaction system, the speech recognition model pre-training module comprises a data set loading module, an acoustic model construction module, a language model construction module and a model training module. The data set loading module downloads the Chinese speech data set and specifies the file path; the acoustic model construction module, based on the Keras and TensorFlow frameworks, builds a deep convolutional neural network with reference to VGG and, combined with CTC decoding, merges consecutive identical symbols into the same symbol and then removes the silence separator markers to obtain the actual pinyin symbol sequence; the language model construction module converts the pinyin sequence obtained by the acoustic model construction module into the final text result and outputs it; the model training module feeds the data obtained by the data set loading module into the acoustic model construction module and the language model construction module in turn for training, and saves the trained speech recognition model.

Further, in the above system, the gesture recognition module comprises a gesture acquisition module, a model invocation module and a visualization module. The gesture acquisition module is used to obtain a new single gesture input; the model invocation module calls the model trained by the gesture recognition model pre-training module and takes the gesture collected by the gesture acquisition module as input to obtain the gesture prediction result; the visualization module displays the prediction result in a new window.

Further, in the above system, the speech recognition module comprises a recording module, a model calling module and a text mapping module. The recording module collects audio within a time limit and saves it as a wav file; the model calling module calls the model file saved in the speech recognition model pre-training module and takes the wav file saved by the recording module as the new model input to obtain the result of speech recognized into text; the text mapping module computes the similarity between the text result and the Chinese text corresponding to each label preset in the gesture recognition model pre-training module, and selects the label with the largest similarity value as the instruction result corresponding to the speech recognition.

Further, in the above system, the multi-modal fusion module fuses the results of the two modalities from the gesture recognition module and the speech recognition module and, based on a voting method, selects the class with the highest probability across the two classifiers of gesture recognition and speech recognition to obtain the final instruction.
The smart home multi-modal human-machine natural interaction method of the present invention comprises the following steps:

a) First use OpenCV to acquire gesture pictures, expand the data set by data augmentation, and preprocess the pictures in the data set into standardized input; build the CNN model used by the gesture recognition part, twelve layers in total, and call the Resnet50 model packaged inside keras; train the two network models with the preprocessed data set and save the trained gesture recognition models;

b) Then build the acoustic model, a deep convolutional neural network based on the Keras and TensorFlow frameworks combined with CTC decoding; the language model adopts a bigram model; train the acoustic and language models with the THCHS30 Chinese speech data set and save the trained speech recognition model;

c) Collect the user's current gesture picture and perform, in turn, Gaussian denoising and skin color segmentation in the binary mode based on the adaptive threshold method or the SkinMask mode based on the HSV color space; then binarize to extract the target from the background and noise regions of the image; after erosion and dilation, finally extract the gesture contour by skin color; feed the processed pictures into the CNN and Resnet50 models respectively and obtain the instruction corresponding to the current gesture predicted by the two models;

d) Collect the user's audio and save it as a wav file; frame and window the wav file to obtain a spectrogram; feed the spectrogram into the trained acoustic model and, combined with CTC decoding, obtain the Chinese pinyin sequence; then feed the pinyin sequence into the language model to obtain the text combination corresponding to the pinyin sequence, i.e. the speech recognition result;

e) Compute the similarity between the speech-recognized text and each label in gesture recognition, thereby mapping the speech result onto the gesture labels; then apply weighted voting to the gesture recognition results and the mapped speech recognition result, and take the class with the highest probability as the final instruction.

Still further, in the above smart home multi-modal human-machine natural interaction method, in step a) the data augmentation methods used are adding salt-and-pepper noise, adding Gaussian noise, reducing picture brightness, increasing picture brightness, rotating at random angles, and flipping, used to expand the data set; the pictures in the data set are denoised with Gaussian filtering, then skin color segmentation is performed with the Binary mode based on the adaptive threshold method and the SkinMask mode based on the HSV color space, followed by binarization and the morphological operations of erosion and dilation; finally the gesture contour is extracted by skin color, completing the data preprocessing.
Compared with the prior art, the present invention has significant advantages and beneficial effects, embodied in the following aspects:

① The smart home multi-modal human-machine natural interaction system and method of the invention use human gestures and voice to control home equipment with multiple instructions, overcoming the low accuracy of a single modality, improving instruction accuracy and making human-machine interaction more natural;

② Starting from the way humans perceive, home equipment can accept a variety of instructions and the user can control it in a variety of ways, shedding dependence on traditional buttons and achieving contactless control;

③ Fusing the two modalities of speech recognition and gesture recognition overcomes the limitations that gesture recognition is easily affected by lighting and speech recognition is easily affected by environmental noise; errors between modalities do not stack and do not interfere with each other, and when one modality fails the home equipment can still work.

Other features and advantages of the invention will be set out in the following description, and in part will become apparent from the description or be understood by implementing specific embodiments of the invention. The objects and other advantages of the invention can be realized and obtained by the structures particularly pointed out in the written description and the drawings.
Brief Description of the Drawings

To explain the technical solutions of the embodiments of the invention more clearly, the drawings required by the embodiments are briefly introduced below. It should be understood that the following drawings show only some embodiments of the invention and should therefore not be regarded as limiting the scope; a person of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.

Figure 1: Schematic diagram of the principle of the system of the invention;
Figure 2: Schematic diagram of the architecture of the system of the invention;
Figure 3: Schematic diagram of the architecture of the gesture recognition model pre-training module;
Figure 4a: Schematic diagram of a predefined gesture (open);
Figure 4b: Schematic diagram of a predefined gesture (up);
Figure 4c: Schematic diagram of a predefined gesture (down);
Figure 4d: Schematic diagram of a predefined gesture (close);
Figure 5: Schematic flow diagram of the data preprocessing module;
Figure 6: Schematic diagram of the architecture of the speech recognition model pre-training module;
Figure 7: Schematic diagram of the architecture of the gesture recognition module;
Figure 8: Schematic diagram of the architecture of the speech recognition module;
Figure 9: Schematic diagram of the principle of the multimodal fusion module.
Detailed Description of the Embodiments

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the invention rather than all of them. The components of the embodiments of the invention generally described and shown in the drawings here can be arranged and designed in a variety of different configurations. Therefore, the following detailed description of the embodiments of the invention provided in the drawings is not intended to limit the scope of the claimed invention but merely represents selected embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without creative effort fall within the scope of protection of the invention.

It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it does not need to be further defined and explained in subsequent drawings. In the description of the invention, orientation terms, order terms and the like are used only to distinguish the description and cannot be understood as indicating or implying relative importance.

In view of the limitations of existing contact-based home equipment control methods (for example, wet or stained fingers make adjustment inconvenient), and considering the maturity of gesture recognition and speech recognition technology and the importance of smart home human-machine interaction, the invention is applied to home equipment control in the smart home field, taking the air conditioner as an example; it adopts a contactless approach and uses decision fusion for multi-modal fusion, so the models involved in the fusion do not affect each other and the application requirements are met.
As shown in Figures 1-2, the smart home multi-modal human-machine natural interaction system comprises gesture recognition model pre-training module 1, speech recognition model pre-training module 2, gesture recognition module 3, speech recognition module 4 and multi-modal fusion module 5. Gesture recognition model pre-training module 1 and speech recognition model pre-training module 2 respectively construct the two pre-trained models for gesture recognition and speech recognition; gesture recognition module 3 and speech recognition module 4 call the pre-trained models for on-site collection and prediction; multi-modal fusion module 5 fuses the results of the two modalities according to the weighted voting method.

Gesture recognition model pre-training module 1 comprises data set construction module 101, data preprocessing module 102, model construction module 103 and model training module 104. Data set construction module 101 presets five classes of labels (close, open, up, down, nothing), collects the same number of gesture pictures for each, and uses data augmentation to expand the data scale, providing data support for gesture recognition model training; data preprocessing module 102 obtains the standardized model input after denoising, skin color segmentation, binarization, morphological processing and contour extraction; model construction module 103 builds a network model for extracting gesture picture features; model training module 104 feeds the data set of data set construction module 101 into the network model of model construction module 103 in batches, updates the model parameters with the back-propagation algorithm, and saves the trained gesture recognition model.

The flow of gesture recognition model pre-training module 1 is shown in Figure 3. Data set construction module 101 starts building the gesture data set, using the camera to collect custom gestures, as shown in Figures 4a to 4d: "ok" corresponds to open (Figure 4a), "V" corresponds to up (Figure 4b), "clenched fist" corresponds to down (Figure 4c), and "vertical palm" corresponds to the close instruction (Figure 4d); an additional "nothing" class is defined, i.e. interference pictures that match none of the above four gestures. Data augmentation (adding salt-and-pepper noise, adding Gaussian noise, reducing picture brightness, increasing picture brightness, rotating at random angles, and flipping) is then used to expand the data set, which finally includes 28105 gesture pictures, five gestures in total with 5621 pictures for each gesture, providing data support for model training.
Data preprocessing module 102 preprocesses the data in data set construction module 101 to obtain standardized input. As shown in Figure 5, the data preprocessing includes denoising, skin color segmentation, binarization, morphological processing and contour extraction. Denoising is first achieved with Gaussian filtering, whose specific operation is: scan each pixel in the image with a convolution template and replace the value of the pixel at the centre with the weighted average gray value of the pixels in its neighbourhood. Let the two-dimensional template size be m×n; then the point (x, y) on the convolution template satisfies:

f(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{(x - m/2)^2 + (y - n/2)^2}{2\sigma^2}}

where σ is the standard deviation of the normal distribution (the smaller its value, the sharper the image), and m and n are the dimensions of the convolution template.
Skin color segmentation screens, detects and separates the pixel regions of the image where human skin is located. There are two methods. One is skin color segmentation based on the adaptive threshold method, whose specific operation is: first calculate the grayscale histogram and normalize it; then calculate the mean gray level; then calculate the zero-order moment u[i] and the first-order moment v[i] from the histogram; then calculate the maximum between-class variance f[i]; the gray value at which this variance is maximal is the adaptive threshold, with the following formula:

f[i] = \frac{(v[255] \cdot u[i] - v[i])^2}{u[i] (1 - u[i])}
The other is the SkinMask mode based on the HSV color space. The SkinMask mode first obtains the gesture bounding box and converts it to HSV space; it then obtains the HSV value of each pixel of the picture, i.e. one two-dimensional matrix is split into three two-dimensional matrices; finally, masks for the H, S and V values are defined according to the skin color range, judgment conditions are set, and pixels outside the skin color range are masked black. The model shows that as white is continually added, the parameter V stays the same while the parameter S keeps decreasing; this mode is very effective when light is sufficient. The selected image is then binarized, dividing its pixels into two classes by gray value, calculated with the following formula:

g(x, y) = \begin{cases} 255, & f(x, y) \ge T \\ 0, & f(x, y) < T \end{cases}

The specific method is to set a threshold T in advance and divide the pixels of the image against it: when the gray level of a pixel is less than T it is rendered black, and when it is greater than or equal to T it is rendered white.

Morphological processing involves two operations, erosion and dilation: dilation is the operation of finding a local maximum, and erosion is the operation of finding a local minimum.

In the method of extracting the gesture contour by skin color, after obtaining the preprocessed image, the pseudo-contours are first removed and the contour with the largest area is located; then each contour's moments of every order, perimeter, area, centroid, shortest and longest diameters, and circumscribed rectangle are calculated; then the outer envelope of each contour and the set of defect points are obtained; then pseudo-contours are removed a second time and the centroid-based feature vector of the contour is calculated; finally, the points of the contour that may be fingers are located in turn.
Model construction module 103 then builds the network models for extracting picture features. The CNN model consists of two convolutional layers, one pooling layer, two fully connected layers, two dropout layers for alleviating overfitting, and one flatten layer for connecting the convolutional layers and the fully connected layers, plus four activation functions, twelve layers in total; this CNN model was trained for 15 rounds. In addition, the Resnet50 model packaged by keras is called directly, with 50 network layers and the input size adjusted to 200*200; the preprocessed picture data is used as input for 10 rounds of training. Model training module 104 takes 20% of the data set in data set construction module 101 as the test set, then extracts another 20% as the validation set, finally obtaining a data set of 17987 pictures for training, and saves the two trained models.

In speech recognition model pre-training module 2, as shown in Figure 6, data set loading module 201 downloads and loads the speech data set: THCHS30 contains more than 10,000 Chinese speech files with a total duration of over 30 hours, a sampling frequency of 16kHz and a sample size of 16 bits. Acoustic model construction module 202, in order to obtain the actual pinyin symbol sequence, builds a deep convolutional neural network with reference to VGG, based on the Keras and TensorFlow frameworks. Language model construction module 203 uses a statistical language model to obtain the maximum-probability character corresponding to each pinyin, converting the pinyin obtained by acoustic model construction module 202 into the final recognized text and outputting it. Model training module 204 feeds the data obtained by data set loading module 201 into acoustic model construction module 202 and language model construction module 203 in turn for training, and saves the trained models.

The gesture prediction flow of gesture recognition module 3 is shown in Figure 7. Suppose the gesture captured by the camera in gesture acquisition module 301 is "clenched fist", as shown in Figure 4c. Mask mode is a new capture method: a background image is shot so that the background content can be removed, and the gesture is captured by subtracting the background content from the new frame content of the ROI window. After prediction mode is turned on, the labels used during model training appear for comparison with the gesture captured by the camera; at this point the user may change gesture and pose an "open palm", or keep the "clenched fist" gesture unchanged (here the change to the "open palm" gesture is shown, as in Figure 4d). Model calling module 302 directly calls the model trained by gesture recognition model pre-training module 1; the gesture collected by gesture acquisition module 301 undergoes Gaussian denoising, skin color segmentation and binarization, morphological processing and contour extraction, and serves as model input; from the results of the ResNet50 model and the CNN model, the category with the highest probability is selected and displayed in visualization module 303.

The speech recognition flow of speech recognition module 4 is shown in Figure 8. Recording module 401 collects audio within a time limit (suppose "turn off the air conditioner" is collected at this point) and saves it as a wav file. Model calling module 402 calls the model file saved in speech recognition model pre-training module 2 and takes the wav file saved by recording module 401 as the new model input; after the acoustic model and CTC decoding, the pinyin sequence "guan1 bi4 kong1 tiao2" is obtained, and the language model then yields the corresponding text result "关闭空调" ("turn off the air conditioner"). The cosine similarity between the text result and the text corresponding to the five gesture labels preset in gesture recognition model pre-training module 1 is computed, and the label with the largest similarity value is selected as the result.

As shown in Figure 9, multi-modal fusion module 5 fuses the results of the two modalities from gesture recognition module 3 and speech recognition module 4. Based on the voting method, the Resnet50 result in gesture recognition has weight 0.5, the CNN model result has weight 0.3, and the result mapped from the speech recognition result onto the gesture labels has weight 0.2; the three results are weighted and summed, and the class with the highest probability is selected as the final air-conditioning instruction. Here both the gesture recognition and speech recognition predicted categories are "close", so the final instruction is the close instruction; if different results appear, the three results are multiplied by their respective weights, the probability values of the same label are added, and the category with the highest probability is finally selected as the final instruction.
In summary, the smart home multi-modal human-machine natural interaction system and method of the invention use human gestures and voice to control home equipment with multiple instructions, overcoming the low accuracy of a single modality and making human-machine interaction more natural; starting from the way humans perceive, home equipment can accept a variety of instructions and users can control it in a variety of ways, shedding dependence on traditional buttons and achieving contactless control; fusing the two modalities of speech recognition and gesture recognition overcomes the limitations that gesture recognition is easily affected by lighting and speech recognition is easily affected by environmental noise, errors between modalities do not stack or interfere with each other, and when one modality fails the home equipment can still work; applying multi-modal fusion to the control of home equipment improves the correctness of instructions.

The interaction process uses the fusion of the two modalities of speech recognition and gesture recognition, and the contactless multi-modal fusion method is used for smart home human-machine interaction.

The above are only preferred embodiments of the invention and are not intended to limit it; for those skilled in the art, the invention may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the invention shall be included in the scope of protection of the invention.

The above are only specific embodiments of the invention, but the scope of protection of the invention is not limited to them; any person familiar with this technical field can easily think of changes or substitutions within the technical scope disclosed by the invention, and all of these shall be covered by the scope of protection of the invention.

It should be noted that, herein, relational terms such as first and second are used only to distinguish one entity or operation from another entity or operation and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element.

Claims (9)

  1. A smart home multi-modal human-machine natural interaction system, characterized by comprising a gesture recognition model pre-training module (1), a speech recognition model pre-training module (2), a gesture recognition module (3), a speech recognition module (4) and a multi-modal fusion module (5); the gesture recognition model pre-training module (1) trains the constructed network model with a gesture data set and saves the trained gesture recognition model; the speech recognition model pre-training module (2) loads a Chinese speech data set, trains an acoustic model and a language model in turn, and saves the trained speech recognition model; the gesture recognition module (3) uses the gesture recognition model saved by the gesture recognition model pre-training module (1) to predict the collected gestures; the speech recognition module (4) calls the speech recognition model saved by the speech recognition model pre-training module (2) to recognize the collected audio; the multi-modal fusion module (5) fuses the results of the two modalities from the gesture recognition module (3) and the speech recognition module (4) to obtain the final instruction.
  2. The smart home multi-modal human-machine natural interaction system according to claim 1, characterized in that the gesture recognition model pre-training module (1) comprises a data set construction module (101), a data preprocessing module (102), a model construction module (103) and a model training module (104); the data set construction module (101) presets five classes of labels (close, open, up, down, nothing), collects the same number of gesture pictures for each, and uses data augmentation to expand the data scale, providing data support for training the gesture recognition model; the data preprocessing module (102) obtains the standardized model input after denoising, skin color segmentation, binarization, morphological processing and contour extraction; the model construction module (103) builds a network model for extracting gesture picture features; the model training module (104) feeds the data set of the data set construction module (101) into the network model of the model construction module (103) in batches, updates the model parameters with the back-propagation algorithm, and saves the trained gesture recognition model.
  3. The smart home multi-modal human-machine natural interaction system according to claim 2, characterized in that the data set construction module (101) uses a camera to collect pictures of the five self-defined instructions and expands the data set by data augmentation (adding salt-and-pepper noise, adding Gaussian noise, reducing picture brightness, increasing picture brightness, rotating at random angles, and flipping), thereby completing the construction of the data set; the data preprocessing module (102) performs denoising, skin color segmentation and binarization, morphological processing and contour extraction; denoising is achieved with Gaussian filtering, in which a convolution template scans every pixel in the image and the weighted average gray value of the pixels in its neighbourhood replaces the value of the pixel at the centre; if the two-dimensional template size is m×n, the point (x, y) on the convolution template satisfies:

    f(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{(x - m/2)^2 + (y - n/2)^2}{2\sigma^2}}

    where σ is the standard deviation of the normal distribution (the smaller its value, the sharper the image), and m and n are the dimensions of the convolution template;
    of the two skin color segmentation methods, the first is based on the adaptive threshold method: first calculate the grayscale histogram and normalize it; then calculate the mean gray level; then calculate the zero-order moment u[i] and the first-order moment v[i] from the histogram; then calculate the maximum between-class variance f[i]; the gray value at which this variance is maximal is the adaptive threshold, with the following formula:

    f[i] = \frac{(v[255] \cdot u[i] - v[i])^2}{u[i] (1 - u[i])}

    the other is skin color segmentation based on the HSV color space: the SkinMask mode first obtains the gesture bounding box and converts it to HSV space; it then obtains the HSV value of each pixel of the picture, i.e. one two-dimensional matrix is split into three two-dimensional matrices; finally, masks for the H, S and V values are defined according to the skin color range, judgment conditions are set, and pixels outside the skin color range are masked black; after skin color segmentation, the selected image is binarized, and the binarization algorithm is calculated with the following formula, where T is the threshold:

    g(x, y) = \begin{cases} 255, & f(x, y) \ge T \\ 0, & f(x, y) < T \end{cases}

    morphological processing applies erosion and dilation to the black spots remaining after skin color segmentation, or to the white spots left on the background; dilation is the operation of finding a local maximum, and erosion is the operation of finding a local minimum;
    in the method of extracting the gesture contour by skin color, after obtaining the preprocessed image, the pseudo-contours are first removed and the contour with the largest area is located; then each contour's moments of every order, perimeter, area, centroid, shortest and longest diameters, and circumscribed rectangle are calculated; then the outer envelope of each contour and the set of defect points are obtained; then pseudo-contours are removed a second time and the centroid-based feature vector of the contour is calculated; finally, the points of the contour that may be fingers are located in turn.
  4. The smart home multi-modal human-machine natural interaction system according to claim 1, characterized in that the speech recognition model pre-training module (2) comprises a data set loading module (201), an acoustic model construction module (202), a language model construction module (203) and a model training module (204); the data set loading module (201) downloads the Chinese speech data set and specifies the file path; the acoustic model construction module (202) serves to obtain the actual pinyin symbol sequence; the language model construction module (203) serves to convert the pinyin sequence obtained by the acoustic model construction module (202) into the final text result and output it; the model training module (204) feeds the data obtained by the data set loading module (201) into the acoustic model construction module (202) and the language model construction module (203) in turn for training, and saves the trained speech recognition model.
  5. The smart home multi-modal human-machine natural interaction system according to claim 1, characterized in that the gesture recognition module (3) comprises a gesture acquisition module (301), a model invocation module (302) and a visualization module (303); the gesture acquisition module (301) is used to obtain a new single gesture input; the model invocation module (302) calls the model trained by the gesture recognition model pre-training module (1) and takes the gesture collected by the gesture acquisition module (301) as input to obtain the gesture prediction result; the visualization module (303) displays the prediction result in a new window.
  6. The smart home multi-modal human-machine natural interaction system according to claim 1, characterized in that the speech recognition module (4) comprises a recording module (401), a model calling module (402) and a text mapping module (403); the recording module (401) collects audio within a time limit and saves it as a wav file; the model calling module (402) calls the model file saved in the speech recognition model pre-training module (2) and takes the wav file saved by the recording module (401) as the new model input to obtain the result of speech recognized into text; the text mapping module (403) computes the similarity between the text result and the Chinese text corresponding to each label preset in the gesture recognition model pre-training module (1), and selects the label with the largest similarity value as the instruction result corresponding to the speech recognition.
  7. The smart home multi-modal human-machine natural interaction system according to claim 1, characterized in that the multi-modal fusion module (5) fuses the results of the two modalities from the gesture recognition module (3) and the speech recognition module (4) and, based on a voting method, selects the class with the highest probability across the two classifiers of gesture recognition and speech recognition to obtain the final instruction.
  8. A smart home multi-modal human-machine natural interaction method, characterized by comprising the following steps:
    a) First use OpenCV to acquire gesture pictures, expand the data set by data augmentation, and preprocess the pictures in the data set into standardized input; build the CNN model used by the gesture recognition part, twelve layers in total, and call the Resnet50 model packaged inside keras; train the two network models with the preprocessed data set and save the trained gesture recognition models;
    b) Then build the acoustic model, a deep convolutional neural network based on the Keras and TensorFlow frameworks combined with CTC decoding; the language model adopts a bigram model; train the acoustic and language models with the THCHS30 Chinese speech data set and save the trained speech recognition model;
    c) Collect the user's current gesture picture and perform, in turn, Gaussian denoising and skin color segmentation in the binary mode based on the adaptive threshold method or the SkinMask mode based on the HSV color space; then binarize to extract the target from the background and noise regions of the image; after erosion and dilation, finally extract the gesture contour by skin color; feed the processed pictures into the CNN and Resnet50 models respectively and obtain the instruction corresponding to the current gesture predicted by the two models;
    d) Collect the user's audio and save it as a wav file; frame and window the wav file to obtain a spectrogram; feed the spectrogram into the trained acoustic model and, combined with CTC decoding, obtain the Chinese pinyin sequence; then feed the pinyin sequence into the language model to obtain the text combination corresponding to the pinyin sequence, i.e. the speech recognition result;
    e) Compute the similarity between the speech-recognized text and each label in gesture recognition, thereby mapping the speech result onto the gesture labels; then apply weighted voting to the gesture recognition results and the mapped speech recognition result, and take the class with the highest probability as the final instruction.
  9. The smart home multi-modal human-machine natural interaction method according to claim 8, characterized in that in step a) the data augmentation methods used are adding salt-and-pepper noise, adding Gaussian noise, reducing picture brightness, increasing picture brightness, rotating at random angles, and flipping, used to expand the data set; the pictures in the data set are denoised with Gaussian filtering, then skin color segmentation is performed with the Binary mode based on the adaptive threshold method and the SkinMask mode based on the HSV color space, followed by binarization and the morphological operations of erosion and dilation; finally the gesture contour is extracted by skin color, completing the data preprocessing.
PCT/CN2021/078420 2020-11-25 2021-03-01 Smart home multi-modal human-machine natural interaction system and method therefor WO2022110564A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011339808.4A CN112462940A (zh) 2020-11-25 2020-11-25 Smart home multi-modal human-machine natural interaction system and method therefor
CN202011339808.4 2020-11-25

Publications (1)

Publication Number Publication Date
WO2022110564A1 true WO2022110564A1 (zh) 2022-06-02

Family

ID=74808312

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/078420 WO2022110564A1 (zh) 2020-11-25 2021-03-01 Smart home multi-modal human-machine natural interaction system and method therefor

Country Status (2)

Country Link
CN (1) CN112462940A (zh)
WO (1) WO2022110564A1 (zh)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115329578A (zh) * 2022-08-19 2022-11-11 南京邮电大学 基于多模态融合的三维建模系统及建模方法
CN116012937A (zh) * 2022-12-14 2023-04-25 杭州电子科技大学信息工程学院 一种交警手势识别方法
CN116258655A (zh) * 2022-12-13 2023-06-13 合肥工业大学 基于手势交互的实时图像增强方法及系统
CN116434027A (zh) * 2023-06-12 2023-07-14 深圳星寻科技有限公司 一种基于图像识别人工智能交互系统
CN117316158A (zh) * 2023-11-28 2023-12-29 科大讯飞股份有限公司 一种交互方法、装置、控制设备及存储介质
CN117718969A (zh) * 2024-01-18 2024-03-19 浙江孚宝智能科技有限公司 基于视觉听觉融合的家用机器人控制系统及其方法
CN117807557A (zh) * 2024-01-10 2024-04-02 广州和兴机电科技有限公司 数控机床的多模态交互控制方法及系统
CN117995193A (zh) * 2024-04-02 2024-05-07 山东天意装配式建筑装备研究院有限公司 一种基于自然语言处理的智能机器人语音交互方法
CN117718969B (zh) * 2024-01-18 2024-05-31 浙江孚宝智能科技有限公司 基于视觉听觉融合的家用机器人控制系统及其方法

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113190107B (zh) * 2021-03-16 2023-04-14 青岛小鸟看看科技有限公司 手势识别方法、装置及电子设备
CN113311939A (zh) * 2021-04-01 2021-08-27 江苏理工学院 基于手势识别的智能音箱控制系统
CN113299132A (zh) * 2021-06-08 2021-08-24 上海松鼠课堂人工智能科技有限公司 基于虚拟现实场景的学生演讲技能训练方法及系统
CN113095446B (zh) * 2021-06-09 2021-09-03 中南大学 异常行为样本生成方法及系统
CN113849068B (zh) * 2021-09-28 2024-03-29 中国科学技术大学 一种手势多模态信息融合的理解与交互方法及其系统
CN114610157A (zh) * 2022-03-23 2022-06-10 北京拙河科技有限公司 一种基于手势交互的方法及系统
CN115145402A (zh) * 2022-09-01 2022-10-04 深圳市复米健康科技有限公司 具有网络交互功能的智能玩具系统及控制方法

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100281435A1 (en) * 2009-04-30 2010-11-04 At&T Intellectual Property I, L.P. System and method for multimodal interaction using robust gesture processing
CN102339129A (zh) * 2011-09-19 2012-02-01 北京航空航天大学 一种基于语音和手势的多通道人机交互方法
CN102824092A (zh) * 2012-08-31 2012-12-19 华南理工大学 一种窗帘的智能手势和语音控制系统及其控制方法
CN104965592A (zh) * 2015-07-08 2015-10-07 苏州思必驰信息科技有限公司 基于语音和手势识别的多模态非触摸人机交互方法及系统
CN109814722A (zh) * 2019-02-25 2019-05-28 苏州长风航空电子有限公司 一种多模态人机交互系统及交互方法
CN111709295A (zh) * 2020-05-18 2020-09-25 武汉工程大学 一种基于SSD-MobileNet的实时手势检测和识别方法及系统

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107342076B (zh) * 2017-07-11 2020-09-22 华南理工大学 一种兼容非常态语音的智能家居控制系统及方法
CN109902577A (zh) * 2019-01-25 2019-06-18 华中科技大学 一种轻量级手势检测卷积神经网络模型的构建方法及应用
CN110554774B (zh) * 2019-07-22 2022-11-04 济南大学 一种面向ar的导航式交互范式系统
CN110362210B (zh) * 2019-07-24 2022-10-11 济南大学 虚拟装配中融合眼动跟踪和手势识别的人机交互方法和装置
CN111158491A (zh) * 2019-12-31 2020-05-15 苏州莱孚斯特电子科技有限公司 一种应用于车载hud的手势识别人机交互方法
CN111554279A (zh) * 2020-04-27 2020-08-18 天津大学 一种基于Kinect的多模态人机交互系统

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100281435A1 (en) * 2009-04-30 2010-11-04 At&T Intellectual Property I, L.P. System and method for multimodal interaction using robust gesture processing
CN102339129A (zh) * 2011-09-19 2012-02-01 北京航空航天大学 一种基于语音和手势的多通道人机交互方法
CN102824092A (zh) * 2012-08-31 2012-12-19 华南理工大学 一种窗帘的智能手势和语音控制系统及其控制方法
CN104965592A (zh) * 2015-07-08 2015-10-07 苏州思必驰信息科技有限公司 基于语音和手势识别的多模态非触摸人机交互方法及系统
CN109814722A (zh) * 2019-02-25 2019-05-28 苏州长风航空电子有限公司 一种多模态人机交互系统及交互方法
CN111709295A (zh) * 2020-05-18 2020-09-25 武汉工程大学 一种基于SSD-MobileNet的实时手势检测和识别方法及系统

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115329578A (zh) * 2022-08-19 2022-11-11 南京邮电大学 基于多模态融合的三维建模系统及建模方法
CN116258655A (zh) * 2022-12-13 2023-06-13 合肥工业大学 基于手势交互的实时图像增强方法及系统
CN116258655B (zh) * 2022-12-13 2024-03-12 合肥工业大学 基于手势交互的实时图像增强方法及系统
CN116012937A (zh) * 2022-12-14 2023-04-25 杭州电子科技大学信息工程学院 一种交警手势识别方法
CN116434027A (zh) * 2023-06-12 2023-07-14 深圳星寻科技有限公司 一种基于图像识别人工智能交互系统
CN117316158A (zh) * 2023-11-28 2023-12-29 科大讯飞股份有限公司 一种交互方法、装置、控制设备及存储介质
CN117316158B (zh) * 2023-11-28 2024-04-12 科大讯飞股份有限公司 一种交互方法、装置、控制设备及存储介质
CN117807557A (zh) * 2024-01-10 2024-04-02 广州和兴机电科技有限公司 数控机床的多模态交互控制方法及系统
CN117718969A (zh) * 2024-01-18 2024-03-19 浙江孚宝智能科技有限公司 基于视觉听觉融合的家用机器人控制系统及其方法
CN117718969B (zh) * 2024-01-18 2024-05-31 浙江孚宝智能科技有限公司 基于视觉听觉融合的家用机器人控制系统及其方法
CN117995193A (zh) * 2024-04-02 2024-05-07 山东天意装配式建筑装备研究院有限公司 一种基于自然语言处理的智能机器人语音交互方法

Also Published As

Publication number Publication date
CN112462940A (zh) 2021-03-09

Similar Documents

Publication Publication Date Title
WO2022110564A1 (zh) 智能家居多模态人机自然交互系统及其方法
CN111931701B (zh) 基于人工智能的姿态识别方法、装置、终端和存储介质
CN102831439B (zh) 手势跟踪方法及系统
CN111709310B (zh) 一种基于深度学习的手势跟踪与识别方法
US20210271862A1 (en) Expression recognition method and related apparatus
KR101017936B1 (ko) 사용자의 제스춰 정보 인식을 기반으로 하여 디스플레이장치의 동작을 제어하는 시스템
CN110796018B (zh) 一种基于深度图像和彩色图像的手部运动识别方法
CN108595008B (zh) 基于眼动控制的人机交互方法
Kishore et al. Video audio interface for recognizing gestures of indian sign
CN113033398B (zh) 一种手势识别方法、装置、计算机设备及存储介质
Vishwakarma et al. An efficient interpretation of hand gestures to control smart interactive television
Miah et al. Rotation, Translation and Scale Invariant Sign Word Recognition Using Deep Learning.
Raees et al. Image based recognition of Pakistan sign language
CN109558855B (zh) 一种基于手掌轮廓特征与模版匹配法相结合的空间手势识别方法
CN112001394A (zh) 基于ai视觉下的听写交互方法、系统、装置
CN113608663B (zh) 一种基于深度学习和k-曲率法的指尖跟踪方法
CN111401322A (zh) 进出站识别方法、装置、终端及存储介质
CN114445853A (zh) 一种视觉手势识别系统识别方法
CN114937179A (zh) 垃圾图像分类方法、装置、电子设备及存储介质
Gaikwad et al. Recognition of American sign language using image processing and machine learning
Banerjee et al. A review on artificial intelligence based sign language recognition techniques
CN110147764A (zh) 一种基于机器学习的静态手势识别方法
ul Haq et al. New hand gesture recognition method for mouse operations
Bai et al. Dynamic hand gesture recognition based on depth information
Zheng et al. Review of lip-reading recognition

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21896084

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21896084

Country of ref document: EP

Kind code of ref document: A1