CN105468145A - Robot man-machine interaction method and device based on gesture and voice recognition - Google Patents


Info

Publication number
CN105468145A
Authority
CN
China
Prior art keywords
module
information
gesture
robot
emergency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510795938.1A
Other languages
Chinese (zh)
Other versions
CN105468145B (en)
Inventor
丁希仑
齐静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN201510795938.1A priority Critical patent/CN105468145B/en
Publication of CN105468145A publication Critical patent/CN105468145A/en
Application granted granted Critical
Publication of CN105468145B publication Critical patent/CN105468145B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 — Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 — Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017 — Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G06F3/011 — Arrangements for interaction with the human body, e.g. for user immersion in virtual reality

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Manipulator (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses a robot man-machine interaction method and device based on gesture and voice recognition, belonging to the technical fields of man-machine interaction and robotics. The device comprises a sensor module, a gesture recognition module, a speech recognition module, an information fusion module, an information confirmation module, a robot control module, and an emergency help module. In the method, the information fusion module fuses the gesture recognition result and the speech recognition result, and sends the resulting final fusion result to the information confirmation module or the robot control module to realize emergency help and the execution of robot actions. The invention combines the gesture and voice channels, overcomes the deficiencies of a single channel, and controls the robot better. It can be applied to remote robot control, enabling the robot to work in place of humans in fields such as space stations, hazardous material handling, and public safety, and it also supports face-to-face direct interaction between humans and robots in fields such as medical care and home services.

Description

Robot man-machine interaction method and device based on gesture and speech recognition
Technical field
The invention belongs to the technical fields of man-machine interaction and robotics, and specifically relates to a robot man-machine interaction method and device based on gesture recognition and speech recognition.
Background art
China is becoming an aging society. The elderly need care, while working-age adults must work for a living and have little time to look after them. Robots can take on part of this labor, for example helping the elderly, children, the disabled, and pregnant women. In this process, good man-machine interaction plays an important role.
Depending on the field of research, man-machine interaction usually has three different descriptions (see reference [1]: Arvin Agah, Human interactions with intelligent systems: research taxonomy, Computers and Electrical Engineering, 27, 2001, pp. 71-107). In the broad sense, it refers to interaction between humans and machines (human-machine interaction, HMI); in the computer field, to interaction between humans and computer systems (human-computer interaction, HCI); and in the robotics field, to interaction between humans and robots (human-robot interaction, HRI). The three are interrelated: computers and robots are both special kinds of machines, so HMI can be considered to include HCI and HRI; and since the core of a robot is usually a specific computer system, HRI can use the methods of HCI or be independent of them.
See reference [2]: Gong Jiemin, Wang Xianqing, Progress and development trends of human-computer interaction technology, Journal of Xidian University, 1998, 25(6): 782-786, and reference [3]: Liu Kejun, Thoughts on human-computer interaction and the harmonious human-machine environment, Computer Applications, 2005, 25(10): 2226-2227. Man-machine interaction has passed through an elementary stage and a single-channel stage and is now developing toward the combination of two or more channels. Elementary man-machine interaction usually uses special devices such as simple remote controls to realize the man-machine dialogue; it is generally one-way, with little feedback from machine to person, and usually requires the person to cooperate with the robot to complete a particular task. Single-channel man-machine interaction uses a single-modality technique such as voice, gesture, touch, or eye movement (see reference [4]: Potamianos, G., Neti, C., Luettin, J., et al., Audio-Visual Automatic Speech Recognition: An Overview, Issues in Visual and Audio-Visual Speech Processing, 2004, 356-396; reference [5]: Pavlovic, V.I., Sharma, R., Huang, T.S., Visual Interpretation of Hand Gestures for Human-Computer Interaction: A Review, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997, 19(7): 677-695; reference [6]: Benali-Khoudja, M., Hafez, M., Alexandre, J.M., et al., Tactile Interfaces: A State-of-the-Art Survey, Citeseer, 2004; reference [7]: Grauman, K., Betke, M., Lombardi, J., et al., Communication via Eye Blinks and Eyebrow Raises: Video-Based Human-Computer Interfaces, Universal Access in the Information Society, 2003, 2(4): 359-373). Single-modality interaction benefits the development of natural man-machine interaction, but each channel has its own strengths and weaknesses and certain limitations. To combine strengths, offset weaknesses, and make better use of the information in each channel, the combination of two or more channels has become the trend in man-machine interaction.
For example, voice and gesture are both everyday interaction channels. However, the recognition rate of speech recognition systems such as IBM ViaVoice, the Microsoft Speech SDK, and Carnegie Mellon University's CMU PocketSphinx is affected to some extent by the speaker's dialect, voice, and intonation and by the surrounding environment. Vision-based gesture recognition requires no specialized user training and no wearable equipment; it is intuitive, information-rich, natural, and friendly, meets the requirements of natural man-machine interaction, and is one of the core technologies of multimodal man-machine interaction, but it is easily affected by illumination, complex backgrounds, and partial occlusion. Static gestures are a common way of expressing information, with multiple forms of expression, but gesture segmentation and recognition need relatively high-resolution images; when the hand is far from the camera, the low resolution of the captured hand region degrades the recognition result.
Hexapod leg/arm-combined mobile robots can be used in fields such as space stations, nuclear power plants, highly toxic biochemical plants, hazardous material handling, and riot control for public safety, as well as in home and medical services. They can be remotely controlled, and they also support face-to-face direct interaction. For example, during a rescue, the robot can be operated remotely, or a person at the disaster scene can interact with it face to face; good man-machine interaction helps the robot better assist users in completing rescue tasks.
Hexapod leg/arm-combined mobile robots can also be used in medical and home-service fields, e.g., helping the elderly, children, the disabled, and pregnant women; keeping children company; and fetching medicine or picking up objects from the floor for the elderly, patients, the physically disabled, or pregnant women. When such a person falls, or when a healthy person meets an emergency such as a robbery, he or she can call for help with a gesture or by voice, and the system promptly notifies family members by SMS or MMS so that appropriate measures can be taken in time.
Summary of the invention
The object of the present invention is to provide a robot man-machine interaction method and device based on gesture and speech recognition that combines the gesture and voice channels, overcomes the deficiencies of a single channel, and controls the robot better. The invention can be used to remotely control a robot so that it works in place of humans in fields such as space stations, nuclear power plants, highly toxic biochemical plants, hazardous material handling, and public safety, and it also supports face-to-face direct interaction between humans and robots in fields such as medical care and home services.
The device provided by the invention is fitted with environment-detecting sensors and can monitor the environment: when a fire breaks out or the CH4 or CO concentration exceeds a limit, the system promptly notifies predefined contacts by voice message, SMS, and MMS, where the MMS is a scene photograph taken by the camera when the abnormal condition occurs.
The man-machine interaction referred to in the present invention is the interaction between a person (the user) and a robot (HRI).
The present invention first provides a robot man-machine interaction method based on gesture and speech recognition, comprising the following steps:
Step 1: judge whether there is an interaction partner; if there is, open the interactive mode and go to step 2; if not, the robot opens the detection mode.
Step 2: information input. The gesture recognition module and the speech recognition module acquire information in real time; if information is collected, perform step 3, otherwise perform step 5.
Step 3: the gesture recognition module collects depth images and RGB images through the RGB-D camera, recognizes the predefined gestures, and sends the gesture recognition result to the information fusion module. Meanwhile, the speech recognition module collects audio through the RGB-D camera's built-in microphone, converts it to specific text as the speech recognition result via the speech recognition software, and sends the speech recognition result to the information fusion module. The information fusion module fuses the speech and gesture recognition results at the semantic level to obtain the final fusion result.
Step 4: execute and feed back. Send the corresponding command according to the final fusion result of the information fusion module.
If the final fusion result is control information, send the corresponding control command to the robot control module, which controls the robot's motion. If the final fusion result is emergency information, send it to the information confirmation module, which broadcasts it to the user in speech form and asks whether to execute; upon an affirmative reply, or no response within a set time, the information confirmation module sends the emergency information to the emergency help module; go to step 6.
Step 5: if the gesture recognition module and the speech recognition module receive no input within a set time, and this situation persists for a further period, the information fusion module sends emergency information to the information confirmation module, which makes a prompting voice inquiry according to the current task; go to step 1.
Step 6: emergency help:
When the emergency help module receives the emergency information sent by the information confirmation module, or when the temperature, CH4 concentration, or CO concentration collected by the sensor module exceeds a limit, the emergency help module sends voice, SMS, and MMS help messages to the pre-registered contacts.
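For orientation, the six steps read as a simple loop. The sketch below is purely illustrative pseudostructure in Python: `robot` is a hypothetical facade object, and every method name is an assumption for illustration, not an identifier from the patent.

```python
# The six steps above as a sketch of the top-level loop. `robot` is a
# hypothetical facade; all method names are illustrative assumptions.
def interaction_loop(robot):
    while True:
        if not robot.has_interaction_partner():          # step 1
            robot.run_detection_mode()
            continue
        gesture, voice = robot.collect_inputs()          # step 2
        if gesture is None and voice is None:
            robot.prompt_by_voice()                      # step 5
            continue
        result, kind = robot.fuse(gesture, voice)        # step 3
        if kind == 'control':                            # step 4
            robot.execute(result)
        elif kind == 'emergency' and robot.confirm(result):
            robot.send_emergency_alerts()                # step 6
```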
The present invention also provides a robot man-machine interaction device based on gesture and speech recognition, comprising a sensor module, a gesture recognition module, a speech recognition module, an information fusion module, an information confirmation module, a robot control module, and an emergency help module. The sensor module comprises an RGB-D camera, a temperature-humidity sensor, a CH4 sensor, and a CO sensor. The RGB images and depth images captured by the RGB-D camera are sent to the gesture recognition module or the emergency help module; the RGB-D camera has a built-in microphone, whose collected voice information is sent to the speech recognition module; the information confirmation module broadcasts emergency information to the user through the loudspeaker to obtain the user's confirmation; the temperature-humidity sensor, CH4 sensor, and CO sensor respectively collect the temperature and humidity and the CH4 and CO concentrations in the air, and send the collected data to the emergency help module. The gesture recognition module performs gesture recognition on the RGB and depth images captured by the RGB-D camera, obtains the gesture recognition result, and sends it to the information fusion module. The speech recognition module performs speech recognition on the voice information collected by the microphone, obtains specific text as the speech recognition result, and sends it to the information fusion module. The information fusion module fuses the gesture and speech recognition results at the semantic level and generates a final fusion result: when the final fusion result is control information, the information fusion module sends it to the robot control module, which controls the robot to complete the particular task; when the final fusion result is emergency information, the information fusion module sends it to the information confirmation module, which asks the user through the loudspeaker whether to execute, and upon an affirmative reply or no response within a set time sends the emergency information to the emergency help module. When the emergency help module receives the emergency message transmitted by the information confirmation module, or when the temperature, CH4 concentration, or CO concentration collected by the sensor module exceeds a limit, the emergency help module sends voice, SMS, and MMS help messages to the registered contacts.
The advantages of the invention are:
(1) The robot man-machine interaction method based on gesture and speech recognition of the present invention fuses the gesture and voice channels at the semantic level, overcoming the deficiencies of a single interaction channel and improving the effect of man-machine interaction.
(2) The method can remotely control a robot operating in a hazardous area, and can also be used for close-range interaction with a robot.
(3) The method has a detection mode and an interactive mode, and the interactive mode is subdivided into gesture mode, voice mode, and combined mode. The detection mode monitors the environment automatically: in situations such as fire, gas leak, or natural-gas leakage, it automatically sends SMS, voice messages, or MMS to predefined contacts. The interactive mode is used not only to interact with the robot but also to call specific contacts for help in an emergency. During interaction, the user can select a different mode for a different occasion, and can use a specific command in voice mode to switch to gesture mode or combined mode. For example, if the user wants to talk with someone else and therefore end voice control, switching to gesture only, the user can say the voice command "gesture mode"; thereafter the robot can be controlled only by gestures.
(4) The method has a certain generality: it applies both to teleoperated robots and to robots interacting at close range. It is also portable: according to the specific functions of a specific robot, the user can make suitable predefinitions in the gesture recognition module and the speech recognition module, and then use voice and gestures to complete particular tasks.
(5) Specific gestures and voice commands are predefined as distress signals in the gesture recognition module and the speech recognition module, so that in emergencies such as a home robbery, being unable to get up after a fall, or a situation where the user can only call out, the user can call for help with a special voice command or gesture.
Brief description of the drawings
Fig. 1 is the structural block diagram of the robot man-machine interaction device based on gesture and speech recognition;
Fig. 2 is the structural block diagram of the sensor module;
Fig. 3 is the flowchart of the robot man-machine interaction method based on gesture and speech recognition.
Detailed description
The present invention is described in detail below with reference to the drawings and embodiments.
The invention provides a robot man-machine interaction device based on gesture and speech recognition, as shown in Fig. 1. The device runs on the Groovy release of the Robot Operating System (ROS) installed on Linux Ubuntu 12.04, and comprises a sensor module, a gesture recognition module, a speech recognition module, an information fusion module, an information confirmation module, a robot control module, and an emergency help module. In the present invention, the gesture recognition, speech recognition, information fusion, information confirmation, robot control, and emergency help modules are abstracted as different ROS nodes, so communication between modules is communication between nodes: the modules communicate through ROS topics, and the content of the communication is the messages on those topics.
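As a concrete illustration of this module-as-node pattern, a minimal rospy node for one module might look like the following sketch. The topic name "gesture_result", the use of std_msgs/String for semantic results, and the placeholder label are assumptions; the patent does not specify message types.

```python
#!/usr/bin/env python
# Minimal sketch of one module as a ROS node (Groovy-era rospy API, so
# Publisher is called without queue_size). Names are illustrative assumptions.
import rospy
from std_msgs.msg import String

def gesture_node():
    rospy.init_node('gesture_recognition')
    pub = rospy.Publisher('gesture_result', String)
    rate = rospy.Rate(10)  # poll the recognizer at 10 Hz
    while not rospy.is_shutdown():
        result = 'lift_one_leg'  # placeholder for a real recognition result
        pub.publish(String(data=result))
        rate.sleep()

if __name__ == '__main__':
    try:
        gesture_node()
    except rospy.ROSInterruptException:
        pass
```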
The sensor module comprises an RGB-D camera, a temperature-humidity sensor, a CH4 sensor, and a CO sensor, as shown in Fig. 2. The RGB-D camera captures not only ordinary RGB images but also depth images. In a depth image, the value of each pixel represents the distance between the camera and the object corresponding to that pixel; a depth image can therefore be regarded as a single-channel grayscale image in which each gray value encodes the distance of the corresponding object from the camera. The RGB-D camera is an ASUS Xtion Pro Live. The RGB images and depth images captured by the camera are sent to the gesture recognition module or the emergency help module. The camera has a built-in microphone, whose collected voice information is sent to the speech recognition module. The information confirmation module broadcasts emergency information to the user through the loudspeaker to obtain the user's confirmation. The temperature-humidity sensor, CH4 sensor, and CO sensor respectively measure the temperature and humidity and the CH4 and CO concentrations in the air, and send the collected data to the emergency help module.
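A minimal sketch of reading the per-pixel distance described above, assuming the camera is driven by the usual OpenNI launch stack so the depth stream appears on /camera/depth/image_raw (the topic name, the cv_bridge API used, and the units are assumptions; depending on the driver the values are millimeters as uint16 or meters as float32):

```python
# Sketch: subscribe to the depth stream and log the distance at the image
# center. Topic name and encoding handling are assumptions.
import rospy
from cv_bridge import CvBridge
from sensor_msgs.msg import Image

bridge = CvBridge()

def on_depth(msg):
    depth = bridge.imgmsg_to_cv2(msg, desired_encoding='passthrough')
    h, w = depth.shape[:2]
    rospy.loginfo('distance at image center: %s', depth[h // 2, w // 2])

rospy.init_node('depth_reader')
rospy.Subscriber('/camera/depth/image_raw', Image, on_depth)
rospy.spin()
```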
Gesture recognition module: first the kinds and meanings of the predefined gestures are specified; then gesture recognition is performed on the RGB and depth images captured by the RGB-D camera, and the gesture recognition result is sent to the information fusion module. The recognition may call the OpenNI library or use a custom gesture recognition algorithm. The predefined gestures can be specified as required; several examples are given below.
The following gestures are predefined in the gesture recognition module:
For a hexapod leg/arm-combined mobile robot, the image-based single-hand gestures are defined as follows: one finger tells the robot to lift one leg; two fingers, to lift two legs; three fingers, to lift three legs (if the robot is to walk, to walk with the "3+3" gait); four fingers, to switch between wheeled and legged modes (the initial state defaults to the legged mode); five fingers means the user asks the robot to send distress signals; the digit-6 gesture (only the thumb and little finger extended) tells the robot to put all six legs on the ground; the OK gesture means confirmation such as "good" or "yes" and is the affirmative reply to the information confirmation module; a clenched fist means negation such as "no" and is the negative reply to the information confirmation module.
The video-based gestures are defined as follows: the hand moving up tells the robot to move forward; moving down, to move backward; moving left, to turn left; moving right, to turn right; pushing forward, to stop; drawing a clockwise circle, to rotate clockwise in place; drawing a counterclockwise circle, to rotate counterclockwise in place.
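Written out as lookup tables, the predefined vocabulary above might be encoded as follows (a sketch; every label and command string here is an illustrative assumption, not an identifier from the patent):

```python
# Static (image-based) and dynamic (video-based) gesture vocabularies.
STATIC_GESTURES = {
    'one_finger':    'lift_one_leg',
    'two_fingers':   'lift_two_legs',
    'three_fingers': 'lift_three_legs',   # or walk with the '3+3' gait
    'four_fingers':  'toggle_wheel_leg_mode',
    'five_fingers':  'send_distress_signal',
    'digit_six':     'all_six_legs_down',
    'ok':            'confirm',           # affirmative reply to confirmation
    'fist':          'deny',              # negative reply to confirmation
}

DYNAMIC_GESTURES = {
    'hand_up':      'forward',
    'hand_down':    'backward',
    'hand_left':    'turn_left',
    'hand_right':   'turn_right',
    'push_forward': 'stop',
    'circle_cw':    'rotate_cw_in_place',
    'circle_ccw':   'rotate_ccw_in_place',
}
```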
Speech recognition module: performs speech recognition on the audio collected by the microphone, obtains specific text as the speech recognition result, and sends the result to the information fusion module. The speech recognition may be implemented with Carnegie Mellon University's CMU PocketSphinx software.
The following voice commands are predefined in the speech recognition module:
The voice command "voice mode" puts the robot into voice control mode; "gesture mode" puts it into gesture control mode; "combined mode" puts it into joint gesture-and-voice control mode; "help" tells the robot to send distress signals; "forward" makes the robot move forward; "backward", move backward; "turn left", turn left; "turn right", turn right; "stop", stop; "speed up", increase speed; "slow down", decrease speed. The voice command "yes" is the affirmative reply to the information confirmation module, and "no" is the negative reply.
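As a sketch of the speech channel, the ROS PocketSphinx recognizer conventionally publishes recognized text on /recognizer/output; the node below maps that text to semantic commands. The topic names and the English command strings (the patent's commands are Chinese) are assumptions:

```python
# Map recognized text to semantic commands and republish them.
import rospy
from std_msgs.msg import String

VOICE_COMMANDS = {
    'voice mode': 'set_mode_voice', 'gesture mode': 'set_mode_gesture',
    'combined mode': 'set_mode_combined', 'help': 'send_distress_signal',
    'forward': 'forward', 'backward': 'backward',
    'turn left': 'turn_left', 'turn right': 'turn_right', 'stop': 'stop',
    'speed up': 'accelerate', 'slow down': 'decelerate',
    'yes': 'confirm', 'no': 'deny',
}

def on_speech(msg):
    semantic = VOICE_COMMANDS.get(msg.data.strip().lower())
    if semantic:
        pub.publish(String(data=semantic))

rospy.init_node('speech_mapper')
pub = rospy.Publisher('voice_result', String)
rospy.Subscriber('/recognizer/output', String, on_speech)
rospy.spin()
```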
The information fusion module fuses the gesture and speech recognition results at the semantic level and generates a final fusion result, which it sends to the robot control module or the information confirmation module. Specifically, when the final fusion result is control information, the information fusion module sends it to the robot control module, which controls the robot to complete the particular task. When the final fusion result is emergency information, the information fusion module sends it to the information confirmation module, which asks the user through the loudspeaker whether to execute; upon an affirmative reply, or no response within a set time T (e.g., 3 minutes), the information confirmation module sends the emergency information to the emergency help module.
If only gesture information is input, the information fusion module takes the gesture recognition result as the final fusion result; if only voice information is input, it takes the speech recognition result as the final fusion result. If both voice and gesture information are input and the two results do not conflict, the information fusion module selectively sends the final fusion result to the robot control module or the information confirmation module: when the final fusion result is control information, it goes to the robot control module, which controls the robot's motion; when it is emergency information, it goes to the information confirmation module for confirmation, to avoid misoperation. When the gesture and speech recognition results conflict, both results are sent directly to the information confirmation module, which then asks the user to confirm which operation to perform.
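These routing rules can be stated compactly. The function below is a sketch of the decision-level logic just described; the semantic strings and the (result, destination) return convention are assumptions, and simple equality stands in for the patent's compatibility test between the two channels:

```python
# Decision-level routing of the fused result, per the rules above.
def route(result):
    # Emergency results go through confirmation; everything else is control.
    return 'confirm' if result == 'send_distress_signal' else 'control'

def fuse(gesture, voice):
    """Return (result, destination): 'control', 'confirm', or 'conflict'."""
    if gesture and not voice:
        return gesture, route(gesture)
    if voice and not gesture:
        return voice, route(voice)
    if gesture == voice:                 # stand-in for a compatibility check
        return gesture, route(gesture)
    return (gesture, voice), 'conflict'  # both present and disagreeing
```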
Information fusion can be carried out at three levels: the data level, the feature level, and the decision level. They differ in the type of data fused: the data level fuses the raw collected data directly; the feature level extracts features from the raw data and then processes and analyzes them comprehensively; the decision level starts from the needs of a concrete decision, directly serves the decision objective, and makes full use of the extracted feature information; it is the final result of the three-level fusion and directly affects decision-making.
The information fusion module in the present invention adopts decision-level information fusion, the most widely applied of the three. It is integration at the semantic level: each single-channel interaction is first understood at the semantic level, and then the information from two or more channels is merged. It is generally suitable for loose coupling. Compared with feature-level fusion, decision-level fusion is more reliable and more resistant to interference, and its computational overhead is lower; its disadvantage is that it cannot find correlations between the channels in the underlying data.
The robot control module converts the control information obtained from the information fusion module into the actual motion of the robot, realizing specific robot actions.
The information confirmation module feeds the final fusion result back to the user in speech form (loudspeaker broadcast). If the user replies affirmatively, the prompted task is executed; if the user replies negatively, it is not executed. If the user does not reply within a preset time t (e.g., 2 minutes), the module prompts the user by voice (loudspeaker broadcast) to input gesture or voice information; if the user still does not reply, the prompted task is not executed. The voice prompts of the information confirmation module are synthesized from output text using text-to-speech (TTS) and fed back to the user through the loudspeaker.
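A sketch of this confirm-with-timeout flow, assuming an initialized rospy node, a hypothetical speak() helper standing in for the text-to-speech step, and yes/no replies arriving on a "confirmation" topic (all names are assumptions):

```python
# Confirmation flow: speak a prompt, wait up to t for a reply, re-prompt once.
import rospy
from std_msgs.msg import String

def speak(text):
    rospy.loginfo('TTS: %s', text)  # stand-in for text-to-speech playback

def confirm(prompt, timeout_s=120.0):  # t = 2 minutes, as in the text above
    speak(prompt)
    for attempt in range(2):  # one retry after a voice re-prompt
        try:
            reply = rospy.wait_for_message('confirmation', String,
                                           timeout=timeout_s)
            return reply.data == 'yes'
        except rospy.ROSException:  # no reply within t
            if attempt == 0:
                speak('Please input a gesture or voice command.')
    return False  # still no reply: do not execute the prompted task
```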
Emergency help module: when it receives the emergency message transmitted by the information confirmation module, or when the temperature, CH4 concentration, or CO concentration collected by the sensor module exceeds a limit, it sends voice, SMS, and MMS help messages to the pre-registered contacts. The voice message is synthesized from output text using text-to-speech (TTS); the MMS is a scene photograph taken by the RGB-D camera in the sensor module when the emergency information is received.
The emergency help module obtains emergency information from the information confirmation module as follows: when the user is in a special situation, e.g., a home robbery, being unable to get up after a fall, or an emergency in which making a phone call is impossible, he or she sends distress signals by voice or by a predefined gesture; the final result fused by the information fusion module is confirmed by the information confirmation module and then sent as emergency information to the emergency help module, which starts on receiving it and sends voice messages, SMS, and MMS to the predefined contacts to call for help.
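The MMS-photo step might look like the sketch below: on receiving a help request, grab one RGB frame and save it for the alert. The topic names and file path are assumptions, and the actual SMS/MMS transport is hardware-specific, so it is only marked with a comment:

```python
# Save one scene photograph when an emergency message arrives.
import cv2
import rospy
from cv_bridge import CvBridge
from sensor_msgs.msg import Image
from std_msgs.msg import String

bridge = CvBridge()

def on_emergency(msg):
    # Grab a single RGB frame for the MMS attachment.
    frame = rospy.wait_for_message('/camera/rgb/image_raw', Image, timeout=5.0)
    cv2.imwrite('/tmp/scene.jpg', bridge.imgmsg_to_cv2(frame, 'bgr8'))
    rospy.loginfo('scene photo saved; hand off to the SMS/MMS gateway here')

rospy.init_node('emergency_help')
rospy.Subscriber('emergency', String, on_emergency)
rospy.spin()
```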
The present invention also provides a robot man-machine interaction method based on gesture and speech recognition. The method has an interactive mode and a detection mode, and the interactive mode is subdivided into gesture mode, voice mode, and combined mode. The detection mode monitors the temperature, humidity, CH4 concentration, and CO concentration in the environment; when any of these exceeds a limit, the device sends warning messages to specified contacts, comprising voice messages, SMS, and photographs (scene photographs taken by the camera when the limit is exceeded). The interactive mode is the mode in which the user interacts with the robot by gesture or voice. Within it, if the user interacts with the robot only by gesture, the mode is called gesture mode; it suits occasions where speech recognition is unsuitable, e.g., the user needs to talk with others while controlling the robot, or the ambient noise is high. If the user interacts only by voice, the mode is called voice mode; it suits occasions where gestures are mostly or completely occluded and gesture-based interaction is limited. If the user interacts through both the voice and gesture channels, the mode is called combined mode; it uses the two channels together so that they can complement each other in some situations, which to a certain extent widens the applicable scope of man-machine interaction.
The flow of the man-machine interaction method of the present invention is shown in Fig. 3 and comprises the following steps:
Step 1: judge whether there is an interaction partner; if there is, open the interactive mode. When the robot hears a greeting voice such as "hello" or "good afternoon", or recognizes a certain action such as waving, it judges that there is an interaction partner near the robot and opens the interactive mode. If there is no interaction partner, the robot stays in the detection mode.
Step 2: information input. The gesture recognition module and the speech recognition module acquire information in real time; if information is collected, perform step 3, otherwise perform step 5.
Step 3: the gesture recognition module collects depth images and RGB images through the RGB-D camera and recognizes the predefined gestures through steps such as image segmentation, feature extraction, and trained recognition, sending the recognition result to the information fusion module. Meanwhile, the speech recognition module collects audio through the RGB-D camera's built-in microphone, converts it to specific text (speech-to-text) with the speech recognition software (Carnegie Mellon University's CMU PocketSphinx), and sends the speech recognition result to the information fusion module. The information fusion module fuses the speech and gesture recognition results at the semantic level to obtain the final fusion result.
After receiving the recognition results sent by the gesture recognition module and the speech recognition module, the information fusion module performs semantic-level information fusion and sends the final fusion result to the robot control module or the information confirmation module. When the robot control module receives the control information sent by the information fusion module, it controls the robot's motion to complete the particular task. When the information confirmation module receives an emergency distress signal sent by the information fusion module, it confirms with the user whether to send distress signals.
The man-machine interaction method of the present invention has three sub-modes in the interactive mode: gesture mode, voice mode, and combined mode; correspondingly, the information fusion module also has these three modes.
(a) In gesture mode, the information fusion module takes the gesture recognition result as the final fusion result and sends a message to the robot control module or the information confirmation module according to its kind: if the final fusion result is control information, the message goes to the robot control module, which controls the robot to complete the particular task; if it is emergency information, the message goes to the information confirmation module, which confirms with the user whether to send it. That is, if the user has made a predefined emergency gesture and the information fusion module receives the emergency distress signal sent by the gesture recognition module, then to avoid misoperation it sends a message to the information confirmation module, which plays the current task over the loudspeaker and asks the user whether to execute it. When the user confirms with a specific gesture, such as the "OK" gesture, the information confirmation module sends a message to the emergency help module, which on receiving it sends voice messages, SMS, and MMS to the predefined contacts. The MMS is a scene photograph taken by the RGB-D camera in the sensor module when the emergency help module receives the message sent by the information confirmation module.
(b) In voice mode, the information fusion module takes the speech recognition result as the final fusion result and routes it in the same way: control information goes to the robot control module, which controls the robot to complete the particular task; emergency information goes to the information confirmation module, which confirms with the user whether to send it. That is, if the user issues a predefined emergency voice command and the information fusion module receives the emergency distress signal sent by the speech recognition module, then to avoid misoperation it sends a message to the information confirmation module, which plays the current task over the loudspeaker and asks the user whether to execute it. When the user confirms with a specific voice command, such as "yes", the information confirmation module sends a message to the emergency help module, which on receiving it sends voice messages, SMS, and MMS to the predefined contacts. The MMS is a scene photograph taken by the RGB-D camera in the sensor module when the emergency help module receives the message sent by the information confirmation module.
(c) In combined mode, the information fusion module takes the fused gesture and speech recognition results as the final fusion result and sends a message to the robot control module or the information confirmation module according to its kind. The acquisition times of the gesture and voice channels may differ, so after obtaining information from one channel, the information fusion module waits for input from the other channel, with a timeout limit t (e.g., 10 s). Once all channel information has arrived, it fuses the information, makes the corresponding decision, and either triggers the corresponding control command to drive the robot's motion or sends information to the information confirmation module.
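The cross-channel wait might be sketched as follows (the topic handling and all names are assumptions):

```python
# After one channel's result arrives, wait up to t seconds for the other
# channel before fusing.
import rospy
from std_msgs.msg import String

def await_other_channel(first_result, other_topic, t=10.0):
    try:
        other = rospy.wait_for_message(other_topic, String, timeout=t)
        return first_result, other.data  # both channels available: fuse them
    except rospy.ROSException:
        return first_result, None        # timeout: fuse the single channel
```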
If the decisions for the task conflict, e.g., the voice information and the gesture information contradict each other, the information fusion module sends a message to the information confirmation module, which plays the task names one by one over the loudspeaker for the user to choose and prompts the user to disambiguate. When the information confirmation module receives the user's affirmative feedback, the task is executed. The affirmative feedback can be expressed by voice, e.g., "yes", or by gesture, e.g., the "OK" gesture.
Step 4: execute and feed back. Send the corresponding command according to the final fusion result of the information fusion module. Final fusion results fall into two classes: control information and emergency information. For control information, the corresponding control command is sent and the robot moves. For emergency information, a message is sent to the information confirmation module, which broadcasts it to the user in speech form (loudspeaker) and asks whether to execute; upon an affirmative reply, or no response within the set time T (e.g., 3 minutes), the emergency mode starts; go to step 6. The affirmative reply can be expressed by voice, e.g., "yes", or by gesture, e.g., the "OK" gesture.
Step 5: in the interactive mode, if step 2 gets no response within the set time T (e.g., 3 minutes), that is, the gesture recognition module detects no gesture information and the speech recognition module detects no voice information, and this persists for a further time T (e.g., 3 minutes), the information fusion module sends emergency information to the information confirmation module, which makes a prompting voice inquiry according to the current task to confirm whether there is an interaction partner near the robot. The information confirmation module then queries the user by voice (loudspeaker); if within a further time T (e.g., 3 minutes) step 2 still detects no voice or gesture information, it is considered that there is no interaction partner near the robot; the interactive mode is closed, the detection mode is opened, and the flow returns to step 1.
The detection mode is used to monitor the environment; as long as the robot is powered on, it is always in the detection mode. In this mode, the emergency help module monitors the temperature, CH4 concentration, and CO concentration through the temperature-humidity sensor, CH4 sensor, and CO sensor in the sensor module. When the temperature, CH4 concentration, or CO concentration exceeds a limit, the emergency help module, on detecting the over-limit data, sends voice messages, SMS, and MMS to the specified contacts to call for help; the MMS is a scene photograph taken by the RGB-D camera in the sensor module when the emergency message is sent.
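A sketch of the detection-mode check, assuming the sensor readings are republished as std_msgs/Float32 topics. The threshold values are placeholders; the patent only says the quantities "exceed a certain value":

```python
# Watch each environment quantity and flag over-limit readings.
import rospy
from std_msgs.msg import Float32

LIMITS = {'temperature': 60.0, 'ch4': 1.0, 'co': 0.005}  # hypothetical limits

def make_callback(name):
    def cb(msg):
        if msg.data > LIMITS[name]:
            rospy.logwarn('%s over limit (%.3f): trigger emergency alert',
                          name, msg.data)
    return cb

rospy.init_node('environment_monitor')
for topic in LIMITS:
    rospy.Subscriber(topic, Float32, make_callback(topic))
rospy.spin()
```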
Step 6: emergency help. When the emergency help module receives the information sent by the information confirmation module, or when the temperature, CH4 concentration, or CO concentration collected by the sensor module exceeds a limit, the emergency help module sends voice, SMS, and MMS help messages to the pre-registered contacts. The voice message is synthesized from output text using text-to-speech (TTS); the MMS is a scene photograph taken by the RGB-D camera in the sensor module when the emergency information is received.
Fig. 3 is the man-machine interaction method flowchart for the hexapod leg/arm-combined mobile robot of an embodiment of the invention. The present invention is described in detail below with an embodiment:
Embodiment 1:
The following is the implementation process, in the interactive mode, of the man-machine interaction method based on gesture and speech recognition between a user and a hexapod leg/arm-combined mobile robot:
(1) Judge whether there is an interaction partner near the robot. When the robot detects voice information through the microphone, such as "hello" or "good morning", or detects specific gesture information through the camera, such as waving, it judges that there is an interaction partner nearby, opens the interactive mode, and plays a voice message through the loudspeaker, such as "Hello, what can I do for you?"
(2) In the interactive mode, the user can interact with the robot through the predefined gestures or voice commands. Embodiments of the gesture mode, voice mode, and combined mode under the interactive mode are given below.
(a) Gesture mode. An embodiment in which the user controls the robot only with the predefined gestures is as follows:
When the user extends one finger, the robot lifts the leg with the gripper and at the same time plays a voice prompt over the loudspeaker, such as "Perform a gripping operation?" When the user makes the OK gesture, or gives an affirmative reply by voice such as "yes", the robot control module executes and the robot completes the particular task.
When the user extends two fingers, the robot lifts two legs and moves to four-legged support with the two lifted limbs free for operation, completing the particular task under the robot control module.
When the user extends three fingers, the robot lifts three legs. Then, when the user's hand moves up, the robot moves forward with the "3+3" gait; moving left, it turns left with the "3+3" gait; moving right, it moves right with the "3+3" gait; moving down, it moves backward with the "3+3" gait. When the user extends four fingers, the robot performs the wheel-leg switch, changing from the "legged" mode to the "wheeled" mode. Then, when the user's hand moves up, the robot moves forward in the "wheeled" mode; moving left, it turns left; moving right, it turns right; moving down, it moves backward; and when the user's hand pushes forward, the robot stops.
When the user makes the digit-6 gesture (only the thumb and little finger extended), all six legs of the robot touch the ground. When the user extends four fingers, the robot switches from the "legged" mode to the "wheeled" mode; then, when the user's hand moves up, the robot moves forward in the "wheeled" mode; moving left, it turns left; moving right, it moves right; moving down, it moves backward; pushing forward, it stops. When the user extends four fingers again, the robot switches back from the "wheeled" mode to the "legged" mode.
When the user's hand draws a clockwise circle, the robot rotates clockwise in place; when it draws a counterclockwise circle, the robot rotates counterclockwise in place. When the user's hand pushes forward, the robot stops.
When the user extends five fingers, the robot asks over the loudspeaker whether the user wants to call for help. If the robot receives gesture information representing confirmation (e.g., the OK gesture), or no response within the set time T (e.g., 3 minutes), it sends the voice message and SMS representing the call for help, together with a scene photograph, to the pre-registered contacts. The scene photograph is taken by the RGB-D camera in the sensor module when the robot receives the distress signal.
If there is no gesture or voice input within the set time T (e.g., 3 minutes), the robot prompts the user over the loudspeaker to input gesture or voice information; if there is still no input within a further time T (e.g., 3 minutes), the robot concludes that the interaction partner has left, closes the interactive mode, opens the detection mode, and returns to (1).
(b) Voice mode. An embodiment in which the user controls the robot only with the predefined voice commands is as follows:
When the user says "voice mode", the robot, after recognizing the words, plays a voice message over the loudspeaker, such as "You are now in voice mode." When the user says "forward", the robot starts moving forward; "speed up", it accelerates to a certain speed; "turn left", it turns left; "slow down", it decelerates to a certain speed; "turn right", it turns right; "stop", it stops.
When, during the interaction, the user wants to talk with someone else and end voice control, that is, to control the robot only by gesture, the user can say "gesture mode". After recognizing the words, the robot plays a voice message over the loudspeaker, such as "You are now in gesture mode; in this mode the robot can only be controlled by gestures." In this mode the user can control the robot only by gesture; for a concrete example see (a).
When the user says "help", the robot recognizes the words and the information confirmation module asks over the loudspeaker whether to send emergency information. If the user replies affirmatively, e.g., says "yes", or does not respond within the set time T (e.g., 3 minutes), the robot sends the voice message and SMS representing the call for help, together with a scene photograph, to the pre-registered contacts. The scene photograph is taken by the RGB-D camera in the sensor module when the robot receives the distress signal.
If there is no gesture or voice input within the set time T (e.g., 3 minutes), the robot prompts the user over the loudspeaker to input gesture or voice information; if there is still no input within a further time T (e.g., 3 minutes), the robot concludes that the interaction partner has left, closes the interactive mode, opens the detection mode, and returns to (1).
(c) Combined mode. An embodiment in which the user controls the robot jointly with gestures and voice is as follows:
When the user says "combined mode", the robot, after recognition, plays a voice message over the loudspeaker, such as "You are now in joint gesture-and-voice control mode." In this mode, as soon as the user says a voice command or makes a gesture, the robot completes the corresponding instruction; that is, the user can control the robot with any voice command or gesture to complete a specific action, and voice and gestures can be used interchangeably. The concrete implementation is the same as in (a) and (b), except that the user may answer the robot's inquiries with either gestures or voice.
When a gesture conflicts with a voice command, the robot plays the gesture task name and the voice task name in turn over the loudspeaker, asking after each whether the user wants to execute it. If the user replies affirmatively, that task is executed; the affirmative reply can be expressed by gesture (e.g., the OK gesture) or by voice (e.g., "yes"). If, after a task is played, no answer is obtained within the set time T (e.g., 3 minutes), the robot does not execute that task; if other tasks still need user confirmation, it plays all the tasks in turn over the loudspeaker, asking after each whether the user wants to execute it, cycling until all the tasks have been played and asked about. If there is no gesture or voice input within the set time T (e.g., 3 minutes), the robot prompts the user over the loudspeaker to input gesture or voice information; if there is still no input within a further time T (e.g., 3 minutes), the robot concludes that the interaction partner has left, closes the interactive mode, opens the detection mode, and returns to (1).

Claims (5)

1. A robot man-machine interaction device based on gesture and speech recognition, characterized in that it comprises a sensor module, a gesture recognition module, a speech recognition module, an information fusion module, an information confirmation module, a robot control module, and an emergency help module; the sensor module comprises an RGB-D camera, a temperature-humidity sensor, a CH4 sensor, and a CO sensor; the RGB images and depth images captured by the RGB-D camera are sent to the gesture recognition module or the emergency help module; the RGB-D camera has a built-in microphone, whose collected voice information is sent to the speech recognition module; the information confirmation module broadcasts emergency information to the user through the loudspeaker to obtain the user's confirmation; the temperature-humidity sensor, CH4 sensor, and CO sensor respectively collect the temperature and humidity and the CH4 and CO concentrations in the air, and send the collected data to the emergency help module; the gesture recognition module performs gesture recognition on the RGB and depth images captured by the RGB-D camera, obtains the gesture recognition result, and sends it to the information fusion module; the speech recognition module performs speech recognition on the voice information collected by the microphone, obtains specific text as the speech recognition result, and sends the recognition result to the information fusion module; the information fusion module fuses the gesture and speech recognition results at the semantic level and generates a final fusion result: when the final fusion result is control information, the information fusion module sends it to the robot control module, which controls the robot to complete the particular task; when the final fusion result is emergency information, the information fusion module sends it to the information confirmation module, which asks the user through the loudspeaker whether to execute, and upon an affirmative reply or no response within a set time sends the emergency information to the emergency help module; when the emergency help module receives the emergency message transmitted by the information confirmation module, or when the temperature, CH4 concentration, or CO concentration collected by the sensor module exceeds a limit, the emergency help module sends voice, SMS, and MMS help messages to the registered contacts.
2. The robot man-machine interaction device based on gesture and speech recognition according to claim 1, characterized in that the information fusion module adopts a decision-level information fusion method.
3. The robot man-machine interaction device based on gesture and speech recognition according to claim 1, characterized in that: if only gesture information is input, the information fusion module takes the gesture recognition result as the final fusion result; if only voice information is input, it takes the speech recognition result as the final fusion result; if both voice and gesture information are input and the gesture and speech recognition results do not conflict, the information fusion module selectively sends the final fusion result to the robot control module or the information confirmation module; when the gesture and speech recognition results conflict, both results are sent directly to the information confirmation module, which then confirms with the user which operation to perform.
4., based on a robot man-machine interaction method for gesture and speech recognition, it is characterized in that: specifically comprise the following steps,
The first step, has judged whether interactive object, if there is interactive object, then opens interactive mode, turns second step; If there is no interactive object, robot open detection pattern;
Second step, information inputs: gesture recognition module and sound identification module carry out information acquisition in real time, if the information of collecting, then performs the 3rd step, otherwise performs the 5th step;
3rd step, gesture recognition module, by RGB-D camera collection deep image information and RGB image information, is carried out the identification of Pre-defined gesture, and gesture identification result is sent to information fusion module; Meanwhile, sound identification module gathers audio-frequency information by the built-in Mike of RGB-D camera, is converted to particular text information as voice identification result, and voice identification result is sent to information fusion module by speech recognition software; Voice identification result and gesture identification result are carried out information fusion from semantic layer by information fusion module, obtain final fusion results;
Step 4, execution and feedback: the corresponding command is sent according to the final fusion result of the information fusion module.
If the final fusion result is control information, the corresponding control instruction is sent to the robot control module to control the robot's movement; if the final fusion result is emergency information, the emergency information is sent to the information confirmation module, which broadcasts it to the user in speech form and asks whether it should be performed; upon a positive reply, or no response within a set time, the information confirmation module sends the emergency information to the emergency help module; go to step 6.
Step 5: if the gesture recognition module and the voice recognition module receive no input within a set time, and this situation persists for a certain period, the information fusion module sends emergency information to the information confirmation module, which makes an indicative voice inquiry according to the current task; go to step 1.
Step 6, emergency help:
When the emergency help module receives the emergency information sent by the information confirmation module, or when the temperature, CH4 concentration or CO concentration collected by the sensor module exceeds a preset value, the emergency help module sends a voice, SMS or MMS emergency alert to the pre-registered contact.
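Steps 1 to 5 describe an acquire-fuse-dispatch loop. One plausible shape of that loop is sketched below; the `camera`, `mic`, `recognizer` and `act` interfaces, the return values and the 30-second silence limit are all assumptions, since the claim only speaks of a "set time".

```python
import time

SILENCE_LIMIT_S = 30.0  # the "set time" of step 5; the patent leaves it open

def interaction_loop(camera, mic, recognizer, fuse, act):
    """One pass through steps 1-5 of claim 4 (illustrative only)."""
    if not camera.sees_person():                   # step 1: interactive object?
        return "detection_mode"                    # none -> detection mode
    last_input = time.monotonic()
    while True:                                    # step 2: collect in real time
        rgb, depth = camera.capture()              # step 3: RGB and depth frames
        gesture = recognizer.gesture(rgb, depth)   # predefined gesture set
        voice = recognizer.speech(mic.record())    # speech -> specific text
        if gesture is None and voice is None:
            if time.monotonic() - last_input > SILENCE_LIMIT_S:
                return "prompt_user"               # step 5: indicative inquiry
            continue
        last_input = time.monotonic()
        result, conflict = fuse(gesture, voice)    # semantic-level fusion
        act(result, conflict)                      # step 4: execute and feedback
```

Here `fuse` is the decision-level sketch shown after claim 3, and `act` stands for the execution-and-feedback of step 4, for example the routing sketched after claim 1 (signatures simplified).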
5. The robot man-machine interaction method based on gesture and speech recognition according to claim 4, characterized in that the detection mode is used to monitor the environment; in the detection mode, the emergency help module monitors the temperature and the CH4 and CO concentrations through the temperature and humidity sensor, the CH4 detection sensor and the CO detection sensor of the sensor module; when the temperature, CH4 concentration or CO concentration exceeds a preset value, the emergency help module, on detecting the out-of-limit data, sends a voice, SMS and MMS emergency alert to the designated contact.
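In the same spirit, the detection mode of claim 5 reduces to a polling loop over the environment sensors. A minimal sketch follows; the threshold values and the `sensors`/`helper` interfaces are placeholders, since the claim does not specify numeric limits.

```python
import time

# Placeholder limits; the patent only says readings "exceed a certain value".
THRESHOLDS = {"temperature_C": 60.0, "ch4_ppm": 5000.0, "co_ppm": 50.0}

def detection_mode(sensors, helper, period_s=5.0):
    """Poll temperature, CH4 and CO readings and alert on any excess."""
    while True:
        readings = sensors.read()   # e.g. {"temperature_C": 24.1, "co_ppm": 3.0}
        excess = {name: value for name, value in readings.items()
                  if name in THRESHOLDS and value > THRESHOLDS[name]}
        if excess:
            # Emergency help module: voice, SMS and MMS to the contact
            helper.alert(f"Environment alarm: {excess}")
            return
        time.sleep(period_s)
```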
CN201510795938.1A 2015-11-18 2015-11-18 Robot man-machine interaction method and device based on gesture and speech recognition Active CN105468145B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510795938.1A CN105468145B (en) 2015-11-18 2015-11-18 Robot man-machine interaction method and device based on gesture and speech recognition


Publications (2)

Publication Number Publication Date
CN105468145A true CN105468145A (en) 2016-04-06
CN105468145B CN105468145B (en) 2019-05-28

Family

ID=55605929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510795938.1A Active CN105468145B (en) Robot man-machine interaction method and device based on gesture and speech recognition

Country Status (1)

Country Link
CN (1) CN105468145B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013150076A1 (en) * 2012-04-04 2013-10-10 Aldebaran Robotics Robot capable of incorporating natural dialogues with a user into the behaviour of same, and methods of programming and using said robot
CN103885585A (en) * 2014-02-20 2014-06-25 深圳市贝特尔机电有限公司 Robot manual navigation method based on single-person gestures and voice information
CN103984315A (en) * 2014-05-15 2014-08-13 成都百威讯科技有限责任公司 Domestic multifunctional intelligent robot

Cited By (75)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105955489A (en) * 2016-05-26 2016-09-21 苏州活力旺机器人科技有限公司 Robot gesture identification teaching apparatus and method
CN106056063A (en) * 2016-05-26 2016-10-26 吴昊 Recognition and control system of robot
CN106125925B (en) * 2016-06-20 2019-05-14 华南理工大学 Intelligence based on gesture and voice control arrests method
CN106125925A (en) * 2016-06-20 2016-11-16 华南理工大学 Method is arrested based on gesture and voice-operated intelligence
CN106228982B (en) * 2016-07-27 2019-11-15 华南理工大学 A kind of interactive learning system and exchange method based on education services robot
CN106228982A (en) * 2016-07-27 2016-12-14 华南理工大学 A kind of interactive learning system based on education services robot and exchange method
WO2018023515A1 (en) * 2016-08-04 2018-02-08 易晓阳 Gesture and emotion recognition home control system
WO2018023523A1 (en) * 2016-08-04 2018-02-08 易晓阳 Motion and emotion recognizing home control system
WO2018023514A1 (en) * 2016-08-04 2018-02-08 易晓阳 Home background music control system
CN106200395A (en) * 2016-08-05 2016-12-07 易晓阳 A kind of multidimensional identification appliance control method
CN106019977A (en) * 2016-08-05 2016-10-12 易晓阳 Gesture and emotion recognition home control system
CN106125566A (en) * 2016-08-05 2016-11-16 易晓阳 A kind of household background music control system
CN106125565A (en) * 2016-08-05 2016-11-16 易晓阳 A kind of motion and emotion recognition house control system
CN106200396A (en) * 2016-08-05 2016-12-07 易晓阳 A kind of appliance control method based on Motion Recognition
CN106372574A (en) * 2016-08-22 2017-02-01 湖南晖龙股份有限公司 ROS operation system-based robot object identification method
CN106297338A (en) * 2016-09-14 2017-01-04 深圳市喜悦智慧数据有限公司 A kind of traffic robot's control system and method
CN106648054A (en) * 2016-10-08 2017-05-10 河海大学常州校区 Multi-mode interactive method for RealSense-based accompanying robot
CN106648054B (en) * 2016-10-08 2019-07-16 河海大学常州校区 A kind of Multimodal interaction method of the company robot based on RealSense
CN106426203A (en) * 2016-11-02 2017-02-22 旗瀚科技有限公司 Communication system and method of active trigger robot
CN106527729A (en) * 2016-11-17 2017-03-22 科大讯飞股份有限公司 Non-contact type input method and device
CN106486122A (en) * 2016-12-26 2017-03-08 旗瀚科技有限公司 A kind of intelligent sound interacts robot
CN106681508A (en) * 2016-12-29 2017-05-17 杭州电子科技大学 System for remote robot control based on gestures and implementation method for same
CN106873773A (en) * 2017-01-09 2017-06-20 北京奇虎科技有限公司 Robot interactive control method, server and robot
CN106847279A (en) * 2017-01-10 2017-06-13 西安电子科技大学 Man-machine interaction method based on robot operating system ROS
CN107150347A (en) * 2017-06-08 2017-09-12 华南理工大学 Robot perception and understanding method based on man-machine collaboration
CN107390714A (en) * 2017-06-13 2017-11-24 深圳市易成自动驾驶技术有限公司 Control method, device and the computer-readable recording medium of unmanned plane
WO2019019146A1 (en) * 2017-07-28 2019-01-31 李庆远 Video social interaction system with space robot
CN107688779A (en) * 2017-08-18 2018-02-13 北京航空航天大学 A kind of robot gesture interaction method and apparatus based on RGBD camera depth images
WO2019041301A1 (en) * 2017-09-01 2019-03-07 李庆远 Video-based social interaction system employing robotic device
CN107544271A (en) * 2017-09-18 2018-01-05 广东美的制冷设备有限公司 Terminal control method, device and computer-readable recording medium
CN107544271B (en) * 2017-09-18 2020-08-14 广东美的制冷设备有限公司 Terminal control method, device and computer readable storage medium
US11417331B2 (en) 2017-09-18 2022-08-16 Gd Midea Air-Conditioning Equipment Co., Ltd. Method and device for controlling terminal, and computer readable storage medium
CN108124131A (en) * 2017-12-22 2018-06-05 合肥寰景信息技术有限公司 A kind of monitoring and alarming system of recognizable deliberate action
CN108320456A (en) * 2018-01-27 2018-07-24 西安交通大学 It is a kind of fusion multisensor the elderly fall down prediction technique and system
CN108509026B (en) * 2018-02-06 2020-04-14 西安电子科技大学 Remote maintenance support system and method based on enhanced interaction mode
CN108509026A (en) * 2018-02-06 2018-09-07 西安电子科技大学 Tele-Maintenance Support System and method based on enhancing interactive mode
CN108399427A (en) * 2018-02-09 2018-08-14 华南理工大学 Natural interactive method based on multimodal information fusion
CN108415561A (en) * 2018-02-11 2018-08-17 北京光年无限科技有限公司 Gesture interaction method based on visual human and system
CN108833766A (en) * 2018-03-21 2018-11-16 北京猎户星空科技有限公司 Control method, device, smart machine and the storage medium of smart machine
CN108595012A (en) * 2018-05-10 2018-09-28 北京光年无限科技有限公司 Visual interactive method and system based on visual human
CN108847228A (en) * 2018-05-17 2018-11-20 东莞市华睿电子科技有限公司 A kind of robot for space control method based on double sounding
CN108766435A (en) * 2018-05-17 2018-11-06 东莞市华睿电子科技有限公司 A kind of robot for space control method based on non-touch
CN110675582A (en) * 2018-07-03 2020-01-10 北京蜂盒科技有限公司 Automatic alarm method and device
CN110786859A (en) * 2018-08-03 2020-02-14 格力电器(武汉)有限公司 Emergency alarm method, device and system
CN110857067A (en) * 2018-08-24 2020-03-03 上海汽车集团股份有限公司 Human-vehicle interaction device and human-vehicle interaction method
CN109088802A (en) * 2018-09-13 2018-12-25 天津西青区瑞博生物科技有限公司 A kind of speech recognition household robot based on Android control platform
CN111225028A (en) * 2018-11-27 2020-06-02 阿瓦亚公司 System and method for providing enhanced routing in a contact center
CN111225028B (en) * 2018-11-27 2023-04-18 阿瓦亚公司 System and method for providing enhanced routing in a contact center
CN109410940A (en) * 2018-12-05 2019-03-01 湖北安心智能科技有限公司 A kind of man-machine interaction method and system based on indication control board
CN109676621A (en) * 2019-01-04 2019-04-26 中船第九设计研究院工程有限公司 A kind of man machine language's exchange method based on ROS robot operating system
CN109623848A (en) * 2019-02-26 2019-04-16 江苏艾萨克机器人股份有限公司 A kind of hotel service robot
CN109993945A (en) * 2019-04-04 2019-07-09 清华大学 For gradually freezing the alarm system and alarm method of disease patient monitoring
CN110232922A (en) * 2019-04-29 2019-09-13 北京云迹科技有限公司 Robot voice halts control method and device
CN110228065A (en) * 2019-04-29 2019-09-13 北京云迹科技有限公司 Motion planning and robot control method and device
CN110096154A (en) * 2019-05-08 2019-08-06 北京百度网讯科技有限公司 For handling the method and device of information
CN110202592A (en) * 2019-07-02 2019-09-06 江苏博子岛智能产业技术研究院有限公司 A kind of AI mobile medical service robot
WO2021000657A1 (en) * 2019-07-02 2021-01-07 江苏博子岛智能产业技术研究院有限公司 Ai mobile medical service robot
CN110797017A (en) * 2019-07-30 2020-02-14 深圳市南和移动通信科技股份有限公司 Voice help calling method, intelligent sound box and storage medium
CN112568692A (en) * 2019-09-29 2021-03-30 浙江苏泊尔家电制造有限公司 Control method of cooking appliance, cooking appliance and computer storage medium
CN110827931A (en) * 2020-01-13 2020-02-21 四川大学华西医院 Method and device for managing clinical terms and readable storage medium
CN111324206B (en) * 2020-02-28 2023-07-18 重庆百事得大牛机器人有限公司 System and method for identifying confirmation information based on gesture interaction
CN111324206A (en) * 2020-02-28 2020-06-23 重庆百事得大牛机器人有限公司 Gesture interaction-based confirmation information identification system and method
CN113359538A (en) * 2020-03-05 2021-09-07 东元电机股份有限公司 Voice control robot
CN111382723A (en) * 2020-03-30 2020-07-07 北京云住养科技有限公司 Method, device and system for identifying help
US11654573B2 (en) 2020-05-29 2023-05-23 Tata Consultancy Services Limited Methods and systems for enabling human robot interaction by sharing cognition
WO2022001120A1 (en) * 2020-06-30 2022-01-06 江苏科技大学 Multi-agent system and control method therefor
CN111931605A (en) * 2020-07-23 2020-11-13 中核核电运行管理有限公司 Intelligent monitoring system and method for high-risk operation of nuclear power plant
CN111931605B (en) * 2020-07-23 2024-05-14 中核核电运行管理有限公司 Nuclear power plant high-risk operation intelligent monitoring system and method
CN112099632A (en) * 2020-09-16 2020-12-18 济南大学 Human-robot cooperative interaction method for assistant accompanying
CN112099632B (en) * 2020-09-16 2024-04-05 济南大学 Human-robot cooperative interaction method for helping old accompany
CN112224304A (en) * 2020-10-28 2021-01-15 北京理工大学 Wheel step composite mobile platform and gesture and voice control method thereof
CN112866064A (en) * 2021-01-04 2021-05-28 欧普照明电器(中山)有限公司 Control method, control system and electronic equipment
CN113192503A (en) * 2021-04-28 2021-07-30 深圳市金画王技术有限公司 Intelligent voice control system of water surface lifesaving robot
CN113341820A (en) * 2021-06-16 2021-09-03 江苏纬信工程咨询有限公司 Intelligent construction site safety monitoring device based on Internet of things and monitoring method thereof
CN113848790A (en) * 2021-09-28 2021-12-28 德州学院 Intelligent nursing type robot system and control method thereof

Also Published As

Publication number Publication date
CN105468145B (en) 2019-05-28

Similar Documents

Publication Publication Date Title
CN105468145A (en) Robot man-machine interaction method and device based on gesture and voice recognition
EP3392181B1 (en) A holographic elevator assistance system
CN106406119B (en) Service robot based on interactive voice, cloud and integrated intelligent Household monitor
Yoshimi et al. Development of a person following robot with vision based target detection
TWI469910B (en) Control method and device of a simple node transportation system
Gross et al. Living with a mobile companion robot in your own apartment-final implementation and results of a 20-weeks field study with 20 seniors
US11580385B2 (en) Artificial intelligence apparatus for cleaning in consideration of user's action and method for the same
US20200008639A1 (en) Artificial intelligence monitoring device and method of operating the same
US11307593B2 (en) Artificial intelligence device for guiding arrangement location of air cleaning device and operating method thereof
US20190392819A1 (en) Artificial intelligence device for providing voice recognition service and method of operating the same
CN109093633A (en) A kind of detachable robot and its control method
US11607801B2 (en) Artificial intelligence robot for managing movement of object using artificial intelligence and method of operating the same
KR20190104490A (en) Artificial intelligence apparatus and method for recognizing utterance voice of user
US11653805B2 (en) Robot cleaner for performing cleaning using artificial intelligence and method of operating the same
KR20160072621A (en) Artificial intelligence robot service system
US11748614B2 (en) Artificial intelligence device for controlling external device
CN208930273U (en) A kind of detachable robot
KR20190084912A (en) Artificial intelligence device that can be controlled according to user action
CN111300429A (en) Robot control system, method and readable storage medium
CN102895093A (en) Walker aid robot tracking system and walker aid robot tracking method based on RGB-D (red, green and blue-depth) sensor
CN111098307A (en) Intelligent patrol robot
KR20210020312A (en) Robot and method for controlling same
CN116572260A (en) Emotion communication accompanying and nursing robot system based on artificial intelligence generated content
CN115136174A (en) Augmented reality customer identification system and method
US20210401255A1 (en) Artificial intelligence robot and method of operating the same

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant