CN117162118A - Multi-mode robot man-machine co-fusion method - Google Patents


Info

Publication number
CN117162118A
Authority
CN
China
Prior art keywords
medical staff
robot
voice
information
target
Prior art date
Legal status
Pending
Application number
CN202311114838.9A
Other languages
Chinese (zh)
Inventor
陈国军
王宇
陈巍
郭铁铮
董宏伟
刘金辉
Current Assignee
Nanjing Institute of Technology
Original Assignee
Nanjing Institute of Technology
Priority date
Filing date
Publication date
Application filed by Nanjing Institute of Technology
Priority to CN202311114838.9A
Publication of CN117162118A
Legal status: Pending


Abstract

The invention provides a multi-modal robot man-machine co-fusion method that uses multi-modal information such as voice, facial information and limb actions to judge whether an interaction is intended, and judges whether to interact with medical staff in two ways. The invention further provides a man-machine co-fusion method for an intelligent drug delivery robot, comprising a framework for the intelligent drug delivery robot, a far-end speech recognition technique, a visual perception system workflow, a visual-servo-based tracking method and a Dialogflow-based human-robot interaction method for the intelligent drug delivery robot.

Description

Multi-mode robot man-machine co-fusion method
Technical Field
The invention belongs to the field of intelligent robots, and particularly relates to a multi-mode robot man-machine co-fusion method.
Background
In recent years, with the rapid development of the robot industry, home entertainment, medical health and aerospace science and technology will gradually replace manufacturing as the pillars of the robot industry. As society ages, enterprises face labour shortages, so service robots are gradually entering human production and daily life and playing an increasingly important role. In the environment of Industry 4.0, robot technology has been widely applied in automated workshops, and with the development of Internet-of-Things technology the relationship between humans and robots is no longer one of isolation but has become one of coexistence.
The level of man-machine co-fusion reflects the degree of robot intelligence. Intelligent robots exhibit the autonomy of a single robot, the collaboration of multiple robots and, most importantly, man-machine co-fusion, which is a key characteristic of intelligent robots. In other words, co-fusion is a state in which humans and robots occupy the same space, cooperate safely and communicate naturally. A robot in this state is no longer merely a tool; it can become a friend or partner that people communicate with.
At present, most medicine delivery robots are single-modality, that is, they interact with people in only one way, by voice broadcast or a display screen (as described in patent CN202220917638.1 and patent CN202010189951.3). The degree of interaction is low, the level of intelligence is low, effective communication with dispensing personnel is difficult, and working efficiency is hard to improve. In addition, in hospitals, interaction by touch screen increases the possibility of virus infection. Therefore, a multi-modal interaction mode is needed to raise the intelligence level of the robot and improve the safety of intelligent medicine delivery.
Noun interpretation:
man-machine co-fusion: man-machine co-fusion refers to highly interactive collaboration between humans and computer systems. For a service robot, man-machine co-fusion means that the robot can interact with humans as naturally as a human would while meeting human needs.
multi-modal: defined relative to a single modality, it refers to information from multiple modalities; a combination of two or more modalities may include text, image, video, audio and other types of information.
Disclosure of Invention
The invention aims to overcome the defects in the prior art, and provides a multi-mode robot man-machine co-fusion method which can realize a multi-mode interaction mode so as to improve the intelligent level of a robot and improve the safety of intelligent medicine delivery.
In order to achieve the above purpose, the invention is realized by adopting the following technical scheme:
in a first aspect, the invention provides a multi-modal intelligent drug delivery robot man-machine co-fusion method, which comprises the following steps:
performing multi-mode sensing and judging whether interaction is performed or not;
after judging to interact, executing an instruction approaching to medical staff;
after executing the instruction to approach the healthcare worker, voice interaction and task allocation are performed.
Further, performing multi-modal sensing, determining whether to perform interaction, includes:
after the special wake-up word is identified by the far-end voice recognition system, the target interaction object is confirmed;
or when the face information of the medical staff is identified, determining to interact with the medical staff after finding that the angles of eyes of the medical staff, which watch the camera, reach the threshold value and the gesture information of the medical staff is identified.
Further, the workflow of the far-end speech recognition system includes:
voice input, namely the microphone voice array sensors of the 3 Kinects on the robot, each Kinect sensor being provided with four microphones to acquire voice information; the voice information is used for detecting the robot's wake-up words and also for detecting content in the Dialogflow-based HRI voice chat framework;
source voice signal preprocessing, namely, to cope with the problems of noise, reverberation and long distance, performing delay processing on the voice signal the intelligent robot acquires in the hospital, filtering out irrelevant surrounding information and noise, and converting the voice signal (a minimal sketch of such delay-and-sum processing follows this list);
feature extraction, namely extracting voice signals containing key words in the hospital environment where the robot is located;
the acoustic model, the language model and the dictionary form a recognition network; the acoustic model compares the recognized voice signal with the target voice signal, and the language model calculates the probability that a sentence occurs and contains a set of words and phonemes such as those for medicine delivery in a hospital;
voice decoding, namely integrating the trained acoustic model, language model and dictionary into a recognition system and searching out the most reasonable path;
text output, i.e., translating speech decoded information into text for reading.
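As a minimal illustrative sketch of the delay processing mentioned in the preprocessing step above (not part of the claimed method), delay-and-sum beamforming over the four microphones of one Kinect array could be written as follows, assuming synchronized channels and known integer steering delays:

import numpy as np

def delay_and_sum(channels, steering_delays):
    # channels: (4, N) array of synchronized microphone signals from one Kinect array
    # steering_delays: integer delays (in samples) that align the speaker's wavefront
    out = np.zeros(channels.shape[1])
    for ch, d in zip(channels, steering_delays):
        out += np.roll(ch, -d)      # advance each channel by its steering delay (circular shift)
    return out / channels.shape[0]  # averaging suppresses uncorrelated noise and reverberation

Aligning the channels toward the speaker before summing reinforces the speech relative to noise arriving from other directions.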
Further, when face information of the medical staff is identified, determining to interact with the medical staff after finding that angles of eyes of the medical staff, which are focused on the camera, reach the threshold value and gesture information of the medical staff is identified, including:
gesture recognition, namely acquiring portrait photos of medical staff through the robot; based on the staff and the hospital setting, the two RealSense D435i cameras adopted by the robot acquire visual information around the hospital at 30 frames per second, gesture information contained in the joints of the human body is recognized, and the Kinetics-400 data set is used as the data set for visual recognition;
feature extraction, namely using the open source library OpenPose to obtain the positions of key points in the image space of the captured picture, that is, obtaining the pose estimation of the medical staff's actions;
feature matching, namely first setting the gesture with which medical staff wake up the robot, generally raising a hand above the head and shoulder; the gesture is converted into a mathematical model, and the tested mathematical model is compared with the preset gesture mathematical model to obtain the best gesture recognition result;
a YOLOv3 multi-person detector, each medical staff individual being detected using the YOLOv3 detector;
head pose detection, namely then using the OpenCV library for face detection and using the specific angle of the head as a reference standard for eye recognition, so as to improve the accuracy of recognizing whether the eyes look straight at the camera;
head pose estimation, namely judging whether the head of the medical staff faces the camera by identifying the yaw angle, pitch angle and flip angle of the head of the medical staff in the acquired image, and observing whether the medical staff looks at the camera through the horizontal and vertical angles of the line of sight.
Further, when face information of the medical staff is identified, after finding that angles of eyes of the medical staff, which are focused on the camera, reach the threshold value, and after gesture information of the medical staff is identified, interaction with the medical staff is determined, and the method further comprises:
Through the pose estimation of the recognized face, face detection is used to detect multiple individuals, the OpenCV library being utilized for face detection. First, the yaw angle, pitch angle and flip angle of the head of the medical staff in the picture are obtained by recognition to judge whether the head faces the camera, and whether the medical staff looks straight at the camera is observed through the horizontal and vertical angles of the line of sight; that is, the yaw angle (α), pitch angle (β) and flip angle (γ) represent the angles of the head pose, and the judgment formula is as follows:

f(α, β, γ) = 1 if |α| ≤ α′, |β| ≤ β′ and |γ| ≤ γ′, and f(α, β, γ) = 0 otherwise   (1)

wherein α′, β′, γ′ are the thresholds of the yaw angle, pitch angle and flip angle, and a value of 1 for f() indicates that the head faces the camera while 0 indicates that it does not;
to improve the reliability of recognizing whether the eyes look straight at the camera using the specific angle of the head, a gaze tracker is adopted to receive the image input from the face detector; the gaze tracker consists of a convolutional neural network (CNN). Head pose recognition and eye gaze recognition form two groups of recognition used to judge whether the medical staff is gazing at the drug delivery robot; if both the head pose recognition result and the eye front-view result are judged to be true, an intention of the medical staff to interact may be indicated:

g(θ_α, θ_β) = 1 if θ_α ≤ θ′_α and θ_β ≤ θ′_β, and g(θ_α, θ_β) = 0 otherwise   (2)

wherein θ_α, θ_β are the horizontal and vertical gaze angles, θ′_α, θ′_β are their upper bounds, and a value of 1 for g() indicates that the eyes look straight at the camera.
Further, when the medical staff merely looks at the robot, it cannot be fully determined that the medical staff intends to interact with the robot; only when the robot also recognizes the medical staff's gesture is it judged that the medical staff and the robot will interact.
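As a hedged illustration of the combined judgment described above (the function names, threshold handling and absolute-value convention are assumptions made for this example, not values fixed by the invention), the head-pose, gaze and gesture decision can be sketched as:

def head_facing_camera(alpha, beta, gamma, alpha_t, beta_t, gamma_t):
    # f(): 1 if all head-pose angles are within their thresholds, else 0
    return int(abs(alpha) <= alpha_t and abs(beta) <= beta_t and abs(gamma) <= gamma_t)

def eyes_on_camera(theta_h, theta_v, theta_h_max, theta_v_max):
    # g(): 1 if the horizontal and vertical gaze angles stay below their upper bounds
    return int(abs(theta_h) <= theta_h_max and abs(theta_v) <= theta_v_max)

def interaction_intended(pose_args, gaze_args, gesture_detected):
    # interaction intent only when head pose, gaze and the wake gesture all agree
    return bool(head_facing_camera(*pose_args) and eyes_on_camera(*gaze_args) and gesture_detected)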
Further, after the interaction is judged, executing an instruction for approaching the medical staff, including:
whether the medical staff captured in the images acquired by the robot is the tracked object is determined by YOLOv4;
after calculating the parameter model of the camera and the 3-dimensional coordinate points of the identified tracked medical staff, the difference between the expected target pose and the current target pose is compared and the current virtual camera speed is calculated;
and updating the current pose state, comparing the difference value of the expected pose with the target pose again, and if the difference value is zero, stopping tracking by the robot.
Further, after calculating the parameter model of the camera and the 3-dimensional coordinate points of the identified tracked medical staff, comparing the difference between the expected target pose and the current target pose and calculating the current virtual camera speed includes:
extracting the features of the tracked object, namely mapping the image feature information of the tracked medical staff and the feature information of the target image into the same space for comparison, the image feature information of the target being preset in advance;
and calculating whether the difference between the current feature and the expected feature is zero, namely comparing the matrix feature information in the image with the preset target state feature information and calculating the desired error between them.
Further, updating the current pose state, and comparing the difference between the expected pose and the target pose again, if the difference is zero, the robot will stop tracking, including:
Calculating a target state and a current state matrix to obtain a matrix error between the target state and the current state matrix, and representing the matrix error as a desired error;
First, matrices E_e, E_p and E_d are established to represent, respectively, the point-to-line relationships of the edge features, key point features and depth features in the image sequence, and interaction matrices L_e, L_p and L_d of the edge feature, key point feature and depth feature representations in 3D space are established respectively to estimate the pose of the target medical staff.
The pose of the target medical staff is calculated by a method based on virtual visual servoing, which estimates the object pose of the target medical staff by minimizing the error Δ between the expected state s* and the current state s. The error, defined below, is driven to decrease exponentially (ė = -λe):
e = (s(r) - s*)   (3)
where r is the pose estimate and λ is a positive scalar.
The interaction matrix is then used to relate the error change to the virtual camera speed v as follows:
ė = L_s v   (4)
where L_s is an interaction matrix that depends on the image features s and the corresponding depth values z in the scene. From equations (3) and (4), by comparing the features of the expected state s* and the current state s at each iteration, the virtual camera speed v is obtained:
v = -λ L_s^+ (s(r) - s*)   (6)
where L_s^+ is the pseudo-inverse of L_s ∈ R^(2n×6), v ∈ R^(6×1), s(r) - s* ∈ R^(2n×1), and n represents the number of feature points.
Thus, the pose of the kth iteration is updated using the camera speed obtained above. Δt represents the interval between the kth and (k+1)th iterations, during which the six-dimensional velocity is transformed into a four-dimensional homogeneous matrix by the Λ operation and an exponential mapping is applied to update the pose.
Updating the target matrix, and updating the pose state of the current medical staff again until the expected error is 0;
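A minimal sketch of one virtual visual servoing iteration corresponding to equation (6) is given below; it assumes the interaction matrix L_s and the feature vectors are already available and is illustrative rather than a complete tracker:

import numpy as np

def vvs_step(s, s_star, L_s, lam=0.5):
    # s, s_star: current and desired feature vectors of shape (2n,)
    # L_s: interaction matrix of shape (2n, 6); lam: positive gain (lambda)
    error = s - s_star
    v = -lam * np.linalg.pinv(L_s) @ error   # v = -lambda * pinv(L_s) @ (s(r) - s*)
    return v, np.linalg.norm(error)

The returned speed v is integrated over Δt to update the pose estimate; iteration stops once the feature error, and hence v, has decayed to zero.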
further, after executing the instruction for approaching the medical staff, performing voice interaction and task allocation, including:
after testing the voice dialogue function of Dialogflow without problems, using the generated JSON file to communicate with the central processing system of the robot;
further, after testing the voice dialogue function of Dialogflow without problem, the process of communicating with the central processing system of the robot by using the generated JSON file includes:
constructing a Dialogflow agent, namely first creating a cloud account and, within that account, creating the platform on which voice dialogue training is carried out;
creating intents, namely, according to the hospital scene, the robot sets topics such as calling the robot, scheduled time, scheduled drug delivery tasks and room number information as the topics of voice interaction;
Creating speaking content commonly used by medical staff as input voice content and setting the content of the response of the robot;
adding keyword entities, the keyword entity being used to store the value of a keyword acquired by the robot, used for voice recognition under the Dialogflow framework and also for answering medical staff;
integration into the robot platform, namely the process that enables Dialogflow to establish communication with the computer system of the robot.
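Communication with a Dialogflow agent from the robot side can be sketched with the google-cloud-dialogflow Python client roughly as follows; the project ID, session handling and language code are placeholders for illustration and must match the agent actually created:

from google.cloud import dialogflow

def detect_intent_text(project_id, session_id, text, language_code="zh-CN"):
    client = dialogflow.SessionsClient()
    session = client.session_path(project_id, session_id)
    text_input = dialogflow.TextInput(text=text, language_code=language_code)
    query_input = dialogflow.QueryInput(text=text_input)
    response = client.detect_intent(request={"session": session, "query_input": query_input})
    result = response.query_result
    # result.parameters holds the keyword-entity values captured from the utterance
    return result.intent.display_name, result.parameters, result.fulfillment_text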
Compared with the prior art, the invention has the beneficial effects that:
1. The invention uses multi-modal information such as voice, facial information and limb actions to judge whether interaction is intended, and judges whether to interact with medical staff in two ways: when the distance is large, a far-end voice recognition system recognizes the interaction information; when the distance is small, the robot recognizes the medical staff's limb information in video to judge whether interaction is needed, and the task is then acquired by recognizing the audio content. A multi-modal interaction mode is thereby realized, raising the intelligence level of the robot and the safety of intelligent medicine delivery.
2. The invention adds a voice dialogue function. Most existing drug delivery robots have a low degree of intelligence and cannot execute drug delivery tasks effectively and efficiently through a single-modality interaction mode alone: such a robot can only broadcast voice or perform simple interaction through an interactive display screen. Existing robots therefore suffer from a low intelligence level and weak interaction capability and are insufficient to cope with the complex working environment of a hospital.
3. The invention constructs a man-machine co-fusion method for a multi-modal robot and, to solve the weak interaction of single-modality robots, provides an interaction mode based on voice dialogue and visual recognition that is closer to human thinking, improving the intelligence level and safety of the intelligent robot.
Drawings
FIG. 1 is a diagram of a multi-modal fused intelligent dispensing robot co-fusion frame;
FIG. 2 is a flow chart of a multi-mode intelligent robot-to-human co-fusion system;
FIG. 3 is a visual perception system workflow diagram;
FIG. 4 is a flowchart of the operation of the far-end speech recognition system;
FIG. 5 is a tracking method based on visual servoing;
FIG. 6 is a flow chart of Dialogflow.
In the figures: 1. medical staff; 2. robot.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
Embodiment one:
FIG. 1 is a diagram of the overall multi-modal fusion framework. The application scene comprises medical staff 1 and a robot 2 (an intelligent drug delivery robot is taken as an example). The intelligent medicine delivery robot 2 can navigate autonomously, has man-machine interaction capability and is mainly used for medicine delivery service in hospital wards. For example, the healthcare worker 1 gives the medicine to the intelligent medicine delivery robot and initiates the delivery service through the robot. Specifically, the intelligent medicine delivery robot 2 interacts with the user through the far-end voice recognition system, the visual perception system and the Dialogflow human-machine interaction module 25: the far-end voice system collects the voice information of the medical staff 1, or the visual perception system collects the user's limb actions and facial information; the intelligent medicine delivery robot 2 then approaches the medical staff via the visual servo module and holds a voice dialogue with the user through Dialogflow to obtain specific task instructions, which are converted into a language the intelligent medicine delivery robot can execute so as to carry out the corresponding functions.
For example, the intelligent medicine delivery robot 2 may acquire the voice information of the medical staff 1 through the far-end voice recognition module 22. Under conditions of noise, reverberation, distance limitation and so on, the intelligent medicine delivery robot 2 needs to recognize the correct keywords and voice events to decide whether to interact, receive the delivery task instruction through Dialogflow, and execute the medicine delivery task to a specific ward. Specifically, the present application employs distant speech recognition (DSR) based on 3 Kinect microphone arrays distributed in space (K1, K2, K3) for noise and reverberation filtering, and employs delay-and-sum beamforming with the 4 microphones of each Kinect to improve recognition in noisy and reverberant environments.
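SRP-PHAT localization is built from phase-transform-weighted generalized cross-correlations between microphone pairs. The following GCC-PHAT sketch estimates the time difference of arrival for one pair and is given only to illustrate the underlying idea, not as the localizer actually deployed:

import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    # PHAT-weighted cross-correlation between two microphone signals
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-15), n=n)   # phase transform weighting
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(fs)                          # TDOA in seconds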
In addition, under the condition of close distance, the visual perception system can acquire the portrait information to identify the skeleton frame of the human body so as to acquire gesture information and make judgment, so as to acquire the intention of medical staff and whether interaction is needed; in addition, the face information of the person is also identified through the visual identification system, and whether interaction with medical staff is needed or not is judged by identifying the angles of the eyes of the face of the person, namely whether the eyes face the camera or not.
In addition, the robot can acquire voice information, limb action information and facial information through a remote voice recognition module, a gesture recognition module and a facial recognition module in a visual perception system, and can make a judgment on whether interaction is performed or not when any one of the voice information, the limb action information and the facial information is acquired.
Fig. 2 is a workflow diagram of the multi-modal man-machine co-fusion robot, used to explain the specific workflow of the intelligent drug delivery robot.
The multi-modal man-machine co-fusion workflow in fig. 2 mainly comprises a multi-modal sensing module 21, a far-end voice recognition system 22, a visual perception system 23, a visual servo system 24 and a Dialogflow human-machine interaction module 25; the robot to which the application is adapted at least comprises the multi-modal sensing module 21, the far-end voice recognition system 22, the visual perception system 23, the visual servo system 24 and the Dialogflow human-machine interaction module.
Wherein the multi-mode sensing module 21 comprises a remote voice recognition system 22 and a visual sensing module 23;
the far-end voice recognition system 22 can recognize a far-end sound, and the uttered interesting words of the medical staff 1 attract the attention of the intelligent medicine delivery robot 1 and gradually approach the medical staff 1.
The visual perception system 23 can determine whether to interact with the intelligent drug delivery robot 2 by identifying gesture information of a medical staff and facial information of the medical staff.
After the distance between the intelligent medicine delivery robot 2 and the medical staff 1 is confirmed through a vision module (not illustrated in the figure), the visual servo system 24 computes the camera speed through the computer system and keeps approaching the medical staff until the speed is 0.
The Dialogflow human-computer interaction 25, which is used as a dialogue framework between the intelligent medicine delivery robot and the medical staff, is configured as a tool of voice dialogue, can realize a functional dialogue framework, and performs the medicine delivery function of the intelligent robot. Wherein the intelligent navigation section is not described in detail.
Specifically, the application scenario of the intelligent medicine delivery robot 2 is mainly a hospital. The far-end voice recognition system is always in a working state and can recognize the wake-up information of medical staff in real time; the visual perception system is likewise in a working state waiting for target information, so the intelligent medicine delivery robot can recognize any one of the multi-modal inputs.
Specifically, to cope with problems of noise, reverberation, long distance and so on, the far-end speech recognition system uses a microphone array of 3 Kinect modules as the speech recognition module, and the voice information is processed using delay-and-sum beamforming techniques.
Specifically, the visual perception system 23 is mainly responsible for collecting the facial information and portrait information of the medical staff and judging whether interaction with the medical staff is required. First, a large data set of recognizable medical staff is produced, which improves recognition accuracy; skeleton information of the human body is acquired from the portrait information collected by the visual perception system to judge, from gesture information, whether the medical staff intends to interact, or whether the person looks at the camera is judged by reading the pitch, yaw and flip angles of the head in the image. The model is trained with the Kinetics-400 data set before being put into use in the hospital. By combining gesture information and facial information, whether interaction with medical staff is needed is judged; note that recognition of facial information alone, i.e. the medical staff merely looking at the intelligent medicine delivery robot, cannot be taken as the criterion for deciding whether interaction is needed.
Specifically, after it is confirmed that interaction with the medical staff is required and that the intelligent medicine delivery robot needs to approach the medical staff, the visual servo system confirms the target as the tracked object through YOLOv4, then determines the moving speed of the robot from the distance error between the desired target position and the robot, and the intelligent medicine delivery robot stops moving when the error is 0.
Specifically, for the case when the intelligent drug delivery robot reaches the target position, the medical staff 1 can perform voice conversation through the voice interactive frame Dialogflow specific to the intelligent drug delivery robot, can select to perform simple voice conversation with the robot, and can also perform drug delivery tasks for patients with reservation of designated time.
Specifically, the complete workflow of the intelligent drug delivery robot comprises: after the intelligent medicine delivery robot 2 recognizes the special wake-up word through the far-end voice recognition system, after confirming the target interaction object, the intelligent medicine delivery robot is continuously close to the target object, after approaching the target position, a specific medicine delivery task is obtained through dialogue with medical staff, and the specific medicine delivery task is converted into a robot executable instruction, so that the medicine delivery task is completed; in another case, when the intelligent drug delivery robot 2 recognizes the face information of the medical staff (see fig. 4, 234, and use yolov3 for multi-person detection), it finds that the angles of eyes of the medical staff looking at the camera reach the threshold value, and after recognizing the gesture information of the medical staff, it can determine to interact with the medical staff, and by performing voice conversation with the medical staff, a drug delivery task can be obtained, and after converting into a robot executable instruction, the robot will start to execute the drug delivery task.
Fig. 3 is a flowchart of a remote speech recognition system provided by an example of the present disclosure, where the remote speech interaction system of fig. 3 is used to obtain multi-modal information over a longer distance. The workflow of the far-end speech recognition system comprises:
the voice input 221 is formed by the microphone voice array sensors of the 3 Kinects on the intelligent drug delivery robot; each Kinect sensor is provided with four microphones, so voice information can be acquired, and the voice input 221 can be used for detecting the wake-up words of the intelligent drug delivery robot and also for detecting content in the Dialogflow-based HRI voice chat framework. A far-end voice recognition algorithm is specifically adopted: a real-time 3D audio localization system based on the steered response power with phase transform (SRP-PHAT) algorithm, which is robust to noise and errors;
distant speech recognition (DSR) uses the 3 Kinect microphone arrays distributed in space (K1, K2, K3) to eliminate reverberation and noise, and delay-and-sum beamforming with the 4 microphones of each Kinect improves recognition in noisy and reverberant environments;
the source voice signal preprocessing 222, to cope with the problems of noise, reverberation and long distance, performs delay processing on the voice signals the intelligent robot acquires in the hospital, filters out unnecessary surrounding information and noise, and transforms the voice signals;
Feature extraction 223, extracting a voice signal containing key words in a hospital environment where the intelligent drug delivery robot is located;
the acoustic model, the language model and the dictionary form the recognition network 224; the acoustic model needs to compare the recognized voice signal with the target voice signal, and the language model needs to calculate the probability of occurrence of a sentence from that result and must contain a set of words and phonemes such as those for hospital medicine delivery;
the speech decoding 225, which needs to integrate the trained acoustic model and language model and the recognition system of dictionary synthesis, searches out the most reasonable path;
text output 226 translates the speech decoded information into text that can be read.
Specifically, the acoustic model needs to be trained, and the voice signal recognized by the intelligent drug delivery robot is compared with the existing target voice signal to obtain the recognition result. A language model is then built according to the grammar and semantics of Chinese, and the vocabulary and phonemes commonly used for medicine delivery are collected to build the voice recognition system of the intelligent robot.
Specifically, combining the above information, the far-end voice recognition system of the intelligent drug delivery robot works as follows: voice signals are acquired through the microphone voice array sensors of the 3 Kinects; the problems of noise, reverberation, long distance and the like in the complex hospital environment are handled and unnecessary surrounding information and noise are removed; the voice signals are converted into mathematical signals; the voice signals required for keywords and voice dialogue are extracted and matched against the voice recognition system constructed from the trained acoustic model, language model and dictionary; the best voice recognition result is decoded; and finally the result is output to the intelligent medicine delivery robot to judge whether interaction with the medical staff is needed.
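The sentence-probability role of the language model described above can be illustrated with a simple smoothed bigram scorer; the smoothing scheme and counts here are assumptions for illustration, not the model actually trained for the hospital vocabulary:

import math

def sentence_log_prob(tokens, bigram_counts, unigram_counts, vocab_size, alpha=1.0):
    # Laplace-smoothed bigram score used to rank candidate transcriptions
    logp = 0.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        num = bigram_counts.get((prev, cur), 0) + alpha
        den = unigram_counts.get(prev, 0) + alpha * vocab_size
        logp += math.log(num / den)
    return logp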
Fig. 4 is a workflow diagram of the visual perception system, which the intelligent drug delivery robot mainly uses to judge whether interaction with medical staff is needed; in this system each individual medical staff member is detected with a YOLOv3 detector, and the workflow comprises:
gesture recognition 231, acquiring a portrait photo of medical staff through the intelligent drug delivery robot, acquiring visual information around a hospital at a speed of 30 frames per second according to two RealSense D435i cameras adopted by the intelligent drug delivery robot based on the personnel and the hospital, recognizing gesture information contained in joints of a human body, and taking a data set of Kinetics-400 as a data set of visual recognition;
the feature extraction 232 can use the open source library OpenPose to obtain the positions of the key points in the image space of the captured picture, that is, to obtain the pose estimation of the medical staff's actions;
the feature matching 233, namely, first the gesture with which medical staff wake the intelligent medicine delivery robot is set, generally raising a hand above the head and shoulder; the gesture is converted into a mathematical model, and the tested mathematical model is compared with the preset gesture mathematical model to obtain the best gesture recognition result (an illustrative sketch follows this list);
A YOLOv3 multi-person detector 234, each healthcare worker individual being detected using the YOLOv3 detector;
the head gesture detection 235, then, needs to use the OpenCV library for face detection, uses the specific angle of the head as a reference standard for eye recognition, and improves the accuracy of eye emmetropia recognition.
A head posture estimation 236 for judging whether the head of the medical staff is facing the camera by recognizing the yaw angle, pitch angle and flip angle of the head of the medical staff in the acquired image, and observing whether the medical staff is looking the camera by the horizontal angle and the vertical angle of the line of sight;
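The feature matching step 233, checking whether a hand is raised above the head and shoulders, can be illustrated on skeleton key points such as those produced by OpenPose; the key-point names below are placeholders for whatever the deployed skeleton output actually provides:

def hand_raised_above_head(keypoints):
    # keypoints: dict of body-part name -> (x, y) pixel coordinates; image y grows downward,
    # so "above" means a smaller y value
    head_y = keypoints["nose"][1]
    shoulder_y = min(keypoints["left_shoulder"][1], keypoints["right_shoulder"][1])
    wrist_y = min(keypoints["left_wrist"][1], keypoints["right_wrist"][1])
    return wrist_y < head_y and wrist_y < shoulder_y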
Specifically, the intelligent medicine delivery robot acquires portrait photos of the medical staff; according to the situation of the medical staff and the hospital, the adopted RealSense D435i cameras need to acquire visual information around the hospital at 30 frames per second, gesture information contained in the joints of the human body is recognized, and the Kinetics-400 data set is used as the data set for visual recognition. The pose estimation of the key actions in the image is recognized through the OpenPose open source library; according to the preset wake gesture of the intelligent medicine delivery robot, generally raising a hand above the head and shoulder, the gesture is converted into a mathematical model, and the tested mathematical model is compared with the preset gesture mathematical model to obtain the best gesture recognition result.
More specifically, or through the pose estimation of the recognized face, multiple individuals can be detected using face detection, which is performed with the OpenCV library. First, the yaw angle, pitch angle and flip angle of the head of the medical staff are obtained by recognition to judge whether the head faces the camera, and whether the medical staff looks straight at the camera is observed through the horizontal and vertical angles of the line of sight; that is, the yaw angle (α), pitch angle (β) and flip angle (γ) represent the angles of the head pose, and the judgment formula is as follows:

f(α, β, γ) = 1 if |α| ≤ α′, |β| ≤ β′ and |γ| ≤ γ′, and f(α, β, γ) = 0 otherwise   (1)

wherein α′, β′, γ′ are the thresholds of the yaw angle, pitch angle and flip angle;
more specifically, or through the gesture estimation of the recognition face, a plurality of individuals can be detected by using face detection, the face detection needs to be performed by using an OpenCV library, firstly, whether the head of a medical care personnel faces a camera or not is judged by recognizing and acquiring the yaw angle, the pitch angle and the turnover angle of the head of the medical care personnel, and whether the medical care personnel looks at the camera or not is observed through the horizontal angle and the vertical angle of the sight, namely, the yaw angle (alpha), the pitch angle (beta) and the turnover angle (gamma) represent the angles of the head gesture, and the judgment formula is as follows:
wherein α ', β ', γ ' are thresholds of yaw angle, pitch angle and flip angle;
More specifically, the reliability of recognizing whether the eyes look straight at the camera is improved by using the specific angle of the head; a gaze tracker is adopted to receive the image input from the face detector, and the gaze tracker consists of a convolutional neural network (CNN). Head pose recognition and eye gaze recognition can be used as two groups of recognition to judge whether the medical staff is gazing at the drug delivery robot; if both the head pose recognition result and the eye front-view result are judged to be true, the medical staff is gazing at the intelligent drug delivery robot and may be indicating an intention to interact:

g(θ_α, θ_β) = 1 if θ_α ≤ θ′_α and θ_β ≤ θ′_β, and g(θ_α, θ_β) = 0 otherwise   (2)

wherein θ_α, θ_β are the horizontal and vertical gaze angles and θ′_α, θ′_β are their upper bounds.
Then, specifically, when the medical staff only gazes at the intelligent medicine delivery robot, it cannot be fully determined that the medical staff will interact with the intelligent medicine delivery robot; only when the intelligent medicine delivery robot also recognizes the medical staff's gesture is it judged that the medical staff and the intelligent medicine delivery robot will interact.
Fig. 5 illustrates the visual-servo-based tracking method, which is used for quickly approaching the medical staff after the intelligent drug delivery robot has decided to interact with them.
The centroid position of the object, the length and width of the bounding box, and the matching probability are obtained using YOLOv4. The obtained ID has a corresponding category, and the method for confirming whether an object is the object to be tracked specifically comprises the following:
by adopting the tracking method based on the mixed model, stable tracking can be realized by tracking the edge features, the key point features and the depth features of the model, and the gesture, the position and the motion of the target person can be predicted by moving the key points and the depth features of the edge and color depth camera and combining with establishing an interaction matrix in the space where the target person is located.
The visual servo tracking method mainly comprises the following steps:
initializing a camera model and an initial posture 241, and calculating a parameter model of the camera and a 3-dimensional coordinate point of the identified tracked medical staff;
firstly, determining whether medical staff acquired in an image acquired by an intelligent drug delivery robot is a tracked object or not through YOLOv 4;
extracting the trace object features 242, mapping the image feature information of the trace medical staff and the feature information of the image of the target into the same space for comparison, and presetting the image feature information of the target in advance;
calculating whether the difference between the current feature and the expected feature is 0 (243), namely comparing the matrix feature information in the image with the preset target state feature information and calculating the desired error between them;
The target state and the current state matrix can be calculated to obtain a matrix error between the target state and the current state matrix, and the matrix error can be expressed as a desired error;
First, matrices E_e, E_p and E_d are established to represent, respectively, the point-to-line relationships of the edge features, key point features and depth features in the image sequence, and interaction matrices L_e, L_p and L_d of the edge feature, key point feature and depth feature representations in 3D space are established respectively to estimate the pose of the target medical staff.
A method based on virtual visual servoing is adopted to calculate the pose of the target medical staff. The method is a numerical method for solving a full nonlinear optimization problem. It estimates the object pose of the target medical staff by minimizing the error Δ between the expected state s* and the current state s; the error, defined below, is driven to decrease exponentially (ė = -λe):
e = (s(r) - s*)   (3)
where r is the pose estimate and λ is a positive scalar.
The interaction matrix is then used to relate the error change to the virtual camera speed v as follows:
ė = L_s v   (4)
where L_s is an interaction matrix that depends on the image features s and the corresponding depth values z in the scene. From equations (3) and (4), by comparing the features of the expected state s* and the current state s at each iteration, the virtual camera speed v can be obtained:
v = -λ L_s^+ (s(r) - s*)   (6)
where L_s^+ is the pseudo-inverse of L_s ∈ R^(2n×6), v ∈ R^(6×1), s(r) - s* ∈ R^(2n×1), and n represents the number of feature points.
Thus, the pose of the kth iteration can be updated using the camera speed obtained above. Δt represents the interval between the kth and (k+1)th iterations, during which the six-dimensional velocity is transformed into a four-dimensional homogeneous matrix by the Λ operation and an exponential mapping is applied to update the pose.
Generally, the larger the expected error, the greater the virtual speed of the camera;
updating the target matrix 245, and updating the pose state of the current medical staff again until the expected error is 0;
Specifically, when a target medical staff member needs to be tracked, whether the medical staff captured in the image acquired by the intelligent medicine delivery robot is the tracked object is first determined through YOLOv4; after the parameter model of the camera and the 3-dimensional coordinate points of the identified tracked medical staff are calculated, the difference between the expected target pose and the current target pose can be compared and the current virtual camera speed calculated; the current pose state is then updated and the difference between the expected pose and the target pose is compared again; if the difference is 0, the intelligent medicine delivery robot stops tracking.
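The Λ operation and exponential mapping used in the pose update can be sketched as follows, assuming the virtual camera velocity v is a 6-vector with the translational part first and the rotational part second (this ordering is an assumption made for the example) and using the matrix exponential of the corresponding 4x4 twist matrix:

import numpy as np
from scipy.linalg import expm

def hat(v):
    # Λ operation: 6-vector twist -> 4x4 se(3) matrix
    wx, wy, wz = v[3:]
    skew = np.array([[0.0, -wz,  wy],
                     [ wz, 0.0, -wx],
                     [-wy,  wx, 0.0]])
    T = np.zeros((4, 4))
    T[:3, :3] = skew
    T[:3, 3] = v[:3]
    return T

def update_pose(pose, v, dt):
    # left-multiply the current 4x4 pose by the exponential of the twist over dt
    return expm(hat(v * dt)) @ pose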
FIG. 6 is a flow chart of the use of Dialogflow, which the intelligent drug delivery robot mainly uses for voice dialogues or navigation tasks with medical personnel. Before Dialogflow is used, the following steps are mainly included:
constructing a Dialogflow agent 251, namely first creating a cloud account and, within that account, creating the platform on which voice dialogue training is carried out;
creating intents 252, namely, according to the hospital scene, the intelligent drug delivery robot can set topics such as calling the robot, scheduled time, scheduled drug delivery tasks and room number information as the topics of voice interaction.
Specifically, it is necessary to create speaking contents commonly used by medical staff as voice contents of input and set contents of answers of the intelligent drug delivery robot;
adding keyword entities 253, the keyword entity being used to store the value of a keyword acquired by the intelligent drug delivery robot, used for voice recognition under the Dialogflow framework and also for answering medical staff;
integration into the robot platform 254, namely the process by which Dialogflow establishes communication with the robot's computer system;
specifically, after the voice dialogue functions of the Dialogflow are tested to be free of problems, the generated JSON file is used for communicating with a central processing system of the intelligent drug delivery robot;
Specifically, Google Home smart speakers are used as the hardware device for voice interaction of the intelligent drug delivery robot, and an NVIDIA Jetson AGX Xavier and a Linux industrial personal computer are used as the hardware of the central computer processing system.
Specifically, Dialogflow is an agent platform in the cloud through which the voice dialogue framework is trained; dialogues with voice responses can be created by means of intents and filled with the content of the common intelligent drug delivery process. Keywords in the chat voice dialogue are enriched by creating entities; after testing is completed, a JSON file can be generated and integrated on the voice platform of the intelligent drug delivery robot, realizing the voice interaction and navigation functions.
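Once the agent's JSON output reaches the central processing system, mapping it to an executable delivery task can be sketched as below; the intent name and parameter keys are assumptions about how the agent was configured, while the top-level queryResult fields follow the Dialogflow v2 response format:

import json

def parse_delivery_task(result_json):
    result = json.loads(result_json)["queryResult"]
    if result["intent"]["displayName"] != "schedule_drug_delivery":   # assumed intent name
        return None
    params = result.get("parameters", {})
    return {
        "room": params.get("room_number"),       # assumed parameter keys
        "time": params.get("delivery_time"),
        "reply": result.get("fulfillmentText", ""),
    }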
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations to the invention without departing from the principles thereof are intended to be within the scope of the invention as set forth in the following claims.
Compared with the prior art, the invention has the following breakthrough or advantages:
A voice dialogue function is added. Most existing drug delivery robots have a low degree of intelligence and cannot execute drug delivery tasks effectively and efficiently through a single-modality interaction mode alone: such a robot can only broadcast voice or perform simple interaction through an interactive display screen. Existing robots therefore suffer from a low intelligence level and weak interaction capability and are insufficient to cope with the complex working environment of a hospital.
Therefore, a man-machine co-fusion method of the multi-mode intelligent drug delivery robot is constructed, and a mode based on voice dialogue and visual recognition is provided for solving the problem of weak interaction of the single-mode robot and in an interaction mode which is more relevant to human thinking. The intelligent level and the safety of the intelligent robot are improved.
After the intelligent medicine delivery robot 2 recognizes the special wake-up word through the far-end voice recognition system, after confirming the target interaction object, the intelligent medicine delivery robot is continuously close to the target object, after approaching the target position, a specific medicine delivery task is obtained through dialogue with medical staff, and the specific medicine delivery task is converted into a robot executable instruction, so that the medicine delivery task is completed; in another case, when the intelligent medicine delivery robot 2 recognizes the face information of the medical staff, after finding that the angles of eyes of the medical staff looking at the camera reach the threshold value, and recognizing the gesture information of the medical staff, the robot can determine to interact with the medical staff, through voice conversation with the medical staff, a medicine delivery task can be obtained, and after the voice conversation is converted into an executable instruction of the robot, the robot starts to execute the medicine delivery task.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.

Claims (10)

1. The multi-mode robot man-machine co-fusion method is characterized by comprising the following steps of:
performing multi-mode sensing and judging whether interaction is performed or not;
after judging to interact, executing an instruction approaching to medical staff;
after the instruction approaching the medical staff is completed, voice interaction and task allocation are performed.
2. The method of claim 1, wherein performing multi-modal sensing to determine whether to interact comprises:
after the special wake-up word is identified by the far-end voice recognition system, the target interaction object is confirmed;
or when the face information of the medical staff is identified, determining to interact with the medical staff after finding that the angles of eyes of the medical staff, which watch the camera, reach the threshold value and the gesture information of the medical staff is identified.
3. The multi-modal robotic co-fusion method of claim 2, wherein the workflow of the far-end speech recognition system comprises:
voice input, namely acquiring voice information by using the microphone voice array sensors of 3 Kinects installed on the robot, and detecting the wake-up words of the robot;
Preprocessing a source voice signal, performing delay processing on the voice signal obtained by the intelligent robot from a hospital, filtering invalid information and noise, and performing conversion processing on the voice signal;
extracting features, namely extracting voice signals containing key words in a hospital environment where the robot is located;
the acoustic model, the language model and the dictionary form a recognition network; the acoustic model compares the recognized voice signal with the target voice signal, and the language model calculates the probability of occurrence of a sentence and comprises a set of words and phonemes;
the method comprises the steps of voice decoding, integrating an acoustic model and a language model which are already trained and a recognition system of dictionary synthesis, and searching out the most reasonable path;
and outputting text, and translating the voice decoded information into characters.
4. The method of claim 1, wherein when face information of a medical staff is identified, determining to interact with the medical staff after finding that angles of eyes of the medical staff looking at the camera reach a threshold value and gesture information of the medical staff is identified, comprising:
gesture recognition, namely acquiring a portrait photo of a medical staff through a robot, acquiring visual information around a hospital according to two cameras based on the staff and the hospital occasion, and recognizing gesture information contained in joints of a human body;
Extracting features, namely acquiring the positions of key points in an image space contained in a shot picture by using an open source library openpore, namely acquiring the gesture estimation of the actions of medical staff;
feature matching, namely setting gestures for medical staff to wake up the robot, converting the gestures into mathematical models, and comparing the tested mathematical models with the set gesture mathematical models to obtain an optimal gesture recognition result;
a YOLO multi-person detector, each healthcare worker individual being detected using the YOLO detector;
head gesture detection, using an OpenCV library to perform face detection, and using a specific angle of the head as a reference standard for eye recognition;
and estimating the head posture, judging whether the head of the medical staff faces the camera or not by identifying the yaw angle, the pitch angle and the turnover angle of the head of the medical staff in the acquired image, and observing whether the medical staff looks at the camera or not by the horizontal angle and the vertical angle of the sight line.
5. The method of claim 4, wherein when face information of the medical staff is identified, determining to interact with the medical staff after finding that angles of eyes of the medical staff looking at the camera reach a threshold value and gesture information of the medical staff is identified, further comprising:
Through the gesture estimation of the identification face, face detection is used for detecting a plurality of individuals, the OpenCV library is utilized for face detection, firstly, whether the head of a medical care worker faces a camera or not is judged through the identification of the yaw angle, the pitch angle and the turnover angle of the head of the medical care worker, and whether the medical care worker looks at the camera or not is observed through the horizontal angle and the vertical angle of the sight, namely, the yaw angle alpha, the pitch angle beta and the turnover angle gamma represent the angles of the head gesture, and a judgment formula is shown as follows:
wherein α ', β ', γ ' are thresholds of yaw angle, pitch angle and flip angle; a value of 1 for function f () indicates yes, and a value of 0 indicates no;
reliability is improved by combining the specific head angle with recognition of the eyes' front view; a gaze tracker, consisting of a convolutional neural network, receives the image output by the face detector; head pose recognition and eye gaze recognition are combined as two groups of recognition to judge whether the medical staff is gazing at the drug delivery robot, and if both the head pose result and the eye front-view result are true, the medical staff's intention to interact is indicated; the corresponding judgment formula is:
g(θα, θβ) = 1 if θα < θ′α and θβ < θ′β, otherwise g(θα, θβ) = 0
wherein a value of 1 for the function g(·) indicates yes and a value of 0 indicates no, θα and θβ are the horizontal and vertical gaze angles, and θ′α and θ′β are their upper bounds, as sketched after this claim.
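The two judgment functions f(·) and g(·) and their combination can be written directly; the numeric thresholds passed in below are illustrative assumptions, not values from the patent.

```python
# Direct implementation of the two judgment functions; thresholds are assumptions.
def f(alpha, beta, gamma, alpha_t, beta_t, gamma_t):
    """Head-pose check: 1 if yaw, pitch and roll are all within their thresholds."""
    return int(abs(alpha) < alpha_t and abs(beta) < beta_t and abs(gamma) < gamma_t)

def g(theta_a, theta_b, theta_a_max, theta_b_max):
    """Gaze check: 1 if the horizontal and vertical gaze angles are within their upper bounds."""
    return int(abs(theta_a) < theta_a_max and abs(theta_b) < theta_b_max)

def interaction_intent(head_angles, gaze_angles, head_thr=(20, 20, 20), gaze_thr=(15, 15)):
    """Intention to interact is indicated only when both results are true."""
    return f(*head_angles, *head_thr) == 1 and g(*gaze_angles, *gaze_thr) == 1
```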
6. The method of claim 5, wherein the interaction between the healthcare worker and the robot is not yet fully determined when the healthcare worker merely looks at the robot; the interaction is determined only when the robot also recognizes the gesture of the healthcare worker, as combined in the sketch below.
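A minimal sketch of the claim 6 decision rule, assuming the gaze and gesture results are already available as booleans.

```python
# Claim 6 decision rule: gazing alone is not enough, the wake-up gesture
# must also be recognized before interaction is confirmed.
def confirm_interaction(is_gazing_at_robot: bool, wake_gesture_recognized: bool) -> bool:
    return is_gazing_at_robot and wake_gesture_recognized
```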
7. The method of claim 1, wherein executing the instruction to approach the healthcare worker after it is determined that interaction will be performed comprises:
determining, by the YOLOv detector, whether the medical staff appearing in the images acquired by the robot is the tracked object,
after calculating the parameter model of the camera and the coordinate points of the dimensions of the identified tracked medical staff, comparing the difference between the expected target pose and the current target pose and calculating the current virtual camera speed,
and updating the current pose state, comparing the difference between the expected pose and the target pose again, and stopping tracking if the difference is zero, as outlined in the sketch following this claim.
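The tracking loop of claim 7 can be sketched as follows. Only the control flow of compare, update and stop-at-zero-difference is shown; the detector, pose estimator and velocity law are assumed placeholders supplied elsewhere (e.g. a YOLO detector and the virtual-visual-servoing law of claim 9).

```python
# Control-flow sketch of the claim 7 tracking loop; the four callables are assumptions.
import numpy as np

def follow_worker(detect_person, estimate_pose, compute_velocity, apply_velocity,
                  desired_pose, tol=1e-3, max_iter=500):
    for _ in range(max_iter):
        detection = detect_person()                 # confirm the tracked medical staff
        if detection is None:
            break                                   # tracked person lost
        current_pose = estimate_pose(detection)     # current target pose
        diff = np.asarray(desired_pose) - np.asarray(current_pose)
        if np.linalg.norm(diff) < tol:              # difference is (numerically) zero
            break                                   # -> stop tracking
        v = compute_velocity(current_pose, desired_pose)
        apply_velocity(v)                           # move the virtual camera / robot
```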
8. The method of claim 7, wherein calculating the current virtual camera speed by comparing the expected target pose with the current target pose, after calculating the parameter model of the camera and the coordinate points of the dimensions of the identified tracked healthcare worker, comprises:
extracting tracker features: mapping the image feature information of the tracked medical staff and the feature information of the target image into the same space for comparison, the image feature information of the target having been preset in advance;
and calculating whether the difference between the current feature and the expected feature is zero: the matrix feature information of the current image is compared with the preset target-state feature information, and the error between them is computed, as in the sketch following this claim.
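A small sketch of the feature-comparison step, assuming the tracked person and the preset target are both represented by equally-sized feature matrices already mapped into the same space.

```python
# Sketch of the feature-comparison step; equally-sized feature matrices are assumed.
import numpy as np

def feature_error(current_features, target_features):
    """Return the error between current and preset target features and a
    cosine similarity that can be used to check the identity of the target."""
    cur = np.asarray(current_features, dtype=float).ravel()
    tgt = np.asarray(target_features, dtype=float).ravel()
    error = cur - tgt                                   # zero when the states coincide
    cosine = float(cur @ tgt / (np.linalg.norm(cur) * np.linalg.norm(tgt) + 1e-12))
    return error, cosine

# Usage: tracking continues while the error norm is above a tolerance;
# the 0.8 identity threshold is an illustrative assumption.
# err, sim = feature_error(current, preset_target)
# same_person = sim > 0.8
# finished = np.linalg.norm(err) < 1e-3
```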
9. The method of claim 1, wherein updating the current pose state, comparing the difference between the desired pose and the target pose again, and stopping tracking when the difference is zero comprises:
calculating a target state and a current state matrix to obtain a matrix error between the target state and the current state matrix, and representing the matrix error as a desired error;
first, matrices Ee, Ep and Ed are established to represent, respectively, the point-to-line relationships of the edge features, key-point features and depth features in the image sequence, and interaction matrices Le, Lp and Ld of the edge-feature, key-point-feature and depth-feature representations in 3D space are established, respectively, to estimate the pose of the target healthcare worker;
the pose of the target medical staff is calculated by a virtual-visual-servoing-based method, which estimates the pose of the target medical staff by minimizing the error Δ between the expected state s* and the current state s; the error, which is made to decrease exponentially, is defined as:
e = (s(r) - s*)   (3)
where r is the pose estimate, the exponential decrease being imposed as ė = -λe with λ a positive scalar;
the interaction matrix is then used to relate the change of the error to the virtual camera speed v as follows:
ė = Ls·v   (4)
in which Ls is an interaction matrix that depends on the image features s and the corresponding depth values z of the scene; from equations (3) and (4), the virtual camera speed v of equation (6) is obtained at each iteration by comparing the expected state s* with the features of the current state s:
v = -λ Ls⁺ (s(r) - s*)   (6)
wherein Ls⁺ is the pseudo-inverse of Ls ∈ R^(2n×6), v ∈ R^(6×1), s(r) - s* ∈ R^(2n×1), and n represents the number of feature points;
thus, the pose at the kth iteration is updated using the velocity v obtained above; Δt denotes the interval between the kth and the (k+1)th iteration, the six-dimensional velocity vector is transformed into a four-dimensional matrix by the ∧ operation, and the exponential mapping is applied as follows:
ΔT = exp(Δt·v∧)
wherein v ∈ R^(6×1) and v∧ ∈ R^(4×4);
and updating the target matrix and the pose state of the current medical staff again until the expected error is zero, as in the sketch following this claim.
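One virtual-visual-servoing iteration in the spirit of equations (3)-(6) can be sketched with NumPy and SciPy; the gain, the time step and the way the interaction matrices Le, Lp, Ld are stacked into Ls are assumptions.

```python
# One virtual-visual-servoing iteration; gain, time step and the stacked
# interaction matrix supplied by the caller are assumptions.
import numpy as np
from scipy.linalg import expm

def wedge(v):
    """The ∧ operation: map a 6-D twist (vx, vy, vz, wx, wy, wz) to a 4x4 matrix."""
    vx, vy, vz, wx, wy, wz = v
    return np.array([[0.0, -wz,  wy, vx],
                     [ wz, 0.0, -wx, vy],
                     [-wy,  wx, 0.0, vz],
                     [0.0, 0.0, 0.0, 0.0]])

def vvs_step(T, s_current, s_desired, L_s, lam=0.5, dt=1.0):
    """T: current 4x4 pose; L_s: stacked (2n x 6) interaction matrix."""
    e = s_current - s_desired                      # equation (3): e = s(r) - s*
    v = -lam * np.linalg.pinv(L_s) @ e             # equation (6): virtual camera speed
    T_next = T @ expm(wedge(v) * dt)               # exponential map of dt * v∧
    return T_next, e

# Usage sketch: starting from T = np.eye(4), call vvs_step repeatedly until
# np.linalg.norm(e) falls below a tolerance (the "expected error is zero" condition).
```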
10. The method of claim 1, wherein, after executing the instruction to approach the healthcare worker, performing voice interaction and task allocation comprises:
after testing the voice dialogue function of Dialogflow, communicating with the central processing system of the robot by using the generated JSON file, wherein this process comprises the following steps:
constructing a Dialogflow agent: creating a cloud account and setting up, within the account, the platform on which the voice dialogue training is created;
creating intents: according to the hospital scenario, the robot sets content-related topics as the topics of voice interaction; the content-related topics include calling the robot, preset times, scheduled drug delivery tasks and set room number information;
creating utterances commonly used by medical staff as the input voice content and setting the content of the robot's responses;
adding keyword entities, wherein a keyword entity stores the value of a keyword acquired by the robot, is used for voice recognition under the Dialogflow framework, and is also used for answering the medical staff;
and integration into the robot platform, which is the process that enables Dialogflow to establish communication with the computer system of the robot, as in the sketch following this claim.
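A hedged sketch of the Dialogflow integration step using the google-cloud-dialogflow client library; the project ID, session handling and the forwarding of the result to the robot's central processing system are assumptions for illustration, and valid Google Cloud credentials are assumed to be configured.

```python
# Hedged sketch of the Dialogflow integration; project/session handling and
# the downstream forwarding to the robot's central system are assumptions.
from google.cloud import dialogflow

def ask_agent(project_id: str, session_id: str, text: str, language: str = "zh-CN"):
    """Send one utterance from the medical staff to the Dialogflow agent and
    return what the robot's central processing system needs."""
    client = dialogflow.SessionsClient()
    session = client.session_path(project_id, session_id)
    query_input = dialogflow.QueryInput(
        text=dialogflow.TextInput(text=text, language_code=language))
    response = client.detect_intent(
        request={"session": session, "query_input": query_input})
    result = response.query_result
    # result.parameters carries the keyword-entity values (e.g. a room number);
    # the matched intent and the reply text drive the robot's task allocation.
    return result.intent.display_name, result.fulfillment_text, result.parameters

# The returned fields can be serialized (e.g. to JSON) and forwarded to the
# robot's central processing system to schedule a drug delivery task.
```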
CN202311114838.9A 2023-08-31 2023-08-31 Multi-mode robot man-machine co-fusion method Pending CN117162118A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311114838.9A CN117162118A (en) 2023-08-31 2023-08-31 Multi-mode robot man-machine co-fusion method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311114838.9A CN117162118A (en) 2023-08-31 2023-08-31 Multi-mode robot man-machine co-fusion method

Publications (1)

Publication Number Publication Date
CN117162118A true CN117162118A (en) 2023-12-05

Family

ID=88944195

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311114838.9A Pending CN117162118A (en) 2023-08-31 2023-08-31 Multi-mode robot man-machine co-fusion method

Country Status (1)

Country Link
CN (1) CN117162118A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination